scmiracle.data
- class scmiracle.data.BasicModDataset
Bases: Dataset
Base class for modality data.
- __getitem__(idx: int) -> Any
Retrieve the data item at the specified index (not implemented in base class).
- Parameters:
idx (int) – The index of the data item.
- __len__() -> int
Return the number of samples in the dataset.
- Returns:
Number of samples.
- Return type:
int
- __subset__(indices) -> BasicModDataset
Create a subset of the dataset based on the provided indices.
- Parameters:
indices (list) – List of indices to include in the subset.
- Returns:
A new dataset instance containing only the specified indices.
- Return type:
BasicModDataset
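The three methods above form the interface that the concrete datasets implement. A minimal sketch of that contract follows, using a hypothetical in-memory class (`InMemoryModDataset` is not part of scmiracle and does not inherit from the real base class); it only illustrates how `__subset__` is expected to return a new instance restricted to the given indices.

```python
from typing import Any, List, Sequence


class InMemoryModDataset:
    """Hypothetical stand-in that mirrors the BasicModDataset interface."""

    def __init__(self, rows: Sequence[Any]):
        self.rows = list(rows)

    def __getitem__(self, idx: int) -> Any:
        # The base class leaves this unimplemented; subclasses return one sample.
        return self.rows[idx]

    def __len__(self) -> int:
        return len(self.rows)

    def __subset__(self, indices: List[int]) -> "InMemoryModDataset":
        # Build a new instance holding only the requested samples.
        return InMemoryModDataset([self.rows[i] for i in indices])


full = InMemoryModDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
part = full.__subset__([0, 2])  # keeps the first and third samples
```

The original dataset is left unchanged, so a subset can be taken repeatedly from the same source.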
- class scmiracle.data.CSVDataset(csv_file: str, real_dims: list = None, expected_dims: list = None)
Bases: BasicModDataset
Dataset for CSV-based data.
- Parameters:
csv_file (str) – Path to the CSV or compressed CSV file (.csv.gz).
real_dims (list, optional) – A list of integers representing the actual dimensions of the data. Used for padding if it differs from expected_dims.
expected_dims (list, optional) – A list of integers representing the expected dimensions of the data after padding.
- __getitem__(idx: int) -> ndarray
Retrieve the matrix row at the specified index. The data is already padded during initialization.
- Parameters:
idx (int) – The index of the matrix row.
- Returns:
The matrix row as a NumPy array.
- Return type:
np.ndarray
- __len__() -> int
Return the number of rows in the matrix dataset.
- Returns:
Number of rows in the dataset.
- Return type:
int
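The reference does not say how `real_dims` and `expected_dims` drive the padding. The sketch below shows one plausible reading, stated as an assumption: each list entry is the width of a contiguous feature segment, and every segment is right-padded with zeros until it reaches its expected width. The helper `pad_segments` is hypothetical and is not a scmiracle function.

```python
import numpy as np


def pad_segments(row: np.ndarray, real_dims: list, expected_dims: list) -> np.ndarray:
    """Zero-pad each feature segment of `row` from its real to its expected width."""
    if sum(real_dims) != row.shape[0]:
        raise ValueError("real_dims must sum to the row length")
    padded, start = [], 0
    for real, expected in zip(real_dims, expected_dims):
        if expected < real:
            raise ValueError("expected_dims must be >= real_dims")
        segment = row[start:start + real]
        # Append zeros on the right so the segment reaches the expected width.
        padded.append(np.pad(segment, (0, expected - real)))
        start += real
    return np.concatenate(padded)


row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
out = pad_segments(row, real_dims=[2, 3], expected_dims=[3, 4])
```

Under this assumption a row of length 5 becomes a row of length 7, which matches the statement that rows are already padded when `__getitem__` returns them.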
- class scmiracle.data.MTXDataset(mtx_file: str, real_dims: list = None, expected_dims: list = None)
Bases: BasicModDataset
Dataset for MTX-based data.
- Parameters:
mtx_file (str) – Path to the MTX file.
real_dims (list, optional) – A list of integers representing the actual dimensions of the data. Used for padding if it differs from expected_dims.
expected_dims (list, optional) – A list of integers representing the expected dimensions of the data after padding.
- __getitem__(idx: int) -> ndarray
Retrieve the matrix row at the specified index. The data is already padded during initialization.
- Parameters:
idx (int) – The index of the matrix row.
- Returns:
The matrix row as a NumPy array.
- Return type:
np.ndarray
- __len__() -> int
Return the number of rows in the matrix dataset.
- Returns:
Number of rows in the dataset.
- Return type:
int
- class scmiracle.data.MultiBatchContinualLearningSampler(data_source: Dataset, shuffle: bool = True, batch_size: int = 1, n_current_datasets: int = 0, n_replay_datasets: int = 0, n_max=10000)
Bases: Sampler
Custom sampler for multi-batch sampling across multiple datasets with continual learning logic.
- Parameters:
data_source (Dataset) – Concatenated dataset containing replay and current data.
shuffle (bool) – Whether to shuffle the samples within each dataset for replay, default is True.
batch_size (int) – Number of samples per batch (for both current and replay), default is 1.
n_current_datasets (int) – Number of datasets designated as ‘current’ datasets.
n_replay_datasets (int) – Number of datasets designated as ‘replay’ datasets.
- __iter__() -> Iterator[int]
Iterate over the dataset indices in a multi-batch continual learning manner, alternating between current and replay datasets.
- Returns:
An iterator over sampled indices.
- Return type:
Iterator[int]
- __len__() -> int
Calculate the total number of samples drawn across all sub-datasets under the sampling strategy. A “full cycle” iterates through all current datasets once, potentially interleaved with replay batches, so the length is driven primarily by the need to process all ‘current’ data.
- Returns:
The total number of samples.
- Return type:
int
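The alternation between current and replay data can be pictured with a small pure-Python sketch. It is not the library's implementation: the function name `alternating_batches`, the choice to follow every current batch with exactly one replay batch, and the random choice of replay dataset are all assumptions made for illustration. Index ranges stand in for the sub-datasets of the concatenated `data_source`.

```python
import random
from typing import Iterator, List


def alternating_batches(current: List[range], replay: List[range],
                        batch_size: int, seed: int = 0) -> Iterator[List[int]]:
    """Yield one batch per step, alternating current and replay data."""
    rng = random.Random(seed)
    # Split every current dataset into consecutive batches of indices.
    current_batches = [
        list(r[i:i + batch_size])
        for r in current
        for i in range(0, len(r), batch_size)
    ]
    for batch in current_batches:
        yield batch
        if replay:
            # Follow each current batch with a random batch from one replay dataset.
            source = rng.choice(replay)
            yield rng.sample(source, min(batch_size, len(source)))


# Two replay datasets occupy indices 0-5; one current dataset occupies 6-9.
batches = list(alternating_batches(
    current=[range(6, 10)], replay=[range(0, 3), range(3, 6)], batch_size=2))
```

In this sketch the number of batches is fixed by the current data alone, which is consistent with the description of `__len__` above. A real sampler would emit the indices as a flat stream rather than as lists.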
- class scmiracle.data.MultiBatchSampler(data_source: Dataset, shuffle: bool = True, batch_size: int = 1, n_max: int = 10000)
Bases: Sampler
Custom sampler for multi-batch sampling across multiple datasets.
- Parameters:
data_source (Dataset) – The dataset to sample from.
shuffle (bool) – Whether to shuffle the samples within each dataset, default is True.
batch_size (int) – Number of samples per batch, default is 1.
n_max (int) – Maximum number of samples to draw from each dataset, default is 10000.
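How `shuffle`, `batch_size`, and `n_max` might interact is sketched below. This is an illustration under assumptions, not the library's code: it assumes `data_source` is a concatenation of sub-datasets with known sizes, that each batch is drawn from a single sub-dataset, and that `n_max` truncates the draw per sub-dataset. The function `multi_batch_indices` is hypothetical.

```python
import random
from typing import List


def multi_batch_indices(sizes: List[int], batch_size: int, n_max: int,
                        shuffle: bool = True, seed: int = 0) -> List[List[int]]:
    """Return batches of global indices, each batch drawn from a single dataset."""
    rng = random.Random(seed)
    batches, offset = [], 0
    for size in sizes:
        local = list(range(offset, offset + size))
        if shuffle:
            rng.shuffle(local)
        local = local[:n_max]  # cap the draw from this dataset at n_max samples
        batches += [local[i:i + batch_size] for i in range(0, len(local), batch_size)]
        offset += size
    if shuffle:
        rng.shuffle(batches)  # mix the batch order across datasets
    return batches


# Two sub-datasets with 5 and 3 samples; at most 4 samples are taken from each.
batches = multi_batch_indices(sizes=[5, 3], batch_size=2, n_max=4, shuffle=False)
```

With shuffling disabled, the first sub-dataset contributes indices 0 to 3 and the second contributes 5 to 7, so no batch mixes samples from different sub-datasets.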
- class scmiracle.data.MultiModalDataset(mod_dict: Dict[str, str], mod_id_dict: Dict[str, int], file_type: Dict[str, str], mask_path: Dict[str, str] | None = None, transform: Dict[str, str] | None = None, real_dims: Dict[str, list] | None = None, expected_dims: Dict[str, list] | None = None)
Bases: Dataset
A dataset class for handling multi-modal data with optional masking and transformations.
- Parameters:
mod_dict (Dict[str, str]) – A dictionary mapping modality names to their respective file paths.
mod_id_dict (Dict[str, int]) – A dictionary mapping modality names to their unique identifiers.
file_type (Dict[str, str]) – A dictionary mapping modality names to their file types (e.g., ‘vec’, ‘csv’, ‘mtx’).
mask_path (Optional[Dict[str, str]]) – A dictionary mapping modality names to their mask file paths, default is None.
transform (Optional[Dict[str, str]]) – A dictionary specifying transformations to apply to each modality, default is None.
- __getitem__(idx: int) -> Dict[str, Dict[str, Any]]
Retrieves the data at the specified index across all modalities.
- Parameters:
idx (int) – The index of the sample to retrieve.
- Returns:
A dictionary containing the following keys:
- 'x': Modality data at the given index, with optional transformations applied.
- 's': Modality IDs.
- 'e': Masking information, if available.
- Return type:
Dict[str, Dict[str, Any]]
- __len__() -> int
Returns the size of the dataset.
- Returns:
The number of samples in the dataset.
- Return type:
int
- __subset__(indices) -> MultiModalDataset
Create a subset of the multi-modal dataset based on the provided indices.
- Parameters:
indices (list) – List of indices to include in the subset.
- Returns:
A new MultiModalDataset instance containing only the specified indices.
- Return type:
MultiModalDataset
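The nested return type of `__getitem__` is easier to read with a concrete value. The sketch below assembles one sample in the documented 'x' / 's' / 'e' layout; it assumes the inner dictionaries are keyed by modality name, which the reference does not state. The helper `build_sample`, the modality names, and the file paths are placeholders, not scmiracle names.

```python
from typing import Any, Dict, Optional


def build_sample(x: Dict[str, Any], mod_id_dict: Dict[str, int],
                 mask: Optional[Dict[str, Any]] = None) -> Dict[str, Dict[str, Any]]:
    """Assemble one sample in the documented {'x', 's', 'e'} layout."""
    sample = {"x": dict(x), "s": {m: mod_id_dict[m] for m in x}}
    if mask is not None:
        sample["e"] = dict(mask)  # masking information is optional
    return sample


# Hypothetical constructor-style arguments for a two-modality dataset.
mod_dict = {"rna": "data/rna.csv", "adt": "data/adt.csv"}  # placeholder paths
file_type = {"rna": "csv", "adt": "csv"}
mod_id_dict = {"rna": 0, "adt": 1}

sample = build_sample({"rna": [3.0, 0.0], "adt": [1.0]}, mod_id_dict,
                      mask={"rna": [1, 0]})
```

Here `mod_dict`, `file_type`, and `mod_id_dict` share the same keys, which is how the three constructor dictionaries are described as lining up by modality name.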
- class scmiracle.data.MyDistributedSampler(dataset: Dataset, num_replicas: int | None = None, rank: int | None = None, shuffle: bool = True, seed: int = 0, batch_size: int = 256, n_max: int = 10000)
Bases: DistributedSampler
A custom distributed sampler for datasets split across multiple replicas.
- Parameters:
dataset (Dataset) – The dataset to sample from.
num_replicas (Optional[int]) – Number of replicas in the distributed setup, default is determined by torch.distributed.
rank (Optional[int]) – The rank of the current process, default is determined by torch.distributed.
shuffle (bool) – Whether to shuffle the data, default is True.
seed (int) – Random seed for shuffling, default is 0.
batch_size (int) – Number of samples per batch, default is 256.
n_max (int) – Maximum number of samples per dataset, default is 10000.
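The roles of `num_replicas`, `rank`, and `seed` come from the `DistributedSampler` base class. The sketch below reproduces that partitioning idea in plain Python: a shared seed gives every process the same permutation, the index list is padded to divide evenly, and each rank takes a strided slice. It does not model how this subclass applies `batch_size` or `n_max`, and `shard_indices` is a hypothetical helper.

```python
import random
from typing import List


def shard_indices(n_samples: int, num_replicas: int, rank: int,
                  shuffle: bool = True, seed: int = 0) -> List[int]:
    """Return the indices one replica would see, in the style of DistributedSampler."""
    # Assumes n_samples >= num_replicas so one round of padding is enough.
    indices = list(range(n_samples))
    if shuffle:
        # Every rank uses the same seed, so all ranks shuffle identically.
        random.Random(seed).shuffle(indices)
    # Pad by repeating leading indices so the total divides evenly across replicas.
    remainder = len(indices) % num_replicas
    if remainder:
        indices += indices[:num_replicas - remainder]
    # Strided slicing gives each rank an equally sized share.
    return indices[rank::num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r, shuffle=False) for r in range(4)]
```

With 10 samples and 4 replicas, two indices are repeated so that every rank receives exactly 3 samples.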
- class scmiracle.data.VECDataset(path: str, real_dims: list = None, expected_dims: list = None)
Bases: BasicModDataset
Dataset for vector-based data.
- Parameters:
path (str) – Directory containing vector-based data files.
- __getitem__(idx: int) -> ndarray
Retrieve the vector data at the specified index.
- Parameters:
idx (int) – The index of the vector file.
- Returns:
The vector data as a NumPy array.
- Return type:
np.ndarray
- __len__() -> int
Return the number of files in the vector dataset.
- Returns:
Number of vector files in the dataset.
- Return type:
int
- scmiracle.data.download_data(name: str, des: str = './')
Downloads the specified dataset and extracts it.
- Parameters:
name (str) – Name of the dataset to download (e.g., ‘teadog_mosaic_4k’).
des (str) – Destination path to save the dataset (default is the current directory).
- scmiracle.data.download_file(url: str, dest_path: str)
Helper function to download a file from a URL with progress display.
- Parameters:
url (str) – URL of the file to download.
dest_path (str) – Local path where the downloaded file is saved.
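A download helper with progress display typically streams the response in chunks and reports the fraction received. The sketch below shows that pattern with the standard library only; it is not the scmiracle implementation, and the name `fetch`, the chunk size, and the progress format are assumptions.

```python
import urllib.request


def fetch(url: str, dest_path: str, chunk_size: int = 8192) -> int:
    """Stream `url` to `dest_path`, printing progress; return the bytes written."""
    written = 0
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # The total size is only known when the server reports Content-Length.
        total = int(response.headers.get("Content-Length") or 0)
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
            if total:
                print(f"\r{written / total:6.1%}", end="")
    print()
    return written
```

Reading in fixed-size chunks keeps memory use flat for large archives, and the percentage is skipped when the size is unknown.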
- scmiracle.data.download_models(name: str, des: str = './')
Downloads the specified model.
- Parameters:
name (str) – Name of the model to download (e.g., ‘wnn_mosaic_8batch_mtx’).
des (str) – Destination path to save the model (default is the current directory).