scmiracle.data


class scmiracle.data.BasicModDataset

Bases: Dataset

Base class for modality data.

__getitem__(idx: int) → Any

Retrieve the data item at the specified index (not implemented in base class).

Parameters: idx – int The index of the data item.

__len__() → int

Return the number of samples in the dataset.

Returns: Number of samples.
Return type: int

__subset__(indices) → BasicModDataset

Create a subset of the dataset based on the provided indices.

Parameters: indices – list List of indices to include in the subset.
Returns: A new dataset instance containing only the specified indices.
Return type: BasicModDataset
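
The `__subset__` contract described above can be illustrated with a small stand-in class. This is a minimal sketch, not scmiracle's implementation: `ToyModDataset` is a hypothetical name, it keeps its rows in a plain list instead of subclassing `torch.utils.data.Dataset`, and the only assumption carried over from the documentation is that `__subset__` returns a new instance of the same type holding the selected rows.

```python
from typing import Any, List, Sequence


class ToyModDataset:
    """Hypothetical stand-in for a modality dataset (not part of scmiracle)."""

    def __init__(self, rows: Sequence[Any]) -> None:
        self.rows: List[Any] = list(rows)

    def __len__(self) -> int:
        # Number of samples held by this dataset.
        return len(self.rows)

    def __getitem__(self, idx: int) -> Any:
        # Return the sample stored at position idx.
        return self.rows[idx]

    def __subset__(self, indices: Sequence[int]) -> "ToyModDataset":
        # Build a new instance of the same class from the selected rows,
        # leaving the original dataset untouched.
        return type(self)([self.rows[i] for i in indices])


full = ToyModDataset(["cell_a", "cell_b", "cell_c", "cell_d"])
part = full.__subset__([0, 2])
```

Using `type(self)` rather than a hard-coded class name means a subclass that inherits this method would get back an instance of its own type, which matches the per-class return types listed for the concrete datasets.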

class scmiracle.data.CSVDataset(csv_file: str, real_dims: list = None, expected_dims: list = None)

Bases: BasicModDataset

Dataset for CSV-based data.

Parameters:

csv_file – str Path to the CSV or compressed CSV file (csv.gz).
real_dims – list, optional A list of integers representing the actual dimensions of the data. Used for padding if it differs from expected_dims.
expected_dims – list, optional A list of integers representing the expected dimensions of the data after padding.
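
The relationship between real_dims and expected_dims can be sketched as follows. This is only an illustration under stated assumptions: the documentation does not say how padding is performed, so the sketch assumes each row is a concatenation of segments whose widths are given by real_dims, and that each segment is zero-padded on the right to the matching width in expected_dims. The function name `pad_row` is hypothetical and not part of scmiracle.

```python
from typing import List, Sequence


def pad_row(
    row: Sequence[float],
    real_dims: List[int],
    expected_dims: List[int],
) -> List[float]:
    """Zero-pad each segment of `row` from its real width to its expected width.

    Assumption: right-side zero padding per segment; the library may differ.
    """
    if len(real_dims) != len(expected_dims):
        raise ValueError("real_dims and expected_dims must have the same length")
    if sum(real_dims) != len(row):
        raise ValueError("row length must equal sum(real_dims)")

    padded: List[float] = []
    start = 0
    for real, expected in zip(real_dims, expected_dims):
        if expected < real:
            raise ValueError("expected dimension is smaller than real dimension")
        segment = list(row[start:start + real])
        # Append zeros on the right so the segment reaches the expected width.
        padded.extend(segment + [0.0] * (expected - real))
        start += real
    return padded


row = [1.0, 2.0, 3.0, 4.0, 5.0]
padded_row = pad_row(row, real_dims=[2, 3], expected_dims=[3, 4])
```

Here the two segments `[1.0, 2.0]` and `[3.0, 4.0, 5.0]` each gain one trailing zero, so the padded row has `sum(expected_dims)` entries.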

__getitem__(idx: int) → ndarray

Retrieve the matrix row at the specified index. The data is already padded during initialization.

Parameters: idx – int The index of the matrix row.
Returns: The matrix row as a NumPy array.
Return type: np.ndarray

__len__() → int

Return the number of rows in the matrix dataset.

Returns: Number of rows in the dataset.
Return type: int

__subset__(indices)

Create a subset of the CSV dataset based on the provided indices.

Parameters: indices – list List of indices to include in the subset.
Returns: A new CSVDataset instance containing only the specified indices.
Return type: CSVDataset

class scmiracle.data.MTXDataset(mtx_file: str, real_dims: list = None, expected_dims: list = None)

Bases: BasicModDataset

Dataset for mtx-based data.

Parameters:

mtx_file – str Path to the mtx file.
real_dims – list, optional A list of integers representing the actual dimensions of the data. Used for padding if it differs from expected_dims.
expected_dims – list, optional A list of integers representing the expected dimensions of the data after padding.

__getitem__(idx: int) → ndarray

Retrieve the matrix row at the specified index. The data is already padded during initialization.

Parameters: idx – int The index of the matrix row.
Returns: The matrix row as a NumPy array.
Return type: np.ndarray

__len__() → int

Return the number of rows in the matrix dataset.

Returns: Number of rows in the dataset.
Return type: int

__subset__(indices)

Create a subset of the mtx dataset based on the provided indices.

Parameters: indices – list List of indices to include in the subset.
Returns: A new MTXDataset instance containing only the specified indices.
Return type: MTXDataset

get_all() → ndarray

Return all data in the dataset as a NumPy array. The data is already padded during initialization.

Returns: All data in the dataset as a NumPy array.
Return type: np.ndarray

class scmiracle.data.MultiBatchContinualLearningSampler(data_source: Dataset, shuffle: bool = True, batch_size: int = 1, n_current_datasets: int = 0, n_replay_datasets: int = 0, n_max=10000)

Bases: Sampler

Custom sampler for multi-batch sampling across multiple datasets with continual learning logic.

Parameters:

data_source – Dataset Concatenated dataset containing replay and current data.
shuffle – bool Whether to shuffle the samples within each dataset for replay, default is True.
batch_size – int Number of samples per batch (for both current and replay), default is 1.
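n_current_datasets – int Number of datasets designated as ‘current’ datasets, default is 0.
n_replay_datasets – int Number of datasets designated as ‘replay’ datasets, default is 0.
n_max – int Maximum number of samples to draw from each dataset (as in MultiBatchSampler), default is 10000.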

__iter__() → Iterator[int]

Iterate over the dataset indices in a multi-batch continual learning manner, alternating between current and replay datasets.

Returns: An iterator over sampled indices.
Return type: Iterator[int]

__len__() → int

Calculate the total number of samples across all sub-datasets based on the desired sampling strategy. Each “full cycle” involves iterating through all current datasets once, potentially interleaved with replay batches. The length is primarily driven by ensuring all ‘current’ data is processed.

Returns: The total number of samples.
Return type: int
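
The alternation between current and replay data can be sketched as below. This is an illustration under assumptions, not the sampler's actual logic: the documentation does not specify the interleaving ratio or how replay data is reused, so the sketch assumes one replay batch follows each current batch and that replay batches are recycled as needed. The function name `interleave_current_and_replay` is hypothetical, and the sketch yields whole batches for readability, whereas the documented `__iter__` yields flat indices.

```python
from itertools import cycle
from typing import Iterator, List


def interleave_current_and_replay(
    current_batches: List[List[int]],
    replay_batches: List[List[int]],
) -> Iterator[List[int]]:
    """Alternate current and replay batches until all current batches are used."""
    # Replay batches are reused cyclically, so replay data never limits the epoch.
    replay_iter = cycle(replay_batches) if replay_batches else None
    for batch in current_batches:
        yield batch
        if replay_iter is not None:
            yield next(replay_iter)


current = [[10, 11], [12, 13], [14, 15]]
replay = [[0, 1], [2, 3]]
schedule = list(interleave_current_and_replay(current, replay))
```

Because the loop runs over `current_batches`, the length of the schedule is determined by the current data, which is consistent with the description of `__len__` above.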

class scmiracle.data.MultiBatchSampler(data_source: Dataset, shuffle: bool = True, batch_size: int = 1, n_max: int = 10000)

Bases: Sampler

Custom sampler for multi-batch sampling across multiple datasets.

Parameters:

data_source – Dataset The dataset to sample from, made up of multiple sub-datasets.
shuffle – bool Whether to shuffle the samples within each dataset, default is True.
batch_size – int Number of samples per batch, default is 1.
n_max – int Maximum number of samples to draw from each dataset, default is 10000.

__iter__() → Iterator[int]

Iterate over the dataset indices in a multi-batch sampling manner.

Returns: An iterator over sampled indices.
Return type: Iterator[int]

__len__() → int

Calculate the total number of samples across all sub-datasets.

Returns: The total number of samples.
Return type: int
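
One way to picture multi-batch sampling is the sketch below. It rests on assumptions the documentation does not confirm: that the data source is a concatenation of sub-datasets with known sizes, that every batch is drawn from a single sub-dataset, and that n_max caps the samples taken from each sub-dataset. The function name `iter_multi_batches` and its `dataset_sizes` and `seed` arguments are hypothetical. For clarity it yields whole batches, whereas the documented `__iter__` yields flat indices.

```python
import random
from typing import Iterator, List


def iter_multi_batches(
    dataset_sizes: List[int],
    batch_size: int = 1,
    n_max: int = 10000,
    shuffle: bool = True,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield batches of global indices, each drawn from a single sub-dataset."""
    rng = random.Random(seed)
    batches: List[List[int]] = []
    offset = 0
    for size in dataset_sizes:
        local = list(range(size))
        if shuffle:
            rng.shuffle(local)
        # Cap the number of samples drawn from this sub-dataset.
        local = local[:n_max]
        # Split into batches and shift to indices of the concatenated dataset.
        for start in range(0, len(local), batch_size):
            batches.append([offset + i for i in local[start:start + batch_size]])
        offset += size
    if shuffle:
        # Visit batches from different sub-datasets in random order.
        rng.shuffle(batches)
    yield from batches


batches = list(iter_multi_batches([5, 3], batch_size=2, n_max=4, seed=1))
```

With sub-datasets of sizes 5 and 3 and `n_max=4`, the first sub-dataset contributes four indices and the second contributes three, so seven indices are produced in total.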

class scmiracle.data.MultiModalDataset(mod_dict: Dict[str, str], mod_id_dict: Dict[str, int], file_type: Dict[str, str], mask_path: Dict[str, str] | None = None, transform: Dict[str, str] | None = None, real_dims: Dict[str, list] | None = None, expected_dims: Dict[str, list] | None = None)

Bases: Dataset

A dataset class for handling multi-modal data with optional masking and transformations.

Parameters:

mod_dict – Dict[str, str] A dictionary mapping modality names to their respective file paths.
mod_id_dict – Dict[str, int] A dictionary mapping modality names to their unique identifiers.
file_type – Dict[str, str] A dictionary mapping modality names to their file types (e.g., ‘vec’, ‘csv’, ‘mtx’).
mask_path – Optional[Dict[str, str]] A dictionary mapping modality names to their mask file paths, default is None.
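transform – Optional[Dict[str, str]] A dictionary specifying transformations to apply to each modality, default is None.
real_dims – Optional[Dict[str, list]] A dictionary mapping modality names to the actual dimensions of their data, default is None.
expected_dims – Optional[Dict[str, list]] A dictionary mapping modality names to the expected dimensions of their data after padding, default is None.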

__getitem__(idx: int) → Dict[str, Dict[str, Any]]

Retrieves the data at the specified index across all modalities.

Parameters:

idx – int The index of the sample to retrieve.

Returns:

A dictionary containing the following keys:

‘x’: Modality data at the given index, with optional transformations applied.
‘s’: Modality IDs.
‘e’: Masking information, if available.

Return type:

Dict[str, Dict[str, Any]]
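
The shape of the returned item can be sketched with plain dictionaries. This is a hedged illustration only: the helper name `build_item` is hypothetical, the inner dictionaries are assumed to be keyed by modality name (suggested by the `Dict[str, Dict[str, Any]]` return type but not stated outright), and representing missing masks as an empty ‘e’ dictionary is also an assumption.

```python
from typing import Any, Dict, List, Optional


def build_item(
    data: Dict[str, List[float]],
    mod_ids: Dict[str, int],
    masks: Optional[Dict[str, List[int]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Assemble one sample in the 'x' / 's' / 'e' layout described above."""
    item: Dict[str, Dict[str, Any]] = {
        "x": dict(data),     # per-modality data
        "s": dict(mod_ids),  # per-modality IDs
        "e": {},             # per-modality masks, left empty if none are given
    }
    if masks is not None:
        # Keep masks only for modalities that are present in this sample.
        item["e"] = {mod: masks[mod] for mod in data if mod in masks}
    return item


sample = build_item(
    data={"rna": [0.0, 3.0, 1.0], "adt": [5.0, 2.0]},
    mod_ids={"rna": 0, "adt": 1},
    masks={"rna": [1, 1, 0]},
)
```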

__len__() → int

Returns the size of the dataset.

Returns: The number of samples in the dataset.
Return type: int

__subset__(indices) → MultiModalDataset

Create a subset of the multi-modal dataset based on the provided indices.

Parameters: indices – list List of indices to include in the subset.
Returns: A new MultiModalDataset instance containing only the specified indices.
Return type: MultiModalDataset

class scmiracle.data.MyDistributedSampler(dataset: Dataset, num_replicas: int | None = None, rank: int | None = None, shuffle: bool = True, seed: int = 0, batch_size: int = 256, n_max: int = 10000)

Bases: DistributedSampler

A custom distributed sampler for datasets split across multiple replicas.

Parameters:

dataset – Dataset The dataset to sample from.
num_replicas – Optional[int] Number of replicas in the distributed setup, default is determined by torch.distributed.
rank – Optional[int] The rank of the current process, default is determined by torch.distributed.
shuffle – bool Whether to shuffle the data, default is True.
seed – int Random seed for shuffling, default is 0.
batch_size – int Number of samples per batch, default is 256.
n_max – int Maximum number of samples per dataset, default is 10000.

__iter__() → Iterator[_T_co]

Iterate over the distributed dataset, ensuring balanced sampling across replicas.

Returns: Iterator over indices for the current replica.
Return type: Iterator

__len__() → int

Calculate the number of samples in the sampler.

Returns: Number of samples across all datasets.
Return type: int
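
Balanced sampling across replicas is commonly achieved by padding the index list to a multiple of the replica count and giving each rank a strided slice, which is the approach taken by PyTorch's `DistributedSampler`. The sketch below shows that general idea in plain Python. It is not MyDistributedSampler's code: the function name `shard_indices` is hypothetical, and how this class combines sharding with batch_size and n_max is not documented here, so those parameters are left out.

```python
from typing import List


def shard_indices(indices: List[int], num_replicas: int, rank: int) -> List[int]:
    """Give each replica an equally sized, strided slice of the index list."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must satisfy 0 <= rank < num_replicas")
    if not indices:
        return []
    # Round the per-replica count up so every replica gets the same number.
    per_replica = -(-len(indices) // num_replicas)
    total = per_replica * num_replicas
    # Pad by repeating indices from the start until the list divides evenly.
    padded = list(indices)
    while len(padded) < total:
        padded.extend(indices[: total - len(padded)])
    # Replica `rank` takes every num_replicas-th index starting at `rank`.
    return padded[rank:total:num_replicas]


shards = [shard_indices(list(range(10)), num_replicas=4, rank=r) for r in range(4)]
```

With ten indices and four replicas, the list is padded to twelve entries, so each replica receives three indices and two indices appear twice across the replicas.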

class scmiracle.data.VECDataset(path: str, real_dims: list = None, expected_dims: list = None)

Bases: BasicModDataset

Dataset for vector-based data.

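Parameters:

path – str Directory containing vector-based data files.
real_dims – list, optional A list of integers representing the actual dimensions of the data. Used for padding if it differs from expected_dims.
expected_dims – list, optional A list of integers representing the expected dimensions of the data after padding.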

__getitem__(idx: int) → ndarray

Retrieve the vector data at the specified index.

Parameters: idx – int The index of the vector file.
Returns: The vector data as a NumPy array.
Return type: np.ndarray

__len__() → int

Return the number of files in the vector dataset.

Returns: Number of vector files in the dataset.
Return type: int

__subset__(indices)

Create a subset of the vector dataset based on the provided indices.

Parameters: indices – list List of indices to include in the subset.
Returns: A new VECDataset instance containing only the specified indices.
Return type: VECDataset

scmiracle.data.download_data(name: str, des: str = './')

Downloads the specified dataset and extracts it.

Parameters:

name – str Name of the dataset to download (e.g., ‘teadog_mosaic_4k’).
des – str Destination path to save the dataset (default is the current directory).

scmiracle.data.download_file(url: str, dest_path: str)

Helper function to download a file from a URL with progress display.

Parameters:

url – str URL of the file to download.
dest_path – str Local path where the downloaded file is saved.
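
A download helper with progress reporting can be sketched with the standard library alone. This is not the implementation of download_file, whose internals are not documented here: `fetch_with_progress` is a hypothetical name, and the use of `urllib.request.urlretrieve` with a `reporthook` callback is one possible approach chosen for illustration. The demonstration copies a local `file://` URL so that it runs without network access.

```python
import sys
import tempfile
import urllib.request
from pathlib import Path


def fetch_with_progress(url: str, dest_path: str) -> None:
    """Download `url` to `dest_path`, printing a simple percentage to stderr."""

    def report(block_num: int, block_size: int, total_size: int) -> None:
        # total_size is not positive when the server does not report a length.
        if total_size > 0:
            done = min(block_num * block_size, total_size)
            sys.stderr.write(f"\r{100 * done // total_size:3d}%")
            sys.stderr.flush()

    urllib.request.urlretrieve(url, dest_path, reporthook=report)
    sys.stderr.write("\n")


# Demonstrate on a local file:// URL so the example needs no network access.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_text("example payload")
target = workdir / "copy.txt"
fetch_with_progress(source.as_uri(), str(target))
```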

scmiracle.data.download_models(name: str, des: str = './')

Downloads the specified model.

Parameters:

name – str Name of the model to download (e.g., ‘wnn_mosaic_8batch_mtx’).
des – str Destination path to save the model (default is the current directory).

scmiracle.data.download_script(name: str, des: str = './')

Downloads the specified script.

Parameters:

name – str Name of the script to download (e.g., ‘wnn_bimodal.R’).
des – str Destination path to save the script (default is the current directory).

scmiracle.data.unzip_file(zip_path: str, extract_to: str)

Helper function to unzip a file.

Parameters:

zip_path – str Path to the zip file.
extract_to – str Directory to extract the contents into.
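
Extracting a zip archive to a target directory can be done with Python's `zipfile` module. The sketch below is an assumption about what such a helper might look like, not the source of unzip_file: the name `extract_zip` is hypothetical, and returning the member names is an addition made here for illustration. The example builds a small archive in a temporary directory and then extracts it.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_zip(zip_path: str, extract_to: str) -> List[str]:
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        # extractall creates the target directory if it does not exist yet.
        archive.extractall(extract_to)
        return archive.namelist()


# Build a small archive in a temporary directory, then extract it.
tmp = Path(tempfile.mkdtemp())
archive_path = tmp / "demo.zip"
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("data/readme.txt", "hello")
out_dir = tmp / "out"
names = extract_zip(str(archive_path), str(out_dir))
```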
