Tutorial-1: Continual Integration of Mosaic Single-cell Data (RNA+ADT+ATAC)#

Objective:

In this tutorial, we demonstrate how to perform continual integration on a mosaic single-cell dataset spanning eight batches with diverse modality combinations.

Workflow:

Preparation: Set up the environment and download the mosaic dataset.
Base Training: Train an initial model on the first batch (Batch 1).
Continual Integration: Incrementally update the model with incoming data (Batches 2-8) without accessing previous raw data.
(optional) Packaging: Merge datasets.

Step 1: Initialization and Environment Setup#

We start by importing necessary libraries, defining global configurations, and setting a random seed to ensure reproducibility.

[1]:

import warnings
warnings.filterwarnings('ignore')
import os
import lightning as L
import numpy as np
import pandas as pd
import scanpy as sc

# Import custom modules from the scmiracle library
from scmiracle.config import load_config
from scmiracle.data import download_data
from scmiracle.model import MIDAS
from scmiracle.plot import plot_umap
from scmiracle.utils import calculate_loss_scale

# --- Global Configurations ---
# Use os.path.join for cross-platform path compatibility
BASE_DATA_PATH = './'
DATASET_NAME = 'DOTEA_mtx'
DATA_PATH = os.path.join(BASE_DATA_PATH, 'dataset', DATASET_NAME)
SAVE_MODEL_DIR = './'

# --- GPU Configuration ---
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# --- Scanpy Plotting Parameters ---
sc.set_figure_params(figsize=(4, 4))

# --- Global Random Seed ---
# Set a global random seed for reproducibility
L.seed_everything(42)

# --- Dataset and Model Parameters ---
# Number of cells per batch, used for calculating loss weights
NUM_CELLS_PER_BATCH = np.array([9579, 9096, 7318, 4817, 5342, 6181, 4163, 6860])
# Number of workers for data loading; adjust based on your CPU core count
NUM_WORKERS = 64
# Training epochs: 2000 for the base model, 1000 for each incremental step
NUM_EPOCH = [2000] + [1000] * 7

Seed set to 42

Step 2: Data Preparation#

We download the demonstration dataset.

Note on Data Structure: This dataset is pre-processed to simulate an online learning scenario.

RNA & ADT: Features in new batches are aligned with existing features (Reference features first, New features last).
ATAC: Features are consistent across all batches.

[ ]:

# Download the dataset
download_data(DATASET_NAME, BASE_DATA_PATH)

# Load Ground Truth labels for evaluation
labels = [pd.read_csv(os.path.join(DATA_PATH, f'label/P{i+1}_l1.csv'), index_col=0).values.flatten() for i in range(8)]

Step 3: Base Model Training (Batch 1)#

We follow the standard MIDAS procedure to build a robust initial model using only the first batch. This establishes the “base knowledge.”

[3]:

print("\n--- Stage 1: Training the base model on Batch 1 ---")

# Reset the MIDAS model state for a clean start
MIDAS.reset()
configs = load_config()
configs['num_workers'] = NUM_WORKERS

# Configure the data and model from the first batch directory
model = MIDAS.configure_data_from_dir(
    configs=configs,
    dir_path=os.path.join(DATA_PATH, 'data','DOTEA_P1'),
    save_model_path=os.path.join(SAVE_MODEL_DIR, 'DOTAE_P1'),
    format='mtx',
    transform={'atac': 'binarize'} # Binarize ATAC data
)

# Initialize and fit with a Lightning Trainer (approx. 100 mins for 2000 epochs)
trainer = L.Trainer(max_epochs=NUM_EPOCH[0])
trainer.fit(model=model)
# model.load_checkpoint(os.path.join(SAVE_MODEL_DIR, 'saved_models', 'DOTAE_P1_mtx', 'model.pt'))

# Generate predictions and visualize the latent space
pred = model.predict(return_format='anndata')['DOTEA_P1']
pred.obs['label'] = labels[0]
plot_umap(pred, color=['batch', 'label'], wspace=0.4)

INFO:root:The model is initialized with the default configurations.


--- Stage 1: Training the base model on Batch 1 ---

INFO:root:Input data:
          #CELL  #ADT  #ATAC  #VALID_ADT
DOTEA_P1   9579   208  31787         208
INFO:root:Defining new network structure...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 10.5 M | train
1 | dsc  | Discriminator | 38.8 K | train
-----------------------------------------------
10.6 M    Trainable params
0         Non-trainable params
10.6 M    Total params
42.258    Total estimated model params size (MB)
640       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 9579 from 1 datasets.
INFO:root:Using MultiBatchSampler for data loading.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P1/model_epoch500_20260411-022103.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P1/model_epoch1000_20260411-030236.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P1/model_epoch1500_20260411-034423.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P1/model_epoch2000_20260411-042617.pt".
`Trainer.fit` stopped: `max_epochs=2000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P1/model_epoch2000_20260411-042618.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:03<00:00, 11.94it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_6_5.png

Step 4: Continual Integration (Batches 2–8)#

In this stage, we simulate a stream of incoming data. We update the model incrementally with Batches 2 to 8.

Key Concept: calculate_loss_scale is used to balance the model’s plasticity (learning new data) and stability (remembering old data).

[4]:

print("\n--- Stage 2: Continual Training on Batches 2-8 ---")

for step in range(1, 8):
    print(f"\n--- Processing Batch {step + 1} ---")

    # 1. Calculate Loss Scaling
    # Balances the weight of the previous model (old knowledge) vs. current batch (new knowledge)
    scale = calculate_loss_scale(NUM_CELLS_PER_BATCH[:step], NUM_CELLS_PER_BATCH[[step]])

    # 2. Configure continual integration model
    # Note: We use 'configure_new_data_from_dir' to extend the existing model
    miracle = MIDAS.configure_new_data_from_dir(
        configs=configs,
        dir_path=os.path.join(DATA_PATH, 'data', f'DOTEA_P{step+1}'),
        save_model_path=os.path.join(SAVE_MODEL_DIR, f'DOTAE_P{step+1}'),
        format='mtx',
        transform={'atac': 'binarize'},
        scale=scale
    )

    # 3. Update the Model
    trainer = L.Trainer(max_epochs=NUM_EPOCH[step])
    trainer.fit(model=miracle)
    # model.load_checkpoint(os.path.join(SAVE_MODEL_DIR, 'saved_models', f'DOTAE_P{step+1}', 'model.pt'))

    # 4. Visualization
    # Concatenate predictions from all batches processed so far to check integration
    pred = sc.concat(miracle.predict(return_format='anndata'))
    pred.obs['label'] = np.concatenate(labels[:step+1])
    plot_umap(pred, color=['batch', 'label'], wspace=0.4)


--- Stage 2: Continual Training on Batches 2-8 ---

--- Processing Batch 2 ---

INFO:root:Input data:
          #CELL   #ADT  #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  208.0  31787     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787  4052.0         NaN      4052.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 18.8 M | train
1 | dsc  | Discriminator | 51.8 K | train
-----------------------------------------------
18.9 M    Trainable params
0         Non-trainable params
18.9 M    Total params
75.533    Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 18675 from 2 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P2/model_epoch500_20260411-054838.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P2/model_epoch1000_20260411-071043.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P2/model_epoch1000_20260411-071043.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:04<00:00,  9.29it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:04<00:00,  8.46it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_4.png


--- Processing Batch 3 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  208.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  4959.0         NaN      4052.0
DOTEA_P3   7318  208.0      NaN  4959.0       208.0      4063.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 20.7 M | train
1 | dsc  | Discriminator | 52.0 K | train
-----------------------------------------------
20.7 M    Trainable params
0         Non-trainable params
20.7 M    Total params
82.968    Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 25993 from 3 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P3/model_epoch500_20260411-080938.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P3/model_epoch1000_20260411-090817.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P3/model_epoch1000_20260411-090818.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:04<00:00,  8.36it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:04<00:00,  7.73it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:03<00:00,  7.25it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_9.png


--- Processing Batch 4 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  208.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  5439.0         NaN      4052.0
DOTEA_P3   7318  208.0      NaN  5439.0       208.0      4063.0
DOTEA_P4   4817  208.0  31787.0  5439.0       208.0      4050.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 21.7 M | train
1 | dsc  | Discriminator | 52.3 K | train
-----------------------------------------------
21.7 M    Trainable params
0         Non-trainable params
21.7 M    Total params
86.903    Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 30810 from 4 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P4/model_epoch500_20260411-095435.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P4/model_epoch1000_20260411-103852.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P4/model_epoch1000_20260411-103853.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:05<00:00,  7.60it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:04<00:00,  7.36it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:04<00:00,  6.78it/s]
INFO:root:Processing batch DOTEA_P4: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:04<00:00,  4.16it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_14.png


--- Processing Batch 5 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  213.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  5439.0         NaN      4052.0
DOTEA_P3   7318  213.0      NaN  5439.0       208.0      4063.0
DOTEA_P4   4817  213.0  31787.0  5439.0       208.0      4050.0
DOTEA_P5   5342  213.0  31787.0     NaN        45.0         NaN
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 21.7 M | train
1 | dsc  | Discriminator | 52.5 K | train
-----------------------------------------------
21.7 M    Trainable params
0         Non-trainable params
21.7 M    Total params
86.945    Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 36152 from 5 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P5/model_epoch500_20260411-112513.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P5/model_epoch1000_20260411-121040.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P5/model_epoch1000_20260411-121041.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:04<00:00,  7.68it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:04<00:00,  7.29it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:04<00:00,  6.60it/s]
INFO:root:Processing batch DOTEA_P4: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:04<00:00,  3.96it/s]
INFO:root:Processing batch DOTEA_P5: ['adt', 'atac']
100%|██████████| 21/21 [00:04<00:00,  4.53it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_19.png


--- Processing Batch 6 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  213.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  6606.0         NaN      4052.0
DOTEA_P3   7318  213.0      NaN  6606.0       208.0      4063.0
DOTEA_P4   4817  213.0  31787.0  6606.0       208.0      4050.0
DOTEA_P5   5342  213.0  31787.0     NaN        45.0         NaN
DOTEA_P6   6181    NaN  31787.0  6606.0         NaN      4064.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 24.1 M | train
1 | dsc  | Discriminator | 52.7 K | train
-----------------------------------------------
24.1 M    Trainable params
0         Non-trainable params
24.1 M    Total params
96.511    Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 42333 from 6 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P6/model_epoch500_20260411-130601.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P6/model_epoch1000_20260411-140002.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P6/model_epoch1000_20260411-140003.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:05<00:00,  6.85it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:05<00:00,  6.04it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:05<00:00,  5.68it/s]
INFO:root:Processing batch DOTEA_P4: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:05<00:00,  3.62it/s]
INFO:root:Processing batch DOTEA_P5: ['adt', 'atac']
100%|██████████| 21/21 [00:05<00:00,  3.92it/s]
INFO:root:Processing batch DOTEA_P6: ['atac', 'rna']
100%|██████████| 25/25 [00:05<00:00,  4.67it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_24.png


--- Processing Batch 7 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  213.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  7077.0         NaN      4052.0
DOTEA_P3   7318  213.0      NaN  7077.0       208.0      4063.0
DOTEA_P4   4817  213.0  31787.0  7077.0       208.0      4050.0
DOTEA_P5   5342  213.0  31787.0     NaN        45.0         NaN
DOTEA_P6   6181    NaN  31787.0  7077.0         NaN      4064.0
DOTEA_P7   4163  213.0      NaN  7077.0        45.0      4071.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 25.0 M | train
1 | dsc  | Discriminator | 52.9 K | train
-----------------------------------------------
25.1 M    Trainable params
0         Non-trainable params
25.1 M    Total params
100.372   Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 46496 from 7 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P7/model_epoch500_20260411-143356.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P7/model_epoch1000_20260411-150627.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P7/model_epoch1000_20260411-150628.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:05<00:00,  6.73it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:05<00:00,  6.18it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:05<00:00,  5.41it/s]
INFO:root:Processing batch DOTEA_P4: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:05<00:00,  3.56it/s]
INFO:root:Processing batch DOTEA_P5: ['adt', 'atac']
100%|██████████| 21/21 [00:05<00:00,  3.92it/s]
INFO:root:Processing batch DOTEA_P6: ['atac', 'rna']
100%|██████████| 25/25 [00:05<00:00,  4.42it/s]
INFO:root:Processing batch DOTEA_P7: ['adt', 'rna']
100%|██████████| 17/17 [00:04<00:00,  3.47it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_29.png


--- Processing Batch 8 ---

INFO:root:Input data:
          #CELL   #ADT    #ATAC    #RNA  #VALID_ADT  #VALID_RNA
DOTEA_P1   9579  213.0  31787.0     NaN       208.0         NaN
DOTEA_P2   9096    NaN  31787.0  7304.0         NaN      4052.0
DOTEA_P3   7318  213.0      NaN  7304.0       208.0      4063.0
DOTEA_P4   4817  213.0  31787.0  7304.0       208.0      4050.0
DOTEA_P5   5342  213.0  31787.0     NaN        45.0         NaN
DOTEA_P6   6181    NaN  31787.0  7304.0         NaN      4064.0
DOTEA_P7   4163  213.0      NaN  7304.0        45.0      4071.0
DOTEA_P8   6860  213.0  31787.0  7304.0        45.0      4064.0
INFO:root:Loading pre-defined network structure...
INFO:root:Model1 updated successfully with weights from model2 (left-aligned strategy).
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type          | Params | Mode
-----------------------------------------------
0 | net  | VAE           | 25.5 M | train
1 | dsc  | Discriminator | 53.1 K | train
-----------------------------------------------
25.6 M    Trainable params
0         Non-trainable params
25.6 M    Total params
102.233   Total estimated model params size (MB)
666       Modules in train mode
0         Modules in eval mode
INFO:root:Total number of samples: 53356 from 8 datasets.
INFO:root:DataLoader created with batch size 256 and 64 workers.

INFO:root:Checkpoint successfully saved to "./DOTAE_P8/model_epoch500_20260411-161143.pt".
INFO:root:Checkpoint successfully saved to "./DOTAE_P8/model_epoch1000_20260411-171546.pt".
`Trainer.fit` stopped: `max_epochs=1000` reached.
INFO:root:Checkpoint successfully saved to "./DOTAE_P8/model_epoch1000_20260411-171547.pt".
INFO:root:Predicting using device: cuda
INFO:root:Predicting ...
INFO:root:Processing batch DOTEA_P1: ['adt', 'atac']
100%|██████████| 38/38 [00:06<00:00,  6.13it/s]
INFO:root:Processing batch DOTEA_P2: ['atac', 'rna']
100%|██████████| 36/36 [00:06<00:00,  5.83it/s]
INFO:root:Processing batch DOTEA_P3: ['adt', 'rna']
100%|██████████| 29/29 [00:05<00:00,  5.05it/s]
INFO:root:Processing batch DOTEA_P4: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:06<00:00,  3.14it/s]
INFO:root:Processing batch DOTEA_P5: ['adt', 'atac']
100%|██████████| 21/21 [00:05<00:00,  3.58it/s]
INFO:root:Processing batch DOTEA_P6: ['atac', 'rna']
100%|██████████| 25/25 [00:06<00:00,  4.13it/s]
INFO:root:Processing batch DOTEA_P7: ['adt', 'rna']
100%|██████████| 17/17 [00:05<00:00,  3.08it/s]
INFO:root:Processing batch DOTEA_P8: ['adt', 'atac', 'rna']
100%|██████████| 27/27 [00:06<00:00,  4.33it/s]
... storing 'batch' as categorical
... storing 'label' as categorical

../../_images/tutorials_basics_demo1_8_34.png

Step 5: Packaging for Future Use#

After integrating all batches, we can “pack” the unified dataset. This creates a reference that can be loaded directly in future sessions, skipping the incremental training steps.

5.1 Pack the Data#

[5]:

# Save the unified dataset to a new directory
model.pack_data(os.path.join(DATA_PATH, 'data', 'DOTEA_miracle'))
print(f"Data successfully packed to {os.path.join(DATA_PATH, 'data','DOTEA_miracle')}")

INFO:root:Processing batch 0: ['adt', 'atac']
100%|██████████| 38/38 [00:05<00:00,  6.57it/s]
INFO:root:Saving concatenated data for batch 0...
INFO:root:Processing batch 1: ['atac', 'rna']
100%|██████████| 36/36 [00:05<00:00,  6.21it/s]
INFO:root:Saving concatenated data for batch 1...
INFO:root:Processing batch 2: ['adt', 'rna']
100%|██████████| 29/29 [00:05<00:00,  4.88it/s]
INFO:root:Saving concatenated data for batch 2...
INFO:root:Processing batch 3: ['adt', 'atac', 'rna']
100%|██████████| 19/19 [00:05<00:00,  3.24it/s]
INFO:root:Saving concatenated data for batch 3...
INFO:root:Processing batch 4: ['adt', 'atac']
100%|██████████| 21/21 [00:05<00:00,  3.56it/s]
INFO:root:Saving concatenated data for batch 4...
INFO:root:Processing batch 5: ['atac', 'rna']
100%|██████████| 25/25 [00:05<00:00,  4.23it/s]
INFO:root:Saving concatenated data for batch 5...
INFO:root:Processing batch 6: ['adt', 'rna']
100%|██████████| 17/17 [00:06<00:00,  2.81it/s]
INFO:root:Saving concatenated data for batch 6...
INFO:root:Processing batch 7: ['adt', 'atac', 'rna']
100%|██████████| 27/27 [00:06<00:00,  4.36it/s]
INFO:root:Saving concatenated data for batch 7...

Data successfully packed to ./dataset/DOTEA_mtx/data/DOTEA_miracle

5.2 (Optional) How to Reuse Packed Data#

The following code snippet demonstrates how to load this packed dataset in a new Python session.

[6]:

# --- Example Usage in a New Session ---

# MIDAS.reset()
# configs = load_config()
# configs['num_workers'] = NUM_WORKERS

# # 1. Initialize directly from the packed directory
# model = MIDAS.configure_data_from_dir(
#     configs=configs,
#     dir_path=os.path.join(DATA_PATH, 'data', 'DOTEA_miracle'),
#     save_model_path='your_path_to_save_model',
#     format='mtx',
#     transform={'atac': 'binarize'}
# )

# # 2. Load the weights from the final checkpoint of Stage 2
# model.load_checkpoint('your_path_of_checkpoint')

# # 3. Ready for further analysis or adding Batch 9...
# miracle = MIDAS.configure_new_data_from_dir(
#     configs=configs,
#     dir_path='your_path_of_new_data',
#     # ... parameters ...
# )

Tutorial-1: Continual Integration of Mosaic Single-cell Data (RNA+ADT+ATAC)

Contents