Re-engineering the PerturBench benchmarking tasks with data lineage ¶

Ishita Jain*, Altana Namsaraeva*, Sunny Sun, Yan Wu, Alex Wolf

PerturBench (Wu, Wershof, Schmon, Nassar, Osinski, Eksi, Yan, et al., NeurIPS 2025) is a framework for benchmarking machine learning models that predict cellular transcriptional response to perturbations. Its core contributions are benchmarking tasks in the form of curated datasets and definitions of metrics, which are available from GitHub and Hugging Face, albeit without data lineage. To make it easy to see how exactly each dataset came about and assess model performance in light of that context, we re-ran all curation workflows using lineage tracking. We exemplify model training and evaluation, and show equivalence of the re-curated datasets with the originally deposited datasets.

While the situation has been improving in recent years through efforts like PerturBench[1], published scRNA-seq-based models have often been evaluated on inconsistent benchmarks, making it hard to know what works and going counter to the fact that machine learning breakthroughs need well-curated datasets and well-defined tasks. PerturBench is one of several efforts in the field and was preceded by an Open Problems benchmark, published at NeurIPS 2024[2]. Another example for a similar benchmarking effort is last year’s Arc Virtual Cell Challenge. For a video introduction to PerturBench, you can watch this episode of Valence Labs’ MultiOmics Reading Group.

The NeurIPS PerturBench submission hosts its six curated benchmarking datasets on Hugging Face. These files, however, don’t reveal how the curation was done.

To make data lineage easy to browse and understand, we re-ran all curation steps with ln.track() added to the source code. You can explore the result by clicking on the link in the “Dataset + lineage” column of the following table.

Reference	Perturbation	N
Norman19[3]	Genetic	91,168 cells
Srivatsan20[4]	Chemical	178,213 cells
Frangieh21[5]	Genetic	218,331 cells
McFaline24[6]	Genetic	892,800 cells
Jiang25[7]	Genetic	1,628,476 cells
Szalata24 (OP3)[2]	Chemical	298,087 cells

On a high level, the data flow is:

Raw data ingestion: Each dataset has a dedicated notebook (prefixed with ingestion_) that ingests raw datasets.
Curation: Curation notebooks (prefixed with curate_) handle format conversion, preprocessing, metadata harmonization, and split generation.
Training and eval: Curated datasets are loaded to train and evaluate models using the PerturBench Python framework.

For a comparison that shows the equivalence of the original datasets and the re-curated datasets, explore altoslabs/perturbench/transform/3bZAUr0kXokI. For a full training and model evaluation run, explore altoslabs/perturbench/transform/9dPgCiisCm1w.

This post was motivated by the desire to reproduce PerturBench’s training and eval results in a file-centric manner, omitting the detailed modeling of perturbational & biological metadata. Modeling and validating perturbations will be the topic of an upcoming post.

Author contributions¶

* These authors contributed equally.

Ishita & Altana performed the computational work.

Sunny & Yan advised on the project. Yan developed the original curation notebooks & scripts together with the authors of the original publication.

Alex supervised the project.

Code & data availability¶

Database: lamin.ai/altoslabs/perturbench. Repo: github.com/altoslabs/perturbench.

How to cite¶

Jain I, Namsaraeva A, Sun S, Wu Y & Wolf A (2026). Re-engineering the PerturBench benchmarking tasks with data lineage. Lamin Blog. https://blog.lamin.ai/perturbench

Re-engineering the PerturBench benchmarking tasks with data lineage ¶

Author contributions¶

Code & data availability¶

How to cite¶

References¶