Re-engineering the PerturBench benchmarking tasks with data lineage

PerturBench (Wu, Wershof, Schmon, Nassar, Osinski, Eksi, Yan, et al., NeurIPS 2025) is a framework for benchmarking machine learning models that predict cellular transcriptional response to perturbations. Its core contributions are benchmarking tasks in the form of curated datasets and definitions of metrics, which are available from GitHub and Hugging Face, albeit without data lineage. To make it easy to see exactly how each dataset came about and to assess model performance in light of that context, we re-ran all curation workflows with lineage tracking. We also demonstrate model training and evaluation, and show that the re-curated datasets are equivalent to the originally deposited ones.

Published scRNA-seq-based models have often been evaluated on inconsistent benchmarks, which makes it hard to know what works. This runs counter to the fact that machine learning breakthroughs depend on well-curated datasets and well-defined tasks. The situation has been improving in recent years through efforts like PerturBench[1]. PerturBench is one of several such efforts and was preceded by an Open Problems benchmark, published at NeurIPS 2024[2]. Another similar benchmarking effort is last year’s Arc Virtual Cell Challenge. For a video introduction to PerturBench, you can watch this episode of Valence Labs’ MultiOmics Reading Group.

The NeurIPS PerturBench submission hosts its six curated benchmarking datasets on Hugging Face. These files, however, don’t reveal how the curation was done.

To make data lineage easy to browse and understand, we re-ran all curation steps with ln.track() added to the source code. You can explore the result by clicking on the link in the “Dataset + lineage” column of the following table.

| Reference | Perturbation | N | Dataset + lineage (click the image to explore) |
| --- | --- | --- | --- |
| Norman19[3] | Genetic | 91,168 cells | Lineage |
| Srivatsan20[4] | Chemical | 178,213 cells | Lineage |
| Frangieh21[5] | Genetic | 218,331 cells | Lineage |
| McFaline24[6] | Genetic | 892,800 cells | Lineage |
| Jiang25[7] | Genetic | 1,628,476 cells | Lineage |
| Szalata24 (OP3)[2] | Chemical | 298,087 cells | Lineage |

On a high level, the data flow is:

  1. Raw data ingestion: Each dataset has a dedicated notebook (prefixed with ingestion_) that ingests its raw data.

  2. Curation: Curation notebooks (prefixed with curate_) handle format conversion, preprocessing, metadata harmonization, and split generation.

  3. Training and eval: Curated datasets are loaded to train and evaluate models using the PerturBench Python framework.
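To make the idea behind this flow concrete, here is a minimal, self-contained sketch of what recording lineage across steps amounts to. It is a toy illustration in plain Python, not how ln.track() is implemented and not the code in the actual notebooks: the names `LineageLog`, `content_hash`, `ingest`, and `curate`, as well as the example rows, are hypothetical. Only the first two steps are shown; a training step would be logged the same way.

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_hash(obj) -> str:
    """Short, stable SHA-256 hash of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageLog:
    """Append-only record of which step turned which input into which output."""

    records: list = field(default_factory=list)

    def run(self, step_name, func, data):
        output = func(data)
        self.records.append(
            {
                "step": step_name,
                "input": content_hash(data),
                "output": content_hash(output),
            }
        )
        return output


# Hypothetical stand-ins for an ingestion_ and a curate_ notebook.
def ingest(raw_rows):
    # Keep only rows that carry a perturbation label.
    return [row for row in raw_rows if row.get("perturbation")]


def curate(rows):
    # Harmonize labels and assign a deterministic train/test split.
    return [
        {
            "cell_id": row["cell_id"],
            "perturbation": row["perturbation"].strip().upper(),
            "split": "test" if i % 4 == 3 else "train",
        }
        for i, row in enumerate(rows)
    ]


raw = [
    {"cell_id": "c1", "perturbation": "klf1 "},
    {"cell_id": "c2", "perturbation": ""},
    {"cell_id": "c3", "perturbation": "control"},
]

log = LineageLog()
ingested = log.run("ingestion_toy", ingest, raw)
curated = log.run("curate_toy", curate, ingested)

# The input hash of each step equals the output hash of the previous one,
# which is what allows a lineage graph to be reconstructed afterwards.
for record in log.records:
    print(record)
```

The point of the sketch is the chaining: because every step logs both what it read and what it wrote, any curated file can be traced back to the raw data and the code that produced it.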

For a comparison that shows the equivalence of the original datasets and the re-curated datasets, explore altoslabs/perturbench/transform/3bZAUr0kXokI. For a full training and model evaluation run, explore altoslabs/perturbench/transform/9dPgCiisCm1w.
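As a rough illustration of what such an equivalence check can look like, the sketch below compares two small tables column by column, using exact equality for labels and a numeric tolerance for floating-point values. It is an assumption-laden toy, not the comparison run in the linked transform: the function `tables_equivalent`, the tolerances, and the example tables are all hypothetical.

```python
import math


def tables_equivalent(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two tables given as {column: list_of_values}.

    Returns (ok, reason). Column order is ignored, row order is not.
    """
    if set(a) != set(b):
        return False, f"column mismatch: {sorted(set(a) ^ set(b))}"
    for column in sorted(a):
        if len(a[column]) != len(b[column]):
            return False, f"row count differs in {column!r}"
        for i, (x, y) in enumerate(zip(a[column], b[column])):
            both_numeric = isinstance(x, (int, float)) and isinstance(y, (int, float))
            if both_numeric:
                same = math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            else:
                same = x == y
            if not same:
                return False, f"value differs in {column!r} at row {i}"
    return True, "equivalent"


original = {
    "cell_id": ["c1", "c2"],
    "perturbation": ["KLF1", "control"],
    "expr": [1.25, 3.5],
}
# Same content, different column order, tiny floating-point noise.
recurated = {
    "perturbation": ["KLF1", "control"],
    "cell_id": ["c1", "c2"],
    "expr": [1.25 + 1e-9, 3.5],
}
# A genuinely different label in the second row.
changed = dict(recurated, perturbation=["KLF1", "CEBPA"])

print(tables_equivalent(original, recurated))
print(tables_equivalent(original, changed))
```

Tolerating small numeric differences matters in practice because re-running preprocessing on different hardware or library versions can change the last few digits of expression values without changing the dataset in any meaningful way.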

This post was motivated by the desire to reproduce PerturBench’s training and eval results in a file-centric manner, omitting the detailed modeling of perturbational & biological metadata. Modeling and validating perturbations will be the topic of an upcoming post.

Author contributions

* These authors contributed equally.

Ishita & Altana performed the computational work.

Sunny & Yan advised on the project. Yan developed the original curation notebooks & scripts together with the authors of the original publication.

Alex supervised the project.

Code & data availability

Database: lamin.ai/altoslabs/perturbench. Repo: github.com/altoslabs/perturbench.

How to cite

Jain I, Namsaraeva A, Sun S, Wu Y & Wolf A (2026). Re-engineering the PerturBench benchmarking tasks with data lineage. Lamin Blog. https://blog.lamin.ai/perturbench

References