## Re-engineering the PerturBench benchmarking tasks with data lineage

PerturBench (Wu, Wershof, Schmon, Nassar, Osinski, Eksi, Yan, et al.,
NeurIPS 2025) is a framework for benchmarking machine learning models
that predict cellular transcriptional responses to perturbations. Its
core contributions are benchmarking tasks in the form of curated
datasets and metric definitions, which are available from GitHub and
Hugging Face, albeit without data lineage. To make it easy to see
exactly how each dataset came about and to assess model performance in
that context, we re-ran all curation workflows with lineage tracking.
We exemplify model training and evaluation, and show that the
re-curated datasets are equivalent to the originally deposited ones.

While the situation has been improving in recent years through efforts
like PerturBench[1], published scRNA-seq-based models have often been
evaluated on inconsistent benchmarks. This makes it hard to know what
works, even though machine learning breakthroughs depend on
well-curated datasets and well-defined tasks. PerturBench is one of
several efforts in the field and was preceded by an Open Problems
benchmark, published at NeurIPS 2024[2]. Another similar benchmarking
effort is last year's Arc Virtual Cell Challenge. For a video
introduction to PerturBench, you can watch this episode of Valence
Labs' MultiOmics Reading Group.

The NeurIPS PerturBench submission hosts its six curated benchmarking
datasets on Hugging Face. These files, however, don't reveal how the
curation was done.

To make data lineage easy to browse and understand, we re-ran all
curation steps with `ln.track()` added to the source code. You can
explore the result by clicking on the link in the "Dataset + lineage"
column of the following table.
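
The pattern is the same in every curation notebook; here is a minimal sketch (the file path and description are illustrative, not the actual ones used, and running it requires a configured lamindb instance):

```python
import lamindb as ln

# start tracking this notebook as a transform so that the run,
# its inputs, and its outputs are captured as data lineage
ln.track()

# ... curation logic: format conversion, preprocessing,
# metadata harmonization, split generation ...

# register the curated dataset; it now appears with full lineage
ln.Artifact(
    "curated/norman19.h5ad",  # illustrative path
    description="Norman19 curated for PerturBench",  # illustrative description
).save()

ln.finish()  # mark the run as finished
```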

| Reference | Perturbation | *N* | Dataset + lineage (click the image to explore) |
| --- | --- | --- | --- |
| Norman19[3] | Genetic | 91,168 cells | |
| Srivatsan20[4] | Chemical | 178,213 cells | |
| Frangieh21[5] | Genetic | 218,331 cells | |
| McFaline24[6] | Genetic | 892,800 cells | |
| Jiang25[7] | Genetic | 1,628,476 cells | |
| Szalata24 (OP3)[2] | Chemical | 298,087 cells | |

At a high level, the data flow is:

1. Raw data ingestion: Each dataset has a dedicated notebook (prefixed
 with `ingestion_`) that ingests the raw data.

2. Curation: Curation notebooks (prefixed with `curate_`) handle
 format conversion, preprocessing, metadata harmonization, and split
 generation.

3. Training and eval: Curated datasets are loaded to train and
 evaluate models using the `PerturBench` Python framework.
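
To give a flavor of what split generation in step 2 involves, here is a toy, self-contained sketch: perturbation-prediction benchmarks typically hold out entire perturbations so models are evaluated on conditions they never saw during training. All names below are hypothetical; the actual logic lives in the `curate_`-prefixed notebooks.

```python
import random

def make_splits(perturbations, holdout_frac=0.2, seed=0):
    """Toy split generation: hold out a fraction of perturbations
    entirely for the test set, so that every cell carrying a held-out
    perturbation lands in "test" and all others in "train"."""
    rng = random.Random(seed)
    unique = sorted(set(perturbations))
    n_holdout = max(1, int(len(unique) * holdout_frac))
    test_perts = set(rng.sample(unique, n_holdout))
    return ["test" if p in test_perts else "train" for p in perturbations]

# one label per cell, as found in a typical obs column
cells = ["KLF1", "BAK1", "KLF1", "CEBPE", "control", "BAK1"]
splits = make_splits(cells)
```

The key design choice is that the split is made at the level of perturbations rather than cells: two cells with the same perturbation always end up in the same split.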

For a comparison that shows the equivalence of the original datasets
and the re-curated datasets, explore
altoslabs/perturbench/transform/3bZAUr0kXokI. For a full training and
model evaluation run, explore
altoslabs/perturbench/transform/9dPgCiisCm1w.
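
Conceptually, such an equivalence check boils down to comparing content fingerprints of the original and re-curated datasets, rather than requiring byte-identical files. A minimal stdlib sketch of the idea (all names are hypothetical; the actual comparison is in the linked transform):

```python
import hashlib

def fingerprint(matrix, metadata):
    """Hash a dataset's expression values and cell metadata into a
    content fingerprint, insensitive to the order of metadata records."""
    h = hashlib.sha256()
    for row in matrix:
        # round-trip floats through a fixed format so tiny
        # representation differences don't change the hash
        h.update(",".join(f"{v:.6g}" for v in row).encode())
    for record in sorted(map(tuple, metadata)):
        h.update(repr(record).encode())
    return h.hexdigest()

original = fingerprint(
    [[0.0, 1.5], [2.0, 0.0]],
    [("cell1", "KLF1"), ("cell2", "control")],
)
recurated = fingerprint(
    [[0.0, 1.5], [2.0, 0.0]],
    [("cell2", "control"), ("cell1", "KLF1")],  # same content, different order
)
assert original == recurated
```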

This post was motivated by the desire to reproduce PerturBench's
training and eval results in a file-centric manner, omitting the
detailed modeling of perturbational & biological metadata. Modeling
and validating perturbations will be the topic of an upcoming post.

## Author contributions

"*" These authors contributed equally.

Ishita & Altana performed the computational work.

Sunny & Yan advised on the project. Yan developed the original
curation notebooks & scripts together with the authors of the original
publication.

Alex supervised the project.

## Code & data availability

Database: lamin.ai/altoslabs/perturbench. Repo:
github.com/altoslabs/perturbench.

## How to cite

Jain I, Namsaraeva A, Sun S, Wu Y & Wolf A (2026). Re-engineering the PerturBench benchmarking tasks with data lineage. Lamin Blog. https://blog.lamin.ai/perturbench

## References


[1] Wu Y, Wershof E, Schmon SM, Nassar M, Osiński B, Eksi R, Yan Z,
 Stark R, Zhang K & Graepel T (2025). PerturBench: Benchmarking
 Machine Learning Models for Cellular Perturbation Analysis.
 NeurIPS.

[2] Szałata A, Benz A, Cannoodt R, Cortes M, Fong J, Kuppasani S,
 Lieberman R, Liu T, Mas-Rosario JA, Meinl R, Nourisa J, Tumiel J,
 Tunjic TM, Wang M, Weber N, Zhao H, Anchang B, Theis FJ, Luecken
 MD & Burkhardt DB (2024). A Benchmark for Prediction of
 Transcriptomic Responses to Chemical Perturbations Across Cell
 Types. NeurIPS.

[3] Norman TM, Horlbeck MA, Replogle JM, Ge AY, Xu A, Jost M, Gilbert
 LA & Weissman JS (2019). Exploring genetic interaction manifolds
 constructed from rich single-cell phenotypes. Science.

[4] Srivatsan SR, McFaline-Figueroa JL, Ramani V, Saunders L, Cao J,
 Packer J, Pliner HA, Jackson DL, Daza RM, Christiansen L, Zhang
 DA, Steemers F, Shendure J & Trapnell C (2020). Massively
 multiplex chemical transcriptomics at single-cell resolution.
 Science.

[5] Frangieh CJ, Melms JC, Thakore PI, Geiger-Schuller KR, Ho P, Luoma
 AM, Cleary B, Jerby-Arnon L, Garg S, Regev A & Izar B (2021).
 Multimodal pooled Perturb-CITE-seq screens in patient models
 define mechanisms of cancer immune evasion. Nature Genetics.

[6] McFaline-Figueroa JL, Srivatsan SR, Hill AJ, Gasperini M, Jackson
 DL, Saunders L, Domcke S, Regalado SG, Lazarchuck P, Alvarez S,
 Monnat RJ Jr, Shendure J & Trapnell C (2024). Multiplex single-
 cell chemical genomics reveals the kinase dependence of the
 response to targeted therapy. Cell Genomics.

[7] Jiang L, Dalgarno C, Papalexi E, Mascio I, Wessels HH, Yun H &
 Satija R (2025). Systematic reconstruction of molecular pathway
 signatures using scalable single-cell perturbation screens. Nature
 Cell Biology.