⸻ 2026-07-10

Scaling anndata training to the terabyte scale with annbatch

Felix Fischer, Ilan Gold, Fabian Theis, Alex Wolf

The demand for AI in omics has grown at an unprecedented rate, with state-of-the-art models now routinely trained on datasets exceeding the terabyte scale. To make that process more efficient, we developed annbatch,[1] a high-performance data loader built on anndata that reaches loading speeds of 60k samples/second or more, at least 3 times faster than the fastest recent alternatives.

⸻ 2026-06-09

Simpler queries for the 2.5B transcriptional profiles of the Arc Virtual Cell Atlas

Sunny Sun, Sergei Rybakov, Frederic Enard, Chaichontat Sriworarat, Alex Wolf

With 2.5B expression profiles that map to about 600M cells, the Arc Virtual Cell Atlas offers the world’s largest collection of uniformly processed scRNA-seq datasets. Arc Institute distributes the atlas as 460k parquet and h5ad files totaling 41 TB on Google Cloud Storage. We present a database mirror that offers queries by entities, a graphical user interface, and zero-copy, lineage-aware sharing of datasets.

⸻ 2026-05-12

Re-engineering the PerturBench benchmarking tasks with data lineage

Ishita Jain, Altana Namsaraeva, Sunny Sun, Yan Wu, Alex Wolf

PerturBench (Wu, Wershof, Schmon, Nassar, Osinski, Eksi, Yan, et al., NeurIPS 2025) is a framework for benchmarking machine learning models that predict cellular transcriptional response to perturbations. Its core contributions are benchmarking tasks in the form of curated datasets and definitions of metrics, which are available from GitHub and Hugging Face, albeit without data lineage. To make it easy to see how exactly each dataset came about and assess model performance in light of that context, we re-ran all curation workflows using lineage tracking. We exemplify model training and evaluation, and show equivalence of the re-curated datasets with the originally deposited datasets.

⸻ 2026-04-23

How I managed thousands of datasets to build the scPRINT family of scRNA-seq foundation models

Jeremie Kalfon

At the start of my PhD, I was faced with what seemed like a mountain to climb: build, largely alone, a foundation model for single-cell RNA-seq data. As anyone in the field knows, building the model is not the hard part. Getting the data is.

⸻ 2026-04-15

Managing spatial omics datasets with SpatialData & LaminDB

Lukas Heumos, Altana Namsaraeva, Tim Treis, Mark Keller, Wouter-Michiel Vierdag, Luca Marconato, Lea Zimmermann, Sunny Sun, Alex Wolf

Spatial omics technologies — Xenium, Visium, MERFISH, seqFISH, and others — are generating datasets that combine molecular profiling with spatial coordinates. The SpatialData framework[1] provides a unified format for these heterogeneous datasets: images, segmentation masks, point clouds, shapes, and count tables, all stored in a single .zarr store. But as spatial datasets accumulate across experiments and technologies, managing, querying, and training models on them becomes a major challenge. To address this, we have built native SpatialData support into LaminDB, enabling cross-dataset queries, dataset validation, and lineage tracking.

⸻ 2026-04-02

An introduction to LaminR

Tyler Burns

Any data scientist will tell you that a key to a successful project is strong data management. If your data are disorganized, you don’t know who did what, or you can’t reproduce results, it will come back to bite you and your team. Thus, teams should carefully plan how they handle, store, modify, and track data throughout a project. Here, we’ll be using an example from single-cell analysis to illustrate how the open-source LaminR package helps with traceability and reproducibility of data analyses in R.

⸻ 2026-03-04

A data lakehouse for biology's sparse measurements

Jesse Johnson

One avenue into the future of biotech is scaled learning from multi-modal datasets. Given that the union of these datasets can easily span millions of sparse features, they can’t be queried through any established data infrastructure. Warehouses are too rigid, data lakes can’t be queried, and tabular lakehouses don’t understand the formats. Biology needs a data lakehouse with support for bio-formats and registries.

⸻ 2026-03-02

Interactive visualization of multimodal and spatial data with Vitessce

Mark Keller, Altana Namsaraeva, Alex Wolf, Chaichontat Sriworarat, Sunny Sun

The open-source tool Vitessce and Lamin now work together to manage & visualize multimodal and spatial single-cell data. It’s simple: define a Vitessce config in code, save it as an artifact, and share the interactive visualization along with your datasets on LaminHub.

⸻ 2026-02-27

Symbolic memory for biological R&D

Alex Wolf

What should the shared memory layer for agents and humans look like? Will it live in embeddings or in records? A high-level note.

⸻ 2024-04-03

MappedCollection: Weighted random sampling from large collections of scRNA-seq datasets

Sergei Rybakov, Felix Fischer, Maciek Wiatrak, Ilan Gold, Yanay Rosen, Sunny Sun, Chaichontat Sriworarat, Fabian Theis, Jeremie Kalfon, Alex Wolf

A few labs and companies now train models on large-scale scRNA-seq count matrices and related data modalities. But unlike for many other data types, there isn’t yet a playbook for data scales that don’t fit into memory.
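
To make the idea in the title concrete, here is a minimal sketch of weighted random sampling across a collection of datasets. It is not the MappedCollection implementation: the function name `sample_weighted`, its parameters, and the two-stage scheme (first pick a dataset by weight, then pick a row uniformly within it) are assumptions chosen for illustration, and only dataset sizes are needed, so nothing has to be loaded into memory to draw indices.

```python
import numpy as np


def sample_weighted(dataset_sizes, dataset_weights, n_samples, seed=0):
    """Draw (dataset index, row index) pairs from a collection of datasets.

    Hypothetical helper for illustration only.
    Stage 1 picks a dataset with probability proportional to its weight.
    Stage 2 picks a row uniformly at random inside the chosen dataset.
    """
    rng = np.random.default_rng(seed)
    sizes = np.asarray(dataset_sizes, dtype=np.int64)
    weights = np.asarray(dataset_weights, dtype=np.float64)
    probs = weights / weights.sum()  # normalize weights to probabilities

    # Stage 1: one dataset index per sample, drawn with replacement.
    dataset_idx = rng.choice(len(sizes), size=n_samples, p=probs)
    # Stage 2: a row index in [0, size of the chosen dataset) per sample.
    row_idx = rng.integers(0, sizes[dataset_idx])
    return dataset_idx, row_idx


# Two datasets of very different size, weighted equally so that the
# small one is not drowned out by the large one.
dataset_idx, row_idx = sample_weighted(
    dataset_sizes=[10, 1000], dataset_weights=[1.0, 1.0], n_samples=20000
)
```

With equal weights, about half of the draws come from each dataset even though one is 100 times larger. Weights proportional to dataset size would instead recover uniform sampling over all rows.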

⸻ 2022-08-29

nbproject: Manage Jupyter notebooks

Sergei Rybakov, Lukas Heumos, Alex Wolf

nbproject is an open-source Python tool to help manage Jupyter notebooks with metadata, dependency, and integrity tracking. A draft-to-publish workflow creates more reproducible notebooks with context.

⸻ 2022-08-27

readfcs: Read FCS files

Sunny Sun, Alex Wolf

readfcs is a lightweight open-source Python package that loads data and metadata from Flow Cytometry Standard (FCS) files into DataFrame and AnnData objects, allowing users to flexibly use downstream analytical tools.

⸻ 2022-07-31

Key problems of data-heavy R&D

Sunny Sun, Alex Wolf

The complexity of modern R&D data often blocks the scientific progress it promises.

⸻ 2022-05-04

Hello world!

Sunny Sun, Alex Wolf

We just launched lamin.ai as a place for sharing prototypes with our beta customers and collaborators. Over time, we’ll add public releases and use this blog to explain our work.