Simpler queries for the 2.5B transcriptional profiles of the Arc Virtual Cell Atlas

With 2.5B expression profiles that map to about 600M cells, the Arc Virtual Cell Atlas offers the world’s largest collection of uniformly processed scRNA-seq datasets. Arc Institute distributes the atlas as 460k Parquet and h5ad files totaling 41 TB on Google Cloud Storage. We present a database mirror that supports queries by entity, a graphical user interface, and zero-copy, lineage-aware sharing of datasets.

For example, you might want to find datasets for human brain samples linked to glioblastoma that were processed with a certain pipeline. In the original atlas,[1] this requires scanning directories and parquet files. In a database, you can express queries through the entities you care about: the organism, tissue, disease, and the processing pipeline. The screenshot shows how this works on lamin.ai/laminlabs/arc-virtual-cell-atlas:

[Screenshot: LaminHub artifacts page filtered by organism and tissue metadata]

The same query can also be expressed in Python or R:

```python
import lamindb as ln

# Connect to the mirror of the atlas
db = ln.DB("laminlabs/arc-virtual-cell-atlas")

# Look up the entities to filter by
scbase = db.Project.get(name="scBaseCount")
gbm = db.bionty.Disease.get(name="glioblastoma multiforme")
brain = db.bionty.Tissue.get(name="brain")
human = db.bionty.Organism.get(name="human")
factors = db.bionty.ExperimentalFactor.filter(name__in=["10x_Genomics", "3_prime_gex"])
genefull = db.ULabel.get(name="GeneFull_Ex50pAS")

# Combine the entities into a single query over datasets
datasets = db.Artifact.filter(
    projects=scbase,
    diseases=gbm,
    tissues=brain,
    organisms=human,
    experimental_factors__in=factors,
    ulabels=genefull,
    is_latest=True,
)
```

```r
library(laminr)
ln <- laminr::import_module("lamindb")

# Connect to the mirror of the atlas
db <- ln$DB("laminlabs/arc-virtual-cell-atlas")

# Look up the entities to filter by
scbase <- db$Project$get(name = "scBaseCount")
gbm <- db$bionty$Disease$get(name = "glioblastoma multiforme")
brain <- db$bionty$Tissue$get(name = "brain")
human <- db$bionty$Organism$get(name = "human")
factors <- db$bionty$ExperimentalFactor$filter(name__in = c("10x_Genomics", "3_prime_gex"))
genefull <- db$ULabel$get(name = "GeneFull_Ex50pAS")

# Combine the entities into a single query over datasets
datasets <- db$Artifact$filter(
  projects = scbase,
  diseases = gbm,
  tissues = brain,
  organisms = human,
  experimental_factors__in = factors,
  ulabels = genefull,
  is_latest = TRUE
)
```

Queried datasets can then be transferred, loaded, cached, or streamed for cell-level slicing:

```python
first_dataset = datasets[0]
first_dataset.save()  # zero-copy transfer into your own database
adata = first_dataset.load()  # cache and load into memory
local_filepath = first_dataset.cache()  # cache and return file path
with first_dataset.open() as adata:  # stream slices from cloud storage
    ...
```

```r
first_dataset <- datasets[[1]]
first_dataset$save()  # zero-copy transfer into your own database
adata <- first_dataset$load()  # cache and load into memory
local_filepath <- first_dataset$cache()  # cache and return file path
with(first_dataset$open(), {  # stream slices from cloud storage
  ...
})
```

Under the hood, these methods preserve a run object that points back to the original dataset in the Arc database so that downstream processing can be traced back to the source. For example, here we used the .save() method to sync the Tahoe-100M datasets into a database for benchmarking different ML data loaders:

[Screenshot: LaminHub example of lineage-aware syncing of Tahoe-100M datasets]

By applying fast data loaders such as annbatch[2] or scdataset[3] to locally cached arrays, one can achieve loading rates of 50k to 80k vectors/second. Here is an example for such a data loading run.
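To put that throughput in perspective, the short calculation below estimates how long one full pass over a dataset would take at the quoted rates. The cell count of 100M is a nominal figure taken from the name Tahoe-100M and is an assumption, as is the idea that the rate stays constant over the whole pass.

```python
def minutes_per_pass(n_vectors: int, vectors_per_second: float) -> float:
    """Minutes needed to load n_vectors once at a constant rate."""
    return n_vectors / vectors_per_second / 60


# Assumption: nominal cell count implied by the name "Tahoe-100M"
n_cells = 100_000_000

for rate in (50_000, 80_000):
    print(f"{rate:,} vectors/s -> {minutes_per_pass(n_cells, rate):.1f} min per pass")
# 50,000 vectors/s -> 33.3 min per pass
# 80,000 vectors/s -> 20.8 min per pass
```

Under these assumptions, a single pass over a dataset of that size takes on the order of 20 to 35 minutes.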

For a detailed walk-through, read the tutorial: docs.lamin.ai/arc-virtual-cell-atlas.

Entities

The database is organized around entity types that are familiar from single-cell analysis workflows. In LaminDB, these entity types map to biological ontologies and experimental registries through an adaptation of the Django ORM. You can explore them on the UI and in the API reference:

| Entity | Examples | Source |
| --- | --- | --- |
| Organism | Homo sapiens, Mus musculus, … | Sample / study metadata |
| Tissue | brain, liver, … | Sample metadata |
| Disease | study-level disease annotations | Sample metadata (see Arc note on study-level disease) |
| CellLine | Cellosaurus IDs, common names | scBaseCount sample fields; Tahoe cell_line / cell_name |
| ExperimentalFactor | single-cell vs nucleus, 10x chemistry, … | lib_prep, tech_10x, cell_prep, etc. |
| Compound | compounds, concentrations | drug, drugname_drugconc |
| Project | scBaseCount, Tahoe-100M | Dataset program |
| STARsolo count feature | Gene, GeneFull_Ex50pAS, Velocyto, … | scBaseCount feature types |
| Release | version_tag e.g. 2026-01-12 | scBaseCount release folder |
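Because the registries follow Django ORM conventions, filter keywords such as `name__in` or `experimental_factors__in` use a double underscore to separate a field name from a lookup operator, and a keyword without an operator is treated as an exact match. As a rough illustration of that naming convention only, the hypothetical helper `split_lookup` below decomposes such a keyword. It is not part of LaminDB or Django, and the operator set is a small, assumed subset.

```python
# Assumption: a small, illustrative subset of Django-style lookup operators
OPERATORS = {"in", "exact", "contains", "startswith", "gt", "lt"}


def split_lookup(keyword: str) -> tuple[str, str]:
    """Split a filter keyword into (field, operator); default operator is 'exact'."""
    field, sep, last = keyword.rpartition("__")
    if sep and last in OPERATORS:
        return field, last
    return keyword, "exact"


print(split_lookup("name__in"))                  # ('name', 'in')
print(split_lookup("experimental_factors__in"))  # ('experimental_factors', 'in')
print(split_lookup("is_latest"))                 # ('is_latest', 'exact')
```

Note that the single underscore in `is_latest` is part of the field name; only a double underscore introduces an operator.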

Releases

The Arc Virtual Cell Atlas combines two major data resources: Tahoe-100M[4] and scBaseCount.[1] scBaseCount comes with two releases, which we mirror. You can use the version_tag to select a release or keep the default of is_latest=True to select the latest release. For Tahoe-100M, the latest release is 2025-02-25.

| version_tag | Arc release | Scale |
| --- | --- | --- |
| 2026-01-12 | Publication release (current) | >502M cells, 27 organisms, 5 STARsolo count features |
| 2025-02-25 | Initial release | >230M cells, 21 organisms |
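The choice between a pinned release and the latest one can be captured in a small helper that builds the filter keywords. This is a minimal sketch: `release_filter` is a hypothetical name, and it assumes that `version_tag` is accepted as a filter keyword in the same way as `is_latest` in the query example.

```python
from typing import Optional


def release_filter(version_tag: Optional[str] = None) -> dict:
    """Return filter keywords for one release, or for the latest release."""
    if version_tag is None:
        # No tag given: fall back to the latest release
        return {"is_latest": True}
    # Tag given: pin the query to that release (assumed filter keyword)
    return {"version_tag": version_tag}


print(release_filter())              # {'is_latest': True}
print(release_filter("2025-02-25"))  # {'version_tag': '2025-02-25'}
```

With the `db` handle and the `scbase` project from the query example, the result could be passed on as `db.Artifact.filter(projects=scbase, **release_filter("2025-02-25"))`.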

Other atlases

laminlabs/arc-virtual-cell-atlas exists alongside laminlabs/cellxgene, laminlabs/hubmap, and other public atlases mirrored as LaminDB instances at lamin.ai/explore, allowing the same query patterns to be reused across multiple resources.
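One way to reuse a query pattern is to write it as a function of the database handle. The sketch below wraps the organism and tissue lookups from the query example in a hypothetical function, `human_brain_datasets`. It assumes that the other instances expose the same registries and relation names (`organisms`, `tissues`) and use the same entity names, which should be checked per instance.

```python
def human_brain_datasets(db):
    """Run the same entity-based query against any database handle."""
    human = db.bionty.Organism.get(name="human")
    brain = db.bionty.Tissue.get(name="brain")
    return db.Artifact.filter(organisms=human, tissues=brain)


# Intended use (needs lamindb and network access, so it is not run here):
#   import lamindb as ln
#   for slug in ("laminlabs/arc-virtual-cell-atlas", "laminlabs/cellxgene"):
#       datasets = human_brain_datasets(ln.DB(slug))
```

Keeping the handle as the only argument means the query logic lives in one place, while each atlas is selected by its instance slug.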

Code & data availability

Acknowledgements

We’re grateful to the creators of the original resource[1][4] for sharing it publicly on a scalable storage backend. We’re particularly grateful to Nicholas Youngblut for helping with questions regarding the structure of the atlas and reviewing the tutorial.

Author contributions

Sunny created the database as a mirror of the Arc Virtual Cell Atlas. Sergei developed the data layer, Fred the backend, and Chaichontat the frontend. Alex supervised the project.

How to cite

Please cite the original references! If the mirror is useful to you, consider citing:

Sun S, Rybakov S, Enard F, Sriworarat C & Wolf A (2026). Simpler queries for the 2.5B transcriptional profiles of the Arc Virtual Cell Atlas. Lamin Blog. https://blog.lamin.ai/arc-virtual-cell-atlas

References