# Key problems of data-heavy R&D
The complexity of modern R&D data often blocks the scientific progress it promises.
Here, we list the key problems we see and how we think about solving them.
## Data can’t be accessed

| Problem | Description | Solution |
|---|---|---|
| Object storage. | Data in object storage can’t be queried. | Index observations and variables and link them in a query database. |
| Pile of data. | Data can’t be accessed as it’s unstructured and siloed in fragmented infrastructure. | Structure data both by biological entities and by provenance, with one interface across storage and database backends. |
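To make the indexing idea concrete, here is a minimal sketch: files stay in object storage while a lightweight SQL database records which variables each file measures, so data becomes queryable without downloading it. All table and column names here are hypothetical illustrations, not LaminDB's actual schema or API.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE artifact (id INTEGER PRIMARY KEY, key TEXT);          -- object-storage path
CREATE TABLE variable (id INTEGER PRIMARY KEY, name TEXT);         -- e.g. a gene symbol
CREATE TABLE artifact_variable (artifact_id INTEGER, variable_id INTEGER);
""")

# Register two files and the variables measured in them.
con.execute("INSERT INTO artifact VALUES (1, 's3://bucket/batch1.parquet')")
con.execute("INSERT INTO artifact VALUES (2, 's3://bucket/batch2.parquet')")
con.execute("INSERT INTO variable VALUES (1, 'CD8A')")
con.execute("INSERT INTO variable VALUES (2, 'FOXP3')")
con.executemany("INSERT INTO artifact_variable VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1)])

# Query: which files in the bucket measure FOXP3?
rows = con.execute("""
    SELECT a.key FROM artifact a
    JOIN artifact_variable av ON av.artifact_id = a.id
    JOIN variable v ON v.id = av.variable_id
    WHERE v.name = 'FOXP3'
""").fetchall()
# rows == [('s3://bucket/batch1.parquet',)]
```

The point is the separation of concerns: bulk data stays cheap in object storage while the index answers entity-level queries.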
## Data can’t be accessed at scale

| Problem | Description | Solution |
|---|---|---|
| Anecdotal data. | Data can’t be accessed at scale as no viable programmatic interfaces exist. | API-first platform. |
| Cross-storage integration. | Molecular (high-dimensional) data can’t be efficiently integrated with phenotypic (low-dimensional) data. | Index molecular data with the same biological entities as phenotypic data. Provide connectors for low-dimensional data management systems (ELN & LIMS systems). |
## Scientific results aren’t solid

| Problem | Description | Solution |
|---|---|---|
| Stand on solid ground. | Key analytics results cannot be linked to supporting data as too many processing steps are involved. | Provide full data provenance. |
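A hedged sketch of what "full data provenance" buys: if every artifact records the run that produced it and that run's inputs, any result can be traced back through arbitrarily many processing steps to its supporting raw data. The function and data names below are illustrative assumptions, not LaminDB's API.

```python
runs = {}         # run_id -> list of input artifact keys
produced_by = {}  # artifact key -> run_id

def track(run_id, inputs, output):
    """Record that `run_id` consumed `inputs` and produced `output`."""
    runs[run_id] = inputs
    produced_by[output] = run_id

def lineage(artifact):
    """Return all upstream artifacts supporting `artifact`."""
    upstream, stack = [], [artifact]
    while stack:
        a = stack.pop()
        for inp in runs.get(produced_by.get(a), []):
            upstream.append(inp)
            stack.append(inp)
    return upstream

track("qc", ["raw.fastq"], "filtered.fastq")
track("align", ["filtered.fastq"], "counts.h5ad")
track("analysis", ["counts.h5ad"], "figure3.pdf")

ancestors = lineage("figure3.pdf")
# ancestors == ['counts.h5ad', 'filtered.fastq', 'raw.fastq']
```

With this chain in place, a figure in a report is never an orphan: walking the graph backwards always terminates at raw measurements.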
## Collaboration across organizations is hard

| Problem | Description | Solution |
|---|---|---|
| Siloed infrastructure. | Data can’t be easily shared across organizations. | Federated collaboration hub on distributed infrastructure. |
| Siloed semantics. | External data can’t be mapped onto in-house data and vice versa. | Provide a curation and ingestion API and operate on open-source data models that can be adopted by any organization. |
## R&D could be more effective

| Problem | Description | Solution |
|---|---|---|
| Optimal decision making. | There is no framework for tracking decision making in complex R&D teams. | Graph of data flow in the R&D team, including scientists, computation, decisions, and predictions. Unlike workflow frameworks, LaminDB creates an emergent graph. |
| Dry lab is not integrated. | Data platforms offer no adequate interface for the dry lab. | API-first, with data scientists’ needs in mind. |
| Support learning. | There is no support for the learning-from-data cycle. | Support data models across the full lab cycle, including measured → relevant → derived features. Manage knowledge through rich semantic models that map high-dimensional data. |
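The contrast with workflow frameworks can be sketched as follows: instead of declaring a pipeline upfront, edges are recorded as work happens (who ran what on which data), and the graph of the team's decision making emerges from usage. Actor, action, and artifact names below are made up for illustration.

```python
edges = []  # directed edges (source, target) of an emergent graph

def log(actor, action, inputs, outputs):
    """Record one unit of work as it happens, not as it was planned."""
    for i in inputs:
        edges.append((i, action))
    edges.append((actor, action))
    for o in outputs:
        edges.append((action, o))

log("alice", "train-model", inputs=["screen-data"], outputs=["predictions"])
log("bob", "go/no-go decision", inputs=["predictions"], outputs=["candidate-list"])

# Trace what informed a decision by walking incoming edges:
parents = [src for src, dst in edges if dst == "go/no-go decision"]
# parents == ['predictions', 'bob']
```

Because nothing is declared in advance, the same mechanism captures ad-hoc analyses and human decisions alongside scheduled computation.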
## No support for basic R&D operations

| Problem | Description | Solution |
|---|---|---|
| Development data. | Data associated with assay development can’t be ingested as data models are too rigid. | Allow partial integrity in LaminDB’s implementation of a data lakehouse: ingest data of any curation level and label it with corresponding QC flags. |
| Corrupted data. | Data is often corrupted. | Full provenance allows tracing corruption back to its origin and writing a simple fix, typically in the form of an ingestion constraint. |
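"Partial integrity" can be sketched in a few lines: instead of rejecting records that fail validation, ingest everything and attach QC flags, so early assay-development data coexists with fully curated data and can be filtered on demand. The field names and flag format here are hypothetical, not LaminDB's implementation.

```python
from dataclasses import dataclass, field

REQUIRED = {"sample_id", "assay"}  # assumed minimal curation schema

@dataclass
class Record:
    data: dict
    qc_flags: list = field(default_factory=list)

def ingest(data: dict) -> Record:
    """Ingest any record; flag missing required fields instead of rejecting."""
    flags = [f"missing:{k}" for k in sorted(REQUIRED - data.keys())]
    return Record(data=data, qc_flags=flags)

curated = ingest({"sample_id": "S1", "assay": "scRNA-seq"})
dev = ingest({"sample_id": "S2"})  # assay not yet decided during development
# curated.qc_flags == []
# dev.qc_flags == ['missing:assay']
```

Downstream consumers can then require `qc_flags == []` for production analyses while development data remains queryable rather than lost.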
## Building a data platform is hard

| Problem | Description | Solution |
|---|---|---|
| Aligning data models. | Data models are hard to align across interdisciplinary stakeholders. | Lamin’s data model templates cover 90% of cases; the remaining 10% can be configured. |
| Lock-in. | Existing platforms lock organizations into specific cloud infrastructure. | Open-source, multi-cloud stack with no lock-in. |
| Migrations are a pain. | Migrating data models in a fast-paced R&D environment can be prohibitive. | LaminDB’s schema modules migrate automatically. |
Note: This problem statement was originally published as part of the lamindb docs. It remained prominently linked from the about page of lamin.ai while traveling through several repositories with small edits: within lamindb, within lamin-about, within lamin-docs. It was moved to the blog page on 2023-08-11 and will remain there unmodified.