LaminDB — Biological Data Management
Overview
LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.
When to Use
- Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
- Tracking computational lineage (which code produced which data)
- Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
- Building queryable data lakehouses across multiple experiments
- Ensuring reproducibility with automatic environment and provenance capture
- Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
- Standardizing metadata with ontology-based annotation (Bionty)
- Not for the single-cell analysis itself (clustering, differential expression): use scanpy for that
- Not needed for ontology lookups without data management: use bionty directly
Prerequisites
pip install lamindb
# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'
Setup: Requires instance initialization before use:
lamin login
lamin init --storage ./my-data --name my-project
# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...
Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).
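The three instance types differ only in the flags passed to `lamin init`. As a minimal sketch, assuming only the `--storage`, `--name`, and `--db` flags shown above, a small helper can compose the command line for each setup. The helper name `lamin_init_command` is hypothetical and not part of the lamin CLI.

```python
import shlex
from typing import Optional


def lamin_init_command(storage: str, name: str, db: Optional[str] = None) -> str:
    """Compose a `lamin init` command line from the flags shown above.

    Hypothetical helper for illustration; it only builds a string and
    does not run anything.
    """
    args = ["lamin", "init", "--storage", storage, "--name", name]
    if db is not None:
        # Passing a database URL selects PostgreSQL instead of SQLite.
        args += ["--db", db]
    # shlex.join quotes arguments that a shell would otherwise split.
    return shlex.join(args)


# Local SQLite (development)
print(lamin_init_command("./my-data", "my-project"))
# Cloud storage + SQLite (small teams)
print(lamin_init_command("s3://my-bucket", "my-project"))
```

Passing a `db` value appends `--db` with that URL, which corresponds to the Cloud + PostgreSQL setup.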
Quick Start
import lamindb as ln
ln.track() # Start lineage tracking
# Save an artifact
import pandas as pd
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")
# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")
ln.finish()
Core API
1. Artifacts — Data Objects
Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).
import lamindb as ln
import pandas as pd
import anndata as ad
ln.track()
# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {artifact.uid}, Version: {artifact.version}")
# From AnnData
adata = ad.read_h5ad("counts.h5ad")
artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()
# From file path
artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()
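# Load back (fetch the DataFrame artifact saved above; `artifact` now points at the PNG)
table = ln.Artifact.get(key="experiments/batch1.parquet")
df_loaded = table.load() # Returns an in-memory object (DataFrame, AnnData, ...)
path = table.cache() # Downloads if needed and returns the local file path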
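# Versioning: save a new version that revises the earlier DataFrame artifact
previous = ln.Artifact.get(key="experiments/batch1.parquet")
df_updated = df.assign(value=df["value"] * 2) # Placeholder for the updated table
artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=previous).save()
print(f"v1: {previous.uid}, v2: {artifact_v2.uid}")
print(f"Latest version: {artifact_v2.is_latest}")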
# Delete (moves to trash by default; permanent deletion must be requested explicitly)
artifact.delete(permanent=False) # Move to trash
# artifact.delete(permanent=True) # Permanent deletion
2. Lineage Tracking
Automatic provenance capture for reproducibility.
import lamindb as ln
# Start tracking: captures the notebook/script, environment, and user
ln.track(params={"method": "PCA", "n_components": 50})
# Artifacts loaded or saved between ln.track() and ln.finish() are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()
# ... analysis code ...
output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()
# View lineage graph
output.view_lineage()
ln.finish() # Finalize tracking
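The track/finish pairing above can be wrapped so that a run is only finalized when the analysis completes. This is a sketch of one possible pattern, not a LaminDB feature: `tracked_run` is a hypothetical helper that takes the two functions as arguments, so it can be tried with stand-ins before wiring in `ln.track` and `ln.finish`.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, Optional


@contextmanager
def tracked_run(
    track: Callable[..., Any],
    finish: Callable[[], Any],
    params: Optional[Dict[str, Any]] = None,
) -> Iterator[None]:
    """Call `track` on entry and `finish` only if the body succeeds."""
    track(params=params)
    yield
    # Reached only when the with-body raised no exception.
    finish()


# Stand-in callables record the call order; with a configured instance
# one would pass ln.track and ln.finish instead (assumption, untested here).
calls = []
with tracked_run(
    lambda params: calls.append(("track", params)),
    lambda: calls.append(("finish",)),
    params={"method": "PCA", "n_components": 50},
):
    calls.append(("analysis",))

print(calls)
```

If the body raises, `finish` is skipped and the exception propagates. Whether an unfinished run is the right signal for a failed analysis is a design choice left to the project.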
3. Querying and Filtering
Search and filter artifacts by metadata, features, and annotations.
import lamindb as ln
# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")
# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1000000
).df()
# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()
# Single record retrieval
artifact = ln.Artifact.get(key="results/final.parquet") # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none() # Returns None if missing
# Full-text search
results = ln.Artifact.search("gene expression PBMC")
# Streaming large files (without full load into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open() # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]
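The filters above use Django-style lookups, where a field name and an operator are joined by a double underscore (`key__startswith`, `created_at__gte`, `size__gt`). As a rough sketch, optional search criteria can be collected into a keyword dictionary first and then unpacked into `filter()`. The helper `artifact_filters` and its parameter names are hypothetical; only the three lookups already shown above are used.

```python
from typing import Any, Dict, Optional


def artifact_filters(
    key_prefix: Optional[str] = None,
    created_since: Optional[str] = None,
    min_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Build filter keyword arguments, skipping criteria that are unset."""
    filters: Dict[str, Any] = {}
    if key_prefix is not None:
        filters["key__startswith"] = key_prefix
    if created_since is not None:
        filters["created_at__gte"] = created_since
    if min_size is not None:
        filters["size__gt"] = min_size
    return filters


filters = artifact_filters(key_prefix="scrna/", min_size=1_000_000)
print(filters)
# With a configured instance, the dictionary would be unpacked as:
# ln.Artifact.filter(**filters).df()
```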
4. Annotation and Validation
Curate datasets against schemas and ontology terms.
import lamindb as ln
import bionty as bt
# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})
# Validate with schema
adata = artifact.load()
# `schema` is assumed to be an ln.Schema defined or fetched beforehand
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")
# Standardize cell type names using ontology
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
5. Biological Ontologies (Bionty)
Access standardized biological vocabularies for annotation.
import bionty as bt
# Available ontologies
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBItaxon)
# Import and search ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper").df() # Convert the query result to a DataFrame
print(results.head())
# Get specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")
# Explore hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")
# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
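A boolean mask like the one above is usually more useful once it is paired back with the terms. The sketch below separates the terms that passed from those that need curation. It assumes only that the mask is an iterable of booleans in the same order as the input. `split_by_validation` is a hypothetical helper, and the mask here is a hard-coded stand-in for the `validate()` result.

```python
from typing import Iterable, List, Sequence, Tuple


def split_by_validation(
    terms: Sequence[str], mask: Iterable[bool]
) -> Tuple[List[str], List[str]]:
    """Split terms into (validated, not validated) using a boolean mask."""
    flags = list(mask)
    if len(flags) != len(terms):
        raise ValueError("mask and terms must have the same length")
    valid: List[str] = []
    invalid: List[str] = []
    for term, ok in zip(terms, flags):
        (valid if ok else invalid).append(term)
    return valid, invalid


terms = ["T cell", "B cell", "Unknown_type"]
mask = [True, True, False]  # stand-in for bt.CellType.validate(terms)
valid, invalid = split_by_validation(terms, mask)
print("validated:", valid)
print("needs curation:", invalid)
```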
6. Collections and Organization
Group related artifacts for batch operations.
import lamindb as ln
# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
# lamindb 1.x identifies collections by `key`; releases before 1.0 used `name`
collection = ln.Collection(artifacts, key="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.key}, {collection.artifacts.count()} artifacts")
# Query collection
for artifact in collection.artifacts.all():
    print(f" {artifact.key}: {artifact.size} bytes")
# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
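The key convention above can be enforced with a small helper so that every artifact key has the same four levels. This is a minimal sketch: `build_key` is a hypothetical function, and rejecting empty parts or embedded slashes is an illustrative choice rather than a LaminDB requirement.

```python
def build_key(project: str, experiment: str, datatype: str, filename: str) -> str:
    """Join the four levels of the project/experiment/datatype/file convention."""
    parts = [project, experiment, datatype, filename]
    for part in parts:
        # An empty part or an embedded slash would change the number of levels.
        if not part or "/" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return "/".join(parts)


key = build_key("immunology", "exp42", "scrna", "counts.h5ad")
print(key)
```

The resulting string can be passed as the `key` argument when creating an artifact, and a prefix such as `immunology/exp42/` then works with `key__startswith` filters.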
Key Concepts
Core Entity Model
| Entity | Purpose | Example |
|---|---|---|
| Artifact | Versioned data object | counts.h5ad, results.parquet |
| Run | Single code execution | Notebook run, script execution |
| Transform | Code definition (notebook, script, pipeline) | analysis.ipynb |
| Feature | Typed metadata field | tissue, condition, batch |
| Collection | Group of related artifacts | "Experiment batches" |
| ULabel | Universal label for custom categorization | "high_quality", "pilot" |
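To make the relationship between the lineage entities concrete, the toy model below links an artifact to the run that produced it and the run to its transform. It is a deliberately simplified illustration with hypothetical class and field names, not LaminDB's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTransform:
    name: str  # code definition, e.g. "analysis.ipynb"


@dataclass(frozen=True)
class ToyRun:
    transform: ToyTransform  # one execution of that code


@dataclass(frozen=True)
class ToyArtifact:
    key: str
    run: Optional[ToyRun] = None  # the run that produced it, if tracked


def producing_code(artifact: ToyArtifact) -> Optional[str]:
    """Answer "which code produced this data?" by walking artifact -> run -> transform."""
    return artifact.run.transform.name if artifact.run else None


notebook = ToyTransform(name="analysis.ipynb")
run = ToyRun(transform=notebook)
tracked = ToyArtifact(key="processed/pca.h5ad", run=run)
untracked = ToyArtifact(key="raw/counts.h5ad")

print(producing_code(tracked))
print(producing_code(untracked))
```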
Data Types Supported
| Format | Method | Use Case |
|---|---|---|
| DataFrame | Artifact.from_df() | Tabular data, metadata tables |
| AnnData | Artifact.from_anndata() | Single-cell data |
| MuData | Artifact.from_mudata() | Multi-modal data |
| Any file | Artifact("path") | Images, FASTQ, custom formats |
| Zarr | Via zarr extra | Large |