LaminDB — Biological Data Management

Overview

LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.

When to Use

Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
Tracking computational lineage (which code produced which data)
Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
Building queryable data lakehouses across multiple experiments
Ensuring reproducibility with automatic environment and provenance capture
Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
Standardizing metadata with ontology-based annotation (Bionty)
Not for single-cell analysis pipelines (clustering, DE); use scanpy instead
Not for ontology lookups without data management; use bionty directly

Prerequisites

pip install lamindb
# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'

Setup: an instance must be initialized before first use:

lamin login
lamin init --storage ./my-data --name my-project
# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...

Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).
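
When instances are created from scripts, it can help to assemble the `lamin init` command programmatically. The sketch below only builds the command string from the flags shown above (`--storage`, `--name`, `--db`); the helper name `lamin_init_command` and the example database URL are hypothetical, and nothing is executed.

```python
import shlex
from typing import Optional


def lamin_init_command(storage: str, name: str, db: Optional[str] = None) -> str:
    """Build a `lamin init` command line (hypothetical helper, does not run it)."""
    args = ["lamin", "init", "--storage", storage, "--name", name]
    if db is not None:
        # Only add --db when a database URL is given, as in the cloud example above
        args += ["--db", db]
    # shlex.join quotes any argument that needs it for a POSIX shell
    return shlex.join(args)


local_cmd = lamin_init_command("./my-data", "my-project")
cloud_cmd = lamin_init_command(
    "s3://my-bucket", "my-project", db="postgresql://user:pw@host:5432/db"  # made-up URL
)
print(local_cmd)
print(cloud_cmd)
```

The returned string can be reviewed or logged before it is passed to a shell.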

Quick Start

import lamindb as ln

ln.track()  # Start lineage tracking

# Save an artifact
import pandas as pd
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")

# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")

ln.finish()

Core API

1. Artifacts — Data Objects

Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).

import lamindb as ln
import pandas as pd
import anndata as ad

ln.track()

# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
df_artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {df_artifact.uid}, Version: {df_artifact.version}")

# From AnnData
adata = ad.read_h5ad("counts.h5ad")
adata_artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()

# From file path
fig_artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()

# Load back
df_loaded = df_artifact.load()   # Returns the in-memory object (here a DataFrame)
path = fig_artifact.cache()      # Returns local file path

# Versioning: save changed data under the same key as a new version
df_updated = df.assign(value=df["value"] * 2)  # Example change so the content differs
df_artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=df_artifact).save()
print(f"v1: {df_artifact.uid}, v2: {df_artifact_v2.uid}")
print(f"Latest version: {df_artifact_v2.is_latest}")

# Delete (archive first, then permanent)
fig_artifact.delete(permanent=False)  # Archive
# fig_artifact.delete(permanent=True)  # Permanent deletion

2. Lineage Tracking

Automatic provenance capture for reproducibility.

import lamindb as ln

# Start tracking — captures notebook/script, environment, user
ln.track(params={"method": "PCA", "n_components": 50})

# All artifacts created within this block are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()

# ... analysis code ...

output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()

# View lineage graph
output.view_lineage()

ln.finish()  # Finalize tracking
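
The provenance recorded by tracking can be pictured as a small graph: each run reads input artifacts and produces output artifacts. The following is a toy, in-memory sketch of that idea in plain Python. The classes `ToyRun` and `ToyArtifact`, the function `upstream_keys`, and the script names are hypothetical illustrations, not part of the LaminDB API or a description of how it stores lineage.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyRun:
    """One execution of a piece of code (hypothetical stand-in for a run)."""
    transform: str                                   # e.g. a script or notebook name
    params: dict = field(default_factory=dict)
    inputs: List["ToyArtifact"] = field(default_factory=list)


@dataclass
class ToyArtifact:
    """A data object that remembers which run produced it."""
    key: str
    run: Optional[ToyRun] = None                     # None: no recorded producer


def upstream_keys(artifact: ToyArtifact) -> List[str]:
    """Walk back through producing runs and collect every ancestor key."""
    seen: List[str] = []
    stack = [artifact]
    while stack:
        current = stack.pop()
        if current.run is None:
            continue
        for parent in current.run.inputs:
            if parent.key not in seen:
                seen.append(parent.key)
                stack.append(parent)
    return seen


raw = ToyArtifact(key="raw/counts.h5ad")
pca_run = ToyRun("pca.py", params={"method": "PCA", "n_components": 50}, inputs=[raw])
pca = ToyArtifact(key="processed/pca.h5ad", run=pca_run)
cluster_run = ToyRun("cluster.py", inputs=[pca])
clusters = ToyArtifact(key="processed/clusters.parquet", run=cluster_run)

print(upstream_keys(clusters))
```

Answering "which data and code led to this file" then amounts to a traversal like `upstream_keys`, which is the question a lineage view is meant to answer.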

3. Querying and Filtering

Search and filter artifacts by metadata, features, and annotations.

import lamindb as ln

# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")

# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1000000
).df()

# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()

# Single record retrieval
artifact = ln.Artifact.get(key="results/final.parquet")  # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()  # Returns None if missing

# Full-text search
results = ln.Artifact.search("gene expression PBMC")

# Streaming large files (without full load into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()  # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]
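
The keyword arguments in the filters above follow the Django-style `field__lookup` pattern (`key__startswith`, `size__gt`, `created_at__gte`). As a rough mental model, each suffix names a comparison applied to that field, and all conditions must hold. The sketch below mimics this on a list of plain dicts; `toy_filter`, `LOOKUPS`, and the sample records are hypothetical and do not touch a LaminDB instance.

```python
import operator

# Lookup suffixes used in this document, mapped to plain Python predicates
LOOKUPS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "startswith": lambda value, prefix: value.startswith(prefix),
}


def toy_filter(records, **conditions):
    """Return the dicts that satisfy every `field__lookup=value` condition."""
    matches = []
    for record in records:
        ok = True
        for expression, expected in conditions.items():
            # "size__gt" -> ("size", "gt"); a bare "key" means an exact match
            field_name, _, lookup = expression.partition("__")
            compare = LOOKUPS[lookup or "exact"]
            if not compare(record[field_name], expected):
                ok = False
                break
        if ok:
            matches.append(record)
    return matches


records = [
    {"key": "scrna/batch1.h5ad", "size": 2_500_000},
    {"key": "scrna/batch2.h5ad", "size": 800_000},
    {"key": "figures/fig1.png", "size": 1_200_000},
]
large_scrna = toy_filter(records, key__startswith="scrna/", size__gt=1_000_000)
print([r["key"] for r in large_scrna])
```

In a real instance the same expressions are translated into database queries rather than evaluated in Python, so only matching rows are fetched.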

4. Annotation and Validation

Curate datasets against schemas and ontology terms.

import lamindb as ln
import bionty as bt

# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})

# Load the data to curate
adata = artifact.load()

# Standardize cell type names using ontology (before validating)
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# Validate with schema
# `schema` is assumed to be an ln.Schema for this dataset, defined elsewhere
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")

5. Biological Ontologies (Bionty)

Access standardized biological vocabularies for annotation.

import bionty as bt

# Available ontologies
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBItaxon)

# Import and search ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.head())

# Get specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")

# Explore hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")

# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
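
A boolean result like the one above is typically used to separate recognized terms from those that still need curation. The sketch below shows that step in plain Python. `KNOWN_CELL_TYPES` is a hypothetical hard-coded stand-in for the ontology, and `validate_terms` is an illustrative helper, not a Bionty function.

```python
KNOWN_CELL_TYPES = {"T cell", "B cell"}  # hypothetical stand-in for an ontology


def validate_terms(terms, vocabulary):
    """Return one boolean per term: True if the term is in the vocabulary."""
    return [term in vocabulary for term in terms]


terms = ["T cell", "B cell", "Unknown_type"]
mask = validate_terms(terms, KNOWN_CELL_TYPES)

# Split the input into accepted terms and terms needing manual curation
accepted = [t for t, ok in zip(terms, mask) if ok]
needs_curation = [t for t, ok in zip(terms, mask) if not ok]

print(mask)
print(needs_curation)
```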

6. Collections and Organization

Group related artifacts for batch operations.

import lamindb as ln

# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.n_objects} artifacts")

# Query collection
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")

# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
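
The key convention above is easy to enforce with a pair of small helpers. This is a minimal sketch in plain Python: `KeyParts`, `build_key`, and `parse_key` are hypothetical names, and the four-part layout is simply the project/experiment/datatype/file convention described above.

```python
from typing import NamedTuple


class KeyParts(NamedTuple):
    project: str
    experiment: str
    datatype: str
    filename: str


def build_key(project: str, experiment: str, datatype: str, filename: str) -> str:
    """Join the four parts into a key, rejecting empty parts or embedded slashes."""
    parts = KeyParts(project, experiment, datatype, filename)
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return "/".join(parts)


def parse_key(key: str) -> KeyParts:
    """Split a key back into its four parts."""
    pieces = key.split("/")
    if len(pieces) != 4:
        raise ValueError(f"expected project/experiment/datatype/file, got {key!r}")
    return KeyParts(*pieces)


key = build_key("immunology", "exp42", "scrna", "counts.h5ad")
print(key)
print(parse_key(key).experiment)
```

Building keys through one function keeps prefixes consistent, which is what makes prefix queries such as `key__startswith` reliable across experiments.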

Key Concepts

Core Entity Model

Entity	Purpose	Example
Artifact	Versioned data object	`counts.h5ad`, `results.parquet`
Run	Single code execution	Notebook run, script execution
Transform	Code definition (notebook, script, pipeline)	`analysis.ipynb`
Feature	Typed metadata field	`tissue`, `condition`, `batch`
Collection	Group of related artifacts	"Experiment batches"
ULabel	Universal label for custom categorization	"high_quality", "pilot"

Data Types Supported

Format	Method	Use Case
DataFrame	`Artifact.from_df()`	Tabular data, metadata tables
AnnData	`Artifact.from_anndata()`	Single-cell data
MuData	`Artifact.from_mudata()`	Multi-modal data
Any file	`Artifact("path")`	Images, FASTQ, custom formats
Zarr	Via zarr extra	Large
