CZ CELLxGENE Census
Overview
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
- 61+ million cells from human and mouse
- Standardized metadata (cell types, tissues, diseases, donors)
- Raw gene expression matrices
- Pre-calculated embeddings and statistics
- Integration with PyTorch, scanpy, and other analysis tools
When to Use This Skill
This skill should be used when:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions
Installation and Setup
Install the Census API:
uv pip install cellxgene-census
For machine learning workflows, install additional dependencies:
uv pip install "cellxgene-census[experimental]"
Core Workflow Patterns
1. Opening the Census
Always use the context manager to ensure proper resource cleanup:
import cellxgene_census
# Open latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data
# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data
Key points:
- Use the context manager (with statement) for automatic cleanup
- Specify census_version for reproducible analyses
- The default opens the latest "stable" release
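As a small sketch of the reproducibility point, a script can check that it was given a pinned release tag rather than a floating alias before it opens the Census. The helper name `is_pinned_version` is hypothetical, and the sketch assumes release tags are dates in `YYYY-MM-DD` form (like the `2023-07-25` example above), with `stable` and `latest` treated as floating aliases.

```python
from datetime import datetime

# Aliases that may resolve to a different release over time (assumption).
FLOATING_ALIASES = {"stable", "latest"}


def is_pinned_version(census_version: str) -> bool:
    """Return True if the tag looks like a dated release such as '2023-07-25'."""
    if census_version in FLOATING_ALIASES:
        return False
    try:
        datetime.strptime(census_version, "%Y-%m-%d")
    except ValueError:
        return False
    return True
```

A pipeline could call this on its configured version and warn when it returns False, since results from a floating alias may not be repeatable later.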
2. Exploring Census Information
Before querying expression data, explore available datasets and metadata.
Access summary information:
# Get summary statistics (stored as label/value rows)
summary = census["census_info"]["summary"].read().concat().to_pandas()
total_cells = summary.set_index("label").loc["total_cell_count", "value"]
print(f"Total cells: {total_cells}")
# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter datasets by title (the datasets table has no disease column)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
Query cell metadata to understand available data:
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue"],
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")
# Count cells by specific tissue within brain
tissue_counts = cell_metadata.groupby("tissue").size()
Important: Always filter for is_primary_data == True to avoid counting duplicate cells unless specifically analyzing duplicates.
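The effect of the is_primary_data filter can be seen on a toy example. The records below are invented stand-ins for rows of cell metadata, not real Census data, and `count_cell_types` is a hypothetical helper; the sketch only illustrates how a re-published copy of a cell inflates a count when the flag is ignored.

```python
from collections import Counter

# Toy records standing in for rows of cell metadata (not real Census data).
cells = [
    {"cell_type": "B cell", "is_primary_data": True},
    {"cell_type": "B cell", "is_primary_data": False},  # same cell, published again
    {"cell_type": "T cell", "is_primary_data": True},
    {"cell_type": "T cell", "is_primary_data": True},
]


def count_cell_types(rows, primary_only=True):
    """Count rows per cell type, optionally keeping only primary records."""
    return Counter(
        row["cell_type"]
        for row in rows
        if row["is_primary_data"] or not primary_only
    )


all_counts = count_cell_types(cells, primary_only=False)
primary_counts = count_cell_types(cells)
```

Here the unfiltered count reports two B cells while the filtered count reports one, which is the kind of double counting the filter is meant to prevent.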
3. Querying Expression Data (Small to Medium Scale)
For queries returning < 100k cells that fit in memory, use get_anndata():
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)
# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
Filter syntax:
- Use obs_value_filter for cell filtering
- Use var_value_filter for gene filtering
- Combine conditions with and, or
- Use in for multiple values: tissue in ['lung', 'liver']
- Select only needed columns with obs_column_names
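Filter strings following the rules above can be assembled programmatically instead of by hand. The sketch below is a minimal, hypothetical helper (`build_value_filter` is not part of the Census API) that joins equality and membership conditions with and; it assumes column names are valid identifiers and does not cover or, comparisons, or nested expressions.

```python
def build_value_filter(conditions: dict) -> str:
    """Build a filter string from {column: value} pairs joined with 'and'.

    A list or tuple value becomes an 'in' test; any other value becomes '=='.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            # repr() quotes strings and leaves booleans and numbers bare.
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


obs_filter = build_value_filter(
    {
        "cell_type": "B cell",
        "tissue_general": ["lung", "liver"],
        "is_primary_data": True,
    }
)
# obs_filter == "cell_type == 'B cell' and tissue_general in ['lung', 'liver'] and is_primary_data == True"
```

The resulting string can be passed wherever the examples above use obs_value_filter or value_filter.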
Getting metadata separately:
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"],
)
# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"],
)
4. Large-Scale Queries (Out-of-Core Processing)
For queries exceeding available RAM, use axis_query() with iterative processing:
import tiledbsoma as soma
# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    ),
)
# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)  # placeholder for your own per-batch function
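One possible shape for a per-batch function is to accumulate totals keyed by the gene coordinate. The sketch below uses plain dicts of lists as stand-ins for the pyarrow.Table columns described above so that it runs without Census access; `accumulate_gene_totals` and the toy values are hypothetical. With real batches, the columns would first be converted, for example with to_pylist() or to_numpy().

```python
from collections import defaultdict


def accumulate_gene_totals(batches):
    """Sum expression values per gene coordinate across COO-style batches."""
    totals = defaultdict(float)
    for batch in batches:
        for value, gene in zip(batch["soma_data"], batch["soma_dim_1"]):
            totals[gene] += value
    return dict(totals)


# Toy batches: dicts of lists standing in for pyarrow.Table columns.
toy_batches = [
    {"soma_data": [2.0, 1.0], "soma_dim_0": [0, 0], "soma_dim_1": [10, 11]},
    {"soma_data": [3.0], "soma_dim_0": [1], "soma_dim_1": [10]},
]
gene_totals = accumulate_gene_totals(toy_batches)
```

Only the running totals are kept between batches, so memory use depends on the number of genes selected rather than the number of cells.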
Computing incremental statistics:
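# Example: Calculate mean expression
n_stored = 0
sum_values = 0.0
iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_stored += len(values)
    sum_values += values.sum()
# X is sparse, so only non-zero entries are stored and counted here
mean_nonzero = sum_values / n_stored
# Mean over every selected cell/gene pair, counting implicit zeros
mean_expression = sum_values / (query.n_obs * query.n_vars)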
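The running-sum pattern above extends to variance by also tracking the sum of squares. Because the expression matrix is sparse and only non-zero entries appear in the batches, this sketch takes the total number of cell/gene pairs as a separate argument and treats every entry that is not stored as zero. The function name and the toy chunks are hypothetical, and the sum-of-squares formula is chosen for brevity; it can lose precision on very large or very skewed data.

```python
from typing import Iterable, Sequence, Tuple


def streaming_mean_var(
    chunks: Iterable[Sequence[float]], n_total: int
) -> Tuple[float, float]:
    """Mean and population variance over n_total entries.

    Entries that do not appear in any chunk are treated as implicit zeros.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    total = 0.0
    total_sq = 0.0
    n_stored = 0
    for chunk in chunks:
        for value in chunk:
            total += value
            total_sq += value * value
        n_stored += len(chunk)
    if n_stored > n_total:
        raise ValueError("more stored values than total entries")
    mean = total / n_total
    variance = total_sq / n_total - mean * mean
    return mean, variance


# Three stored values out of six cell/gene pairs; the other three are zeros.
mean, variance = streaming_mean_var([[1.0, 3.0], [2.0]], n_total=6)
```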
5. Machine Learning with PyTorch
For training models, use the experimental PyTorch integration:
from cellxgene_census.experimental.ml import experiment_dataloader
# num_epochs, model, criterion and optimizer are assumed to be defined elsewhere.
# The dataloader arguments and batch layout have changed between releases;
# check them against the installed cellxgene-census version.
with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels
            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
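Classification losses generally expect integer class indices, so cell type names usually need to be encoded before they reach the loss function. The following is a minimal sketch under that assumption; `build_label_index` and `encode_labels` are hypothetical helpers, and the labels are toy values.

```python
def build_label_index(labels):
    """Map each distinct label to a stable integer index (sorted order)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, label_index):
    """Replace each label with its integer index."""
    return [label_index[label] for label in labels]


# Toy labels standing in for a cell_type column.
batch_labels = ["hepatocyte", "Kupffer cell", "hepatocyte"]
label_index = build_label_index(batch_labels)
encoded = encode_labels(batch_labels, label_index)
```

In practice the index should be built once from all cell types in the query, not per batch, so that the same name always maps to the same integer.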
Train/test splitting:
from cellxgene_census.experimental.ml import ExperimentDataset
# experiment_axis_query is assumed to be an axis query such as the one built in section 4
# Create dataset from experiment
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)
# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42,
)
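To make the idea of a seeded split concrete, the sketch below partitions a list of cell identifiers by fraction using only the standard library. It illustrates the concept and is not a description of how the library performs its split; `split_ids` is a hypothetical helper.

```python
import random


def split_ids(ids, fractions=(0.8, 0.2), seed=42):
    """Shuffle ids with a fixed seed and cut them into consecutive splits."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    splits = []
    start = 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            # The last split takes the remainder so no id is lost to rounding.
            end = len(shuffled)
        else:
            end = start + round(fraction * len(shuffled))
        splits.append(shuffled[start:end])
        start = end
    return splits


train_ids, test_ids = split_ids(range(10))
```

Fixing the seed makes the partition repeatable across runs, which matters when comparing models trained at different times.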
6. Integration with Scanpy
Seamlessly integrate Census data with scanpy workflows:
import scanpy as sc
# Load data from Census
# tissue_general holds coarse labels such as 'brain'; use tissue for specific regions
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue == 'cerebral cortex' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
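For readers new to the first two preprocessing steps, the arithmetic can be sketched for a single cell in plain Python: counts are rescaled so the cell sums to target_sum, then each value is passed through log(1 + x). `normalize_and_log` is a hypothetical helper for illustration and is not how scanpy is implemented internally.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros.
        return [0.0 for _ in counts]
    return [math.log1p(count * target_sum / total) for count in counts]


# A toy cell with three genes and ten counts in total.
cell_counts = [1, 3, 6]
transformed = normalize_and_log(cell_counts)
```

With ten counts in total, the three genes are scaled to 1000, 3000 and 6000 before the log transform, so cells with different sequencing depth end up on a comparable scale.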
7. Multi-Dataset Integration
Query and integrate multiple datasets:
# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []
for tissue in tissues:
adata = cell