Integrating CellxGene Census Single-Cell Data with ENCODE Bulk Experiments
Bridge bulk ENCODE functional genomics data with cell-type-specific expression from the CellxGene Census, the largest unified single-cell RNA-seq atlas, to resolve cell-type contributions to regulatory element activity.
Scientific Rationale
The question: "Which specific cell types within my tissue drive the regulatory signals I see in bulk ENCODE data?"
ENCODE provides deeply sequenced bulk functional genomics (ChIP-seq, ATAC-seq, Hi-C) across hundreds of biosamples. But bulk data from a tissue like "pancreas" is a mixture of acinar cells (~80%), duct cells (~10%), endocrine cells (~5%), and others. An H3K27ac peak in bulk pancreas could be driven by any of these cell types. CellxGene Census provides cell-type-resolved expression data from 50M+ single-cell observations across thousands of datasets, enabling deconvolution of bulk ENCODE signals.
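To make the mixture argument concrete, the sketch below treats a bulk measurement of a marker gene as a weighted sum of cell-type contributions. The cell-type proportions are the approximate pancreas figures quoted above; the per-cell-type expression values are illustrative placeholders, not measured data, and `bulk_contributions` is a hypothetical helper rather than part of any library.

```python
def bulk_contributions(fractions, expression):
    """Return each cell type's share of a bulk signal.

    fractions:  cell type -> proportion of cells in the tissue
    expression: cell type -> mean expression of the gene in that cell type
    """
    weighted = {ct: fractions[ct] * expression[ct] for ct in fractions}
    total = sum(weighted.values())
    if total == 0:
        # Gene is not expressed anywhere; no cell type contributes
        return {ct: 0.0 for ct in weighted}
    return {ct: value / total for ct, value in weighted.items()}


# Proportions from the text; expression values are made up for illustration
fractions = {"acinar": 0.80, "duct": 0.10, "endocrine": 0.05, "other": 0.05}
expression = {"acinar": 1.0, "duct": 1.0, "endocrine": 500.0, "other": 1.0}

shares = bulk_contributions(fractions, expression)
for cell_type, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cell_type}: {share:.1%}")
```

With these placeholder inputs, the endocrine compartment accounts for roughly 96% of the bulk signal despite making up only 5% of cells, which is why a bulk peak near a marker gene cannot be attributed to the most abundant cell type by default.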
The Bulk-to-Single-Cell Bridge
| Bulk ENCODE Signal | Single-Cell Question | CellxGene Answer |
|---|---|---|
| H3K27ac peak near INS gene in pancreas | Which cell type expresses INS? | Beta cells (>500 TPM), not acinar (<1 TPM) |
| ATAC-seq peak in liver near ALB | Is this hepatocyte-specific? | Yes — ALB expressed only in hepatocytes |
| Enhancer active in brain cortex | Neurons or glia? | CellxGene resolves excitatory neurons vs. astrocytes vs. oligodendrocytes |
| Broad H3K27ac domain in blood | Which immune cell type? | Can distinguish T cells, B cells, monocytes, NK cells |
What CellxGene Census Provides
- 50M+ single-cell observations from thousands of published datasets
- Standardized cell ontology (Cell Ontology terms) across all datasets
- Unified gene expression in a consistent format
- Metadata: tissue, disease status, sex, ethnicity, developmental stage
- API access via Python (cellxgene-census) or R (cellxgene.census)
- No authentication required for public data
Key Literature
- Megill et al. 2021 "cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices" (bioRxiv preprint). Describes the CellxGene platform architecture and exploration capabilities. DOI: 10.1101/2021.04.05.438318
- CZ CELLxGENE Discover (Chan Zuckerberg Initiative, 2023). CellxGene Census provides programmatic access to the entire CellxGene data corpus as a single unified dataset. https://cellxgene.cziscience.com/
- Regev et al. 2017 "The Human Cell Atlas" (eLife, ~1,500 citations). The vision paper for comprehensive single-cell reference maps of all human cells. CellxGene Census is the largest realization of this vision. DOI: 10.7554/eLife.27041
- Tabula Sapiens Consortium 2022 "The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans" (Science, ~800 citations). Multi-organ human cell atlas contributing to CellxGene Census. DOI: 10.1126/science.abl4896
- ENCODE Project Consortium 2020 (Nature, ~1,656 citations). The bulk regulatory element catalog that CellxGene single-cell data contextualizes. DOI: 10.1038/s41586-020-2493-4
When to Use This Skill
| Scenario | How CellxGene Helps |
|---|---|
| Bulk ENCODE peak near a gene — which cell type? | Query gene expression by cell type in matching tissue |
| ENCODE enhancer active in tissue X — cell-type-specific? | Check if enhancer target gene is restricted to one cell type |
| Choosing ENCODE cell line as proxy | Verify which primary cell type the cell line best represents |
| Interpreting differential peaks between tissues | Determine if difference is due to cell-type composition |
| Validating ENCODE scATAC-seq findings | Cross-reference with CellxGene scRNA-seq for same cell types |
| Designing follow-up experiments | Identify which cell types to isolate for validation |
Python API Reference
Installation
pip install cellxgene-census
Requires Python 3.8+. The package uses TileDB-SOMA for efficient data access.
Core API Pattern
import cellxgene_census

# Open the Census (reads metadata, does not download all data)
with cellxgene_census.open_soma() as census:
    # Access human data
    human = census["census_data"]["homo_sapiens"]

    # Query specific genes in specific tissues/cell types
    # This is where filtering happens: be specific to control memory
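Before pulling expression values, it can help to check how many cells of each type a tissue filter would return. The following is a minimal sketch, assuming the `obs` dataframe exposes `tissue_general`, `cell_type`, and `is_primary_data` columns as in current Census releases. `fetch_cell_type_labels` and `tally_cell_types` are hypothetical helper names, and the `is_primary_data == True` clause is an addition intended to avoid counting cells that appear in more than one dataset.

```python
from collections import Counter


def tally_cell_types(cell_type_labels):
    """Count cells per cell type label, most abundant first."""
    return Counter(cell_type_labels).most_common()


def fetch_cell_type_labels(tissue_general):
    """Read only the cell_type column of obs for one tissue (needs network)."""
    import cellxgene_census  # imported lazily so tally_cell_types works offline

    with cellxgene_census.open_soma() as census:
        human = census["census_data"]["homo_sapiens"]
        obs = (
            human.obs.read(
                value_filter=(
                    f"tissue_general == '{tissue_general}' "
                    "and is_primary_data == True"
                ),
                column_names=["cell_type"],
            )
            .concat()
            .to_pandas()
        )
    return obs["cell_type"].tolist()


# Offline demonstration with made-up labels
print(tally_cell_types(["acinar cell", "duct cell", "acinar cell"]))
```

Calling `tally_cell_types(fetch_cell_type_labels("pancreas"))` would run the same tally against the live Census; reading a single metadata column this way is much lighter than materializing an expression matrix.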
Step 1: Identify the ENCODE Target Gene
Start from an ENCODE finding — a regulatory element near a gene of interest:
# Find enhancers in pancreas
encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    biosample_type="tissue"
)

# Get peaks
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38"
)
From peaks, identify the nearest gene(s). You need the gene symbol or Ensembl ID.
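One simple way to go from a peak to a candidate gene is to pick the gene whose transcription start site (TSS) lies closest to the peak midpoint. The sketch below assumes a tab-separated BED line and a small TSS lookup table. The coordinates and gene names are toy values, and `parse_bed_line` and `nearest_gene` are hypothetical helpers. Nearest-TSS assignment is a heuristic, since enhancers can regulate genes other than the closest one.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def nearest_gene(peak, tss_by_gene):
    """Return (gene, distance) for the TSS closest to the peak midpoint.

    Returns None when no gene in the table is on the peak's chromosome.
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    candidates = [
        (abs(midpoint - tss), gene)
        for gene, (gene_chrom, tss) in tss_by_gene.items()
        if gene_chrom == chrom
    ]
    if not candidates:
        return None
    distance, gene = min(candidates)
    return gene, distance


# Toy TSS table; real coordinates would come from a GRCh38 gene annotation
toy_tss = {
    "GENE_A": ("chr1", 1500),
    "GENE_B": ("chr1", 900),
    "GENE_C": ("chr2", 1100),
}
peak = parse_bed_line("chr1\t1000\t1200\tpeak_1\t850")
result = nearest_gene(peak, toy_tss)
print(result)
```

For genome-scale peak sets, an interval-aware tool would be a better fit than this linear scan; the sketch only illustrates the logic.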
Step 2: Query CellxGene Census for Cell-Type Expression
Basic Gene Expression Query
import cellxgene_census
import pandas as pd

gene_symbol = "INS"  # Insulin, example for pancreas

with cellxgene_census.open_soma() as census:
    human = census["census_data"]["homo_sapiens"]

    # Get expression for INS in pancreas tissue
    # Use obs_value_filter to restrict to pancreas
    # Use var_value_filter to restrict to the gene
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter=f"feature_name == '{gene_symbol}'",
        obs_value_filter="tissue_general == 'pancreas'",
        obs_column_names=["cell_type", "tissue", "disease", "dataset_id"]
    )

# get_anndata indexes var by position, so label the columns by gene symbol
adata.var_names = adata.var["feature_name"]

# Summarize expression by cell type
expr_by_celltype = adata.to_df().join(adata.obs["cell_type"])
summary = expr_by_celltype.groupby("cell_type", observed=True).agg(
    mean_expr=(gene_symbol, "mean"),
    pct_expressed=(gene_symbol, lambda x: (x > 0).mean() * 100),
    n_cells=(gene_symbol, "count")
).sort_values("mean_expr", ascending=False)

print(summary.head(10))
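To condense a per-cell-type summary into a single specificity score, one option is the tau index, a commonly used measure that is 0 for uniform expression and 1 when a single group carries all of the expression. This is an addition for illustration and not part of the workflow above. The input values are placeholders, and `tau_specificity` and `dominant_cell_type` are hypothetical helper names.

```python
def tau_specificity(mean_expression):
    """Tau index over per-cell-type mean expression values.

    tau = sum(1 - x_i / x_max) / (n - 1), ranging from 0 (uniform)
    to 1 (expressed in a single cell type).
    """
    values = [max(float(v), 0.0) for v in mean_expression]
    if len(values) < 2:
        raise ValueError("tau needs at least two cell types")
    peak = max(values)
    if peak == 0:
        return 0.0  # not expressed anywhere, so no specificity to report
    return sum(1 - v / peak for v in values) / (len(values) - 1)


def dominant_cell_type(mean_by_cell_type):
    """Return the cell type label with the highest mean expression."""
    return max(mean_by_cell_type, key=mean_by_cell_type.get)


# Placeholder means, not real Census output
toy_means = {"beta": 500.0, "acinar": 1.0, "duct": 1.0, "alpha": 1.0}
print(dominant_cell_type(toy_means), round(tau_specificity(toy_means.values()), 3))
```

With the summary table from the query above, the same score could be computed as `tau_specificity(summary["mean_expr"])`. Because the default Census matrix holds raw counts, means across datasets are influenced by sequencing depth, so the score is best read as a rough guide.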
Multi-Gene Query
import cellxgene_census

genes_of_interest = ["INS", "GCG", "SST", "PPY"]  # Islet hormones

with cellxgene_census.open_soma() as census:
    # Combine one equality test per gene into a single filter expression
    gene_filter = " or ".join([f"feature_name == '{g}'" for g in genes_of_interest])
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter=gene_filter,
        obs_value_filter="tissue_general == 'pancreas'",
        obs_column_names=["cell_type", "tissue", "disease"]
    )
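For longer gene lists, the chained `or` expression can be replaced by a list-membership test, which the Census value filter syntax also accepts. The helper below is a small sketch: `build_gene_filter` is a hypothetical name, and the character check is an assumption about what a safe gene symbol looks like, added so that a stray quote cannot break the filter string.

```python
import re

# Assumed pattern for a safe gene symbol: letters, digits, dot, dash, underscore
SAFE_SYMBOL = re.compile(r"^[A-Za-z0-9._-]+$")


def build_gene_filter(gene_symbols):
    """Build a var_value_filter string matching any of the given symbols."""
    if not gene_symbols:
        raise ValueError("at least one gene symbol is required")
    for symbol in gene_symbols:
        if not SAFE_SYMBOL.match(symbol):
            raise ValueError(f"unexpected characters in gene symbol: {symbol!r}")
    quoted = ", ".join(f"'{symbol}'" for symbol in gene_symbols)
    return f"feature_name in [{quoted}]"


gene_filter = build_gene_filter(["INS", "GCG", "SST", "PPY"])
print(gene_filter)
```

The resulting string can be passed as `var_value_filter` in place of the hand-built expression.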
Query by Cell Ontology Term
import cellxgene_census

# More precise than tissue: query specific cell types
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter="feature_name == 'INS'",
        obs_value_filter="cell_type == 'type B pancreatic cell'",  # Cell Ontology term for beta cells
        obs_column_names=["cell_type", "tissue", "disease", "sex"]
    )
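Filtering on the label string requires an exact match, so a misspelled or renamed label silently returns no cells. An alternative is to filter on the ontology identifier, which Census exposes in the `cell_type_ontology_term_id` column (CL:0000169 is the identifier for type B pancreatic cell). The sketch below builds such a filter; `build_cell_type_filter` is a hypothetical helper. Either form matches only cells annotated with that exact term, not cells annotated with more specific or more general terms.

```python
import re

# Cell Ontology identifiers have the form CL: followed by seven digits
CL_ID_PATTERN = re.compile(r"^CL:\d{7}$")


def build_cell_type_filter(cl_term_id, tissue_general=None):
    """Build an obs_value_filter for one Cell Ontology ID, optionally per tissue."""
    if not CL_ID_PATTERN.match(cl_term_id):
        raise ValueError(f"not a Cell Ontology ID: {cl_term_id!r}")
    clauses = [f"cell_type_ontology_term_id == '{cl_term_id}'"]
    if tissue_general is not None:
        clauses.append(f"tissue_general == '{tissue_general}'")
    return " and ".join(clauses)


beta_cell_filter = build_cell_type_filter("CL:0000169", tissue_general="pancreas")
print(beta_cell_filter)
```

The returned string can be passed as `obs_value_filter` to `get_anndata`.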
Step 3: Map CellxGene Cell Types to ENCODE Biosamples
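CellxGene uses Cell Ontology (CL) terms. ENCODE annotates biosamples with term names drawn from several ontologies (UBERON for tissues, CL for cell types, EFO for cell lines), so labels do not always match one-to-one. Key mappings: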
| CellxGene Cell Type (CL term) | ENCODE Biosample | Notes |
|---|---|---|
| type B pancreatic cell | pancreatic beta cell | Beta cells |
| hepatocyte | hepatocyte | Direct match |
| CD4-positive, alpha-beta T cell | CD4+ T cell | ENCODE may have more specific subtypes |
| monocyte | monocyte | Direct match |
| excitatory neuron | neuron | ENCODE may use broader category |
| oligodendrocyte | oligodendrocyte | Direct match |
| fibroblast | fibroblast | Direct match |
| endothelial cell | endothelial cell of umbilical vein (HUVEC) | |