Integrating CellxGene Census Single-Cell Data with ENCODE Bulk Experiments

Bridge bulk ENCODE functional genomics data with cell-type-specific expression from the CellxGene Census, the largest unified single-cell RNA-seq atlas, to resolve cell-type contributions to regulatory element activity.

Scientific Rationale

The question: "Which specific cell types within my tissue drive the regulatory signals I see in bulk ENCODE data?"

ENCODE provides deeply sequenced bulk functional genomics (ChIP-seq, ATAC-seq, Hi-C) across hundreds of biosamples. But bulk data from a tissue like "pancreas" is a mixture of acinar cells (~80%), duct cells (~10%), endocrine cells (~5%), and others. An H3K27ac peak in bulk pancreas could be driven by any of these cell types. CellxGene Census provides cell-type-resolved expression data from 50M+ single-cell observations across thousands of datasets, enabling deconvolution of bulk ENCODE signals.
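
To make the mixture argument concrete, the sketch below treats a bulk signal as a composition-weighted sum of per-cell-type signals. The fractions are the approximate pancreas values quoted above; the per-cell-type signal values are invented purely for illustration, and linear mixing is a simplifying assumption rather than a property of any ENCODE pipeline.

```python
# Approximate pancreas composition from the text; "other" fills the remainder.
fractions = {"acinar": 0.80, "duct": 0.10, "endocrine": 0.05, "other": 0.05}

# Hypothetical per-cell-type signal at one peak (arbitrary units, illustration only).
signal = {"acinar": 0.0, "duct": 0.0, "endocrine": 100.0, "other": 0.0}


def bulk_signal(fractions, signal):
    """Composition-weighted sum, assuming signals mix linearly."""
    return sum(fractions[ct] * signal[ct] for ct in fractions)


def contribution_shares(fractions, signal):
    """Fraction of the bulk signal attributable to each cell type."""
    total = bulk_signal(fractions, signal)
    if total == 0:
        return {ct: 0.0 for ct in fractions}
    return {ct: fractions[ct] * signal[ct] / total for ct in fractions}


bulk = bulk_signal(fractions, signal)
shares = contribution_shares(fractions, signal)
print(f"bulk signal: {bulk:.2f}")
print(f"endocrine share: {shares['endocrine']:.0%}")
```

In this toy case a cell type that makes up about 5% of the tissue accounts for the whole bulk peak, which is the ambiguity that cell-type-resolved expression helps resolve.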

The Bulk-to-Single-Cell Bridge

Bulk ENCODE Signal	Single-Cell Question	CellxGene Answer
H3K27ac peak near INS gene in pancreas	Which cell type expresses INS?	Beta cells (very high INS counts), not acinar cells (near zero)
ATAC-seq peak in liver near ALB	Is this hepatocyte-specific?	Yes — ALB expressed only in hepatocytes
Enhancer active in brain cortex	Neurons or glia?	CellxGene resolves excitatory neurons vs. astrocytes vs. oligodendrocytes
Broad H3K27ac domain in blood	Which immune cell type?	Can distinguish T cells, B cells, monocytes, NK cells

What CellxGene Census Provides

50M+ single-cell observations from thousands of published datasets
Standardized cell ontology (Cell Ontology terms) across all datasets
Unified gene expression in a consistent format
Metadata: tissue, disease status, sex, ethnicity, developmental stage
API access via Python (cellxgene-census) or R (cellxgene.census)
No authentication required for public data

Key Literature

Megill et al. 2021 "cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices" (bioRxiv preprint). Describes the CellxGene platform architecture and exploration capabilities. DOI: 10.1101/2021.04.05.438318
CZ CELLxGENE Discover (Chan Zuckerberg Initiative, 2023). CellxGene Census provides programmatic access to the entire CellxGene data corpus as a single unified dataset. https://cellxgene.cziscience.com/
Regev et al. 2017 "The Human Cell Atlas" (eLife, ~1,500 citations). The vision paper for comprehensive single-cell reference maps of all human cells. CellxGene Census is the largest realization of this vision. DOI: 10.7554/eLife.27041
Tabula Sapiens Consortium 2022 "The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans" (Science, ~800 citations). Multi-organ human cell atlas contributing to CellxGene Census. DOI: 10.1126/science.abl4896
ENCODE Project Consortium 2020 (Nature, ~1,656 citations). The bulk regulatory element catalog that CellxGene single-cell data contextualizes. DOI: 10.1038/s41586-020-2493-4

When to Use This Skill

Scenario	How CellxGene Helps
Bulk ENCODE peak near a gene — which cell type?	Query gene expression by cell type in matching tissue
ENCODE enhancer active in tissue X — cell-type-specific?	Check if enhancer target gene is restricted to one cell type
Choosing ENCODE cell line as proxy	Verify which primary cell type the cell line best represents
Interpreting differential peaks between tissues	Determine if difference is due to cell-type composition
Validating ENCODE scATAC-seq findings	Cross-reference with CellxGene scRNA-seq for same cell types
Designing follow-up experiments	Identify which cell types to isolate for validation

Python API Reference

Installation

pip install cellxgene-census

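Requires Python 3.8 or later per this guide; newer releases may need a more recent interpreter, so check the package's PyPI page before installing. The package uses TileDB-SOMA for efficient data access.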

Core API Pattern

import cellxgene_census

# Open the Census (reads metadata, does not download all data)
with cellxgene_census.open_soma() as census:
    # Access human data
    human = census["census_data"]["homo_sapiens"]

    # Query specific genes in specific tissues/cell types
    # This is where filtering happens — be specific to control memory
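
Since the filter strings control how much data is pulled into memory, it can help to build them in one place. The following is a minimal sketch: `value_filter` and `all_of` are hypothetical helper names, not part of cellxgene-census, and they only produce strings intended for the `obs_value_filter` and `var_value_filter` arguments. The `is_primary_data` condition assumes you want to exclude cells that appear in more than one dataset.

```python
def value_filter(column, values):
    """Build a filter expression matching `column` against one or more values."""
    if isinstance(values, str):
        values = [values]  # treat a bare string as a single value
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    if len(values) == 1:
        return f"{column} == {values[0]!r}"
    return f"{column} in {values!r}"


def all_of(*conditions):
    """Join simple conditions with `and`, skipping empty ones."""
    return " and ".join(c for c in conditions if c)


obs_filter = all_of(
    value_filter("tissue_general", "pancreas"),
    "is_primary_data == True",  # drop cells duplicated across datasets
)
var_filter = value_filter("feature_name", ["INS", "GCG"])

print(obs_filter)
print(var_filter)
```

The resulting strings can be passed unchanged to `cellxgene_census.get_anndata`.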

Step 1: Identify the ENCODE Target Gene

Start from an ENCODE finding — a regulatory element near a gene of interest:

# Find enhancers in pancreas
encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    biosample_type="tissue"
)

# Get peaks
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38"
)

From peaks, identify the nearest gene(s). You need the gene symbol or Ensembl ID.
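
One simple way to go from peaks to candidate genes is nearest-TSS assignment. The sketch below assumes peaks are available as (chrom, start, end) tuples and the annotation as (chrom, TSS position, symbol) tuples; the coordinates and gene names are toy values, and `index_tss` and `nearest_gene` are hypothetical helpers. Nearest-gene assignment is a heuristic, since an enhancer may regulate a more distal gene.

```python
from bisect import bisect_left
from collections import defaultdict


def index_tss(genes):
    """Group (chrom, tss, symbol) records by chromosome, sorted by TSS position."""
    by_chrom = defaultdict(list)
    for chrom, tss, symbol in genes:
        by_chrom[chrom].append((tss, symbol))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([t for t, _ in entries], [s for _, s in entries])
    return index


def nearest_gene(index, chrom, start, end):
    """Return (symbol, distance) for the TSS closest to the peak midpoint, or None."""
    if chrom not in index:
        return None
    positions, symbols = index[chrom]
    mid = (start + end) // 2
    i = bisect_left(positions, mid)
    # Only the TSS just left and just right of the midpoint can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(positions)]
    best = min(candidates, key=lambda j: abs(positions[j] - mid))
    return symbols[best], abs(positions[best] - mid)


# Toy annotation and peaks (illustrative values only).
tss_index = index_tss([
    ("chr1", 1000, "GENE_A"),
    ("chr1", 5000, "GENE_B"),
    ("chr2", 300, "GENE_C"),
])
for peak in [("chr1", 4200, 4400), ("chr1", 900, 1100), ("chrX", 10, 20)]:
    print(peak, "->", nearest_gene(tss_index, *peak))
```

The gene symbols returned here are what you would pass to the Census query in the next step.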

Step 2: Query CellxGene Census for Cell-Type Expression

Basic Gene Expression Query

import cellxgene_census

gene_symbol = "INS"  # Insulin, used as the pancreas example

with cellxgene_census.open_soma() as census:
    # obs_value_filter restricts cells to pancreas; is_primary_data drops
    # cells that are duplicated across datasets
    # var_value_filter restricts the matrix to the gene of interest
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter=f"feature_name == '{gene_symbol}'",
        obs_value_filter="tissue_general == 'pancreas' and is_primary_data == True",
        obs_column_names=["cell_type", "tissue", "disease", "dataset_id"],
    )

# The returned var index is positional, so label columns by gene symbol
# before looking up the gene by name
adata.var_names = adata.var["feature_name"].astype(str)

# Summarize expression (raw counts) by cell type
expr_by_celltype = adata.to_df().join(adata.obs["cell_type"])
summary = expr_by_celltype.groupby("cell_type", observed=True).agg(
    mean_expr=(gene_symbol, "mean"),
    pct_expressed=(gene_symbol, lambda x: (x > 0).mean() * 100),
    n_cells=(gene_symbol, "count"),
).sort_values("mean_expr", ascending=False)

print(summary.head(10))

Multi-Gene Query

import cellxgene_census

genes_of_interest = ["INS", "GCG", "SST", "PPY"]  # Islet hormones

# `in` with a list literal matches any of the listed gene symbols
gene_filter = f"feature_name in {genes_of_interest}"

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter=gene_filter,
        obs_value_filter="tissue_general == 'pancreas'",
        obs_column_names=["cell_type", "tissue", "disease"],
    )
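
Once a multi-gene query has returned, the per-cell-type summary can be computed for all genes at once. The sketch below runs on a small invented table standing in for `adata.to_df().join(adata.obs["cell_type"])`, assuming the columns have already been labeled by gene symbol; the counts and the choice of cell-type labels are illustrative only.

```python
import pandas as pd

# Toy stand-in for the joined expression table; counts are invented.
expr = pd.DataFrame(
    {
        "INS": [50, 40, 0, 0, 1, 0],
        "GCG": [0, 1, 30, 20, 0, 0],
        "cell_type": [
            "type B pancreatic cell",
            "type B pancreatic cell",
            "pancreatic A cell",
            "pancreatic A cell",
            "pancreatic acinar cell",
            "pancreatic acinar cell",
        ],
    }
)

genes = ["INS", "GCG"]
grouped = expr.groupby("cell_type")[genes]

# Rows are cell types, columns are genes.
mean_expr = grouped.mean()
# Percentage of cells in each cell type with a non-zero count.
pct_expressed = grouped.agg(lambda s: (s > 0).mean() * 100)

# For each gene, the cell type with the highest mean expression.
top_cell_type = mean_expr.idxmax()

print(mean_expr)
print(pct_expressed)
print(top_cell_type)
```

Reading mean expression together with the percentage of expressing cells helps separate a gene that is broadly expressed at low levels from one restricted to a single cell type.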

Query by Cell Ontology Term

# More precise than tissue — query specific cell types
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        var_value_filter="feature_name == 'INS'",
        obs_value_filter="cell_type == 'type B pancreatic cell'",  # Cell Ontology term for beta cells
        obs_column_names=["cell_type", "tissue", "disease", "sex"]
    )

Step 3: Map CellxGene Cell Types to ENCODE Biosamples

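CellxGene labels cells with Cell Ontology (CL) terms. ENCODE annotates biosamples with terms drawn from several ontologies (UBERON, CL, EFO, and CLO), so its biosample names can differ from Census labels in wording or granularity. Key mappings: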

CellxGene Cell Type (CL term)	ENCODE Biosample	Notes
type B pancreatic cell	pancreatic beta cell	Beta cells
hepatocyte	hepatocyte	Direct match
CD4-positive, alpha-beta T cell	CD4+ T cell	ENCODE may have more specific subtypes
monocyte	monocyte	Direct match
excitatory neuron	neuron	ENCODE may use broader category
oligodendrocyte	oligodendrocyte	Direct match
fibroblast	fibroblast	Direct match
endothelial cell	endothelial cell of umbilical vein (HUVEC)
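
The mapping table above can be kept as a small lookup so that Census cell types without an ENCODE counterpart are flagged explicitly. This is a minimal sketch: the dictionary copies the table rows as written, and `to_encode_biosample` and `unmapped` are hypothetical helper names, not part of either project's API.

```python
# CL label (Census `cell_type`) -> ENCODE biosample name, copied from the table above.
CL_TO_ENCODE = {
    "type B pancreatic cell": "pancreatic beta cell",
    "hepatocyte": "hepatocyte",
    "CD4-positive, alpha-beta T cell": "CD4+ T cell",
    "monocyte": "monocyte",
    "excitatory neuron": "neuron",
    "oligodendrocyte": "oligodendrocyte",
    "fibroblast": "fibroblast",
    "endothelial cell": "endothelial cell of umbilical vein (HUVEC)",
}


def to_encode_biosample(cell_type):
    """Return the ENCODE biosample name for a CL label, or None if unmapped."""
    return CL_TO_ENCODE.get(cell_type.strip())


def unmapped(cell_types):
    """List distinct CL labels with no entry in the mapping, in first-seen order."""
    missing = []
    for ct in cell_types:
        if to_encode_biosample(ct) is None and ct not in missing:
            missing.append(ct)
    return missing


observed = ["hepatocyte", "pancreatic A cell", "monocyte", "pancreatic A cell"]
print([to_encode_biosample(ct) for ct in observed])
print("no ENCODE mapping:", unmapped(observed))
```

Labels reported as unmapped need a manual decision, for example falling back to a broader ENCODE biosample or tissue.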
