Single-Cell ENCODE Data
When to Use
- User wants to find or analyze single-cell data (scRNA-seq, scATAC-seq, snRNA-seq) from ENCODE
- User asks about "single-cell", "scRNA-seq", "scATAC-seq", "cell type annotation", or "single-nucleus"
- User needs to integrate ENCODE single-cell data with bulk epigenomic profiles
- User wants to identify cell-type-specific regulatory elements from single-cell chromatin accessibility
- Example queries: "find scRNA-seq data in ENCODE for brain", "what snATAC-seq is available?", "integrate single-cell with bulk ChIP-seq"
Help the user find and work with ENCODE single-cell genomics data, understand quality limitations relative to bulk assays, and integrate single-cell with bulk ENCODE profiles for cell-type-resolved regulatory analysis.
Literature Foundation
| # | Reference | Key Contribution |
|---|---|---|
| 1 | Mawla & Huising 2019, Endocrinology, DOI:10.1210/en.2018-01037 (~200 cit) | Cross-study scRNA-seq meta-analysis revealing that only ~1-2% of heterogeneity-driving genes replicate across studies; TIN-based quality assessment; detection-limit awareness framework. PMC6609986. |
| 2 | Regev et al. 2017, eLife, DOI:10.7554/eLife.27041 (~1,200 cit) | Human Cell Atlas white paper defining the vision for comprehensive single-cell reference maps of all human cells. Establishes community standards for cell atlas construction. |
| 3 | Stuart et al. 2019, Cell, DOI:10.1016/j.cell.2019.05.031 (~7,000 cit) | Seurat v3 — CCA-based anchor identification for cross-dataset integration. The most widely used scRNA-seq integration framework. |
| 4 | Luecken & Theis 2019, Mol Syst Biol, DOI:10.15252/msb.20188746 (~1,500 cit) | Current best practices for scRNA-seq analysis: QC, normalization, batch correction, feature selection, dimensionality reduction, clustering, and differential expression. |
| 5 | Buenrostro et al. 2015, Nature, DOI:10.1038/nature14590 (~1,800 cit) | Single-cell ATAC-seq method. Established that individual cells yield the same nucleosomal fragment size ladder as bulk ATAC-seq, enabling chromatin accessibility profiling at single-cell resolution. |
| 6 | Granja et al. 2021, Nat Genet, DOI:10.1038/s41588-021-00790-6 (~1,000 cit) | ArchR — scalable framework for scATAC-seq analysis including peak calling, gene activity scoring, trajectory inference, and integration with scRNA-seq. |
| 7 | Luecken et al. 2022, Nat Methods, DOI:10.1038/s41592-021-01336-8 (~800 cit) | Benchmarking atlas-level integration methods across tasks, metrics, and scalability. Establishes evaluation framework (kBET, LISI, ARI, NMI) for comparing integration quality. |
| 8 | Hao et al. 2021, Cell, DOI:10.1016/j.cell.2021.04.048 (~5,000 cit) | Seurat v4 — weighted nearest neighbors (WNN) for multimodal integration of RNA + ATAC (or CITE-seq). Defines the standard for joint profiling analysis. |
| 9 | ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3; registry of candidate cis-regulatory elements (cCREs) providing the bulk reference against which single-cell data can be compared. |
Available Single-Cell Assays in ENCODE
| Assay | What It Measures | Key Outputs | Typical Files in ENCODE |
|---|---|---|---|
| scRNA-seq | Single-cell gene expression | Cell-type-specific transcriptomes | FASTQ, gene quantifications (TSV), filtered count matrices, h5ad |
| scATAC-seq | Single-cell chromatin accessibility | Cell-type-specific regulatory elements | FASTQ, fragments (TSV), aggregate peaks (BED), cell-barcode assignments |
Step 1: Search for Single-Cell Data in ENCODE
Search for scRNA-seq and scATAC-seq experiments in the tissue of interest:
# Single-cell RNA-seq
encode_search_experiments(
    assay_title="scRNA-seq",
    organ="pancreas",  # user's tissue of interest
    biosample_type="tissue",
    limit=50
)
# Single-cell ATAC-seq (searched here under the snATAC-seq assay title)
encode_search_experiments(
    assay_title="snATAC-seq",
    organ="pancreas",
    biosample_type="tissue",
    limit=50
)
If no results, try broader search terms:
encode_search_experiments(search_term="single cell RNA", organ="pancreas", limit=50)
encode_search_experiments(search_term="single cell ATAC", organ="pancreas", limit=50)
To see which organs have single-cell data before running a search, check the facets:
encode_get_facets(assay_title="scRNA-seq")
encode_get_facets(assay_title="snATAC-seq")
Present a summary to the user showing:
- Number of scRNA-seq and scATAC-seq experiments found
- Organs/tissues represented
- Platforms used (10X Chromium, Smart-seq2, Drop-seq)
- Labs contributing data
- Number of unique donors/biosamples
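The summary above can be assembled with a small helper. This is a minimal sketch that assumes the search results have already been converted to a list of dictionaries; the key names (`assay_title`, `organ`, `platform`, `lab`, `donor`) are hypothetical and should be mapped to whatever fields the search tool actually returns.

```python
from collections import Counter


def summarize_experiments(experiments):
    """Summarize experiment records for presentation to the user.

    `experiments` is a list of dicts. The keys read here are assumed
    names for illustration, not a documented response schema.
    """
    def unique(key):
        # Sorted unique non-empty values for one field.
        return sorted({e[key] for e in experiments if e.get(key)})

    return {
        "n_by_assay": Counter(e.get("assay_title", "unknown") for e in experiments),
        "organs": unique("organ"),
        "platforms": unique("platform"),
        "labs": unique("lab"),
        "n_donors": len(unique("donor")),
    }
```

Records with a missing field are skipped for that field instead of raising an error, so partial metadata still produces a usable summary.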
Step 2: Understand ENCODE Single-Cell Data Structure
scRNA-seq Files
Use encode_list_files to see what is available per experiment:
encode_list_files(
    experiment_accession="ENCSR...",
    assembly="GRCh38",
    preferred_default=True
)
Typical file hierarchy:
- FASTQ (output_type="reads"): Raw sequencing reads with cell barcodes and UMIs
- Gene quantifications (output_type="gene quantifications", format TSV): Count matrices (genes x cells) after ENCODE uniform pipeline processing
- Filtered counts (output_type="filtered feature barcode matrix"): Post-QC cell-filtered matrices ready for analysis
- h5ad: AnnData format when available (convenient for Scanpy workflows)
scATAC-seq Files
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    assembly="GRCh38"
)
Typical file hierarchy:
- FASTQ (output_type="reads"): Raw reads with cell barcodes
- Fragments (TSV/BED): Fragment files with cell-barcode assignments; the primary input for ArchR/Signac
- Peaks (BED narrowPeak): Aggregate peak calls across all cells (pseudo-bulk)
- Cell assignments: Barcode-to-cluster or barcode-to-cell-type mapping files
Key difference: scATAC-seq data is extremely sparse at the single-cell level. Most analyses operate on the fragment file, not on per-cell peak calls.
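To illustrate working from the fragment file directly, the sketch below counts fragments per cell barcode. It assumes the common 10x-style layout (tab-separated `chrom`, `start`, `end`, `barcode`, `duplicate_count`, with optional `#` header lines); confirm the column order of the ENCODE file you download before relying on it.

```python
import gzip
from collections import Counter


def fragments_per_barcode(path):
    """Count fragment records per cell barcode.

    Assumes column 4 holds the barcode and that each line is one
    deduplicated fragment (10x-style layout; an assumption here).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[3]] += 1
    return counts
```

The resulting counts can be compared against the per-cell fragment thresholds used for quality assessment.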
ENCODE Blacklist filtering (required for scATAC-seq): Before any downstream analysis of scATAC-seq peaks or fragments, remove reads/peaks overlapping ENCODE Blacklist regions (Amemiya et al. 2019, Scientific Reports, 1,372 citations). These regions produce artifactual signal in chromatin accessibility assays and inflate per-cell quality metrics (TSS enrichment, FRiP). Both ArchR and Signac apply blacklist filtering by default when provided, but verify it is active. Download blacklists from Boyle-Lab/Blacklist:
- Human GRCh38: hg38-blacklist.v2.bed.gz
- Mouse mm10: mm10-blacklist.v2.bed.gz
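When verifying that blacklist filtering is active, an independent overlap check can help. The following is a minimal pure-Python sketch, not the ArchR or Signac implementation: it assumes blacklist regions and fragments are half-open BED-style `(chrom, start, end)` intervals already loaded into memory, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict


def build_blacklist_index(regions):
    """Merge overlapping blacklist regions per chromosome for fast lookup."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Extend the previous interval when they touch or overlap.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps_blacklist(index, chrom, start, end):
    """True if the half-open interval [start, end) hits any blacklist region."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Rightmost merged region that begins before the query ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start
```

Fragments or peaks for which `overlaps_blacklist` returns `True` would be dropped before computing TSS enrichment or FRiP.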
Step 3: Assess Single-Cell Quality (ENCODE-Specific Considerations)
Check experiment-level quality:
encode_get_experiment(accession="ENCSR...")
Quality Metrics for scRNA-seq
| Metric | 10X Chromium | Smart-seq2 | Red Flag |
|---|---|---|---|
| Genes per cell (median) | 1,500-4,000 | 4,000-8,000 | <500 |
| UMIs per cell (median) | 3,000-15,000 | N/A (no UMIs) | <1,000 |
| Mitochondrial % | <10-15% | <10-15% | >25% |
| Doublet rate (estimated) | 2-8% (cell-count dependent) | <2% (plate-based) | >10% |
| Mapping rate | >80% | >80% | <60% |
| Saturation | >40% | N/A | <20% |
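The red-flag column above can be applied programmatically. This sketch encodes those thresholds and reports which experiment-level metrics cross them; the metric key names are hypothetical and need to be mapped to the fields in your own QC report.

```python
# Red-flag thresholds copied from the scRNA-seq table above.
# Key names are illustrative assumptions, not ENCODE field names.
SCRNA_RED_FLAGS = {
    "median_genes_per_cell": ("below", 500),
    "median_umis_per_cell": ("below", 1000),
    "mito_percent": ("above", 25.0),
    "doublet_rate_percent": ("above", 10.0),
    "mapping_rate_percent": ("below", 60.0),
    "saturation_percent": ("below", 20.0),
}


def scrna_red_flags(metrics):
    """Return the names of metrics that cross a red-flag threshold.

    Metrics that are missing or None (for example UMIs and saturation
    for Smart-seq2) are skipped and never flagged.
    """
    flagged = []
    for name, (direction, threshold) in SCRNA_RED_FLAGS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if direction == "below" and value < threshold:
            flagged.append(name)
        elif direction == "above" and value > threshold:
            flagged.append(name)
    return flagged
```

An empty list means no red flags were hit; it does not by itself mean the metrics fall inside the typical platform ranges.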
Quality Metrics for scATAC-seq
| Metric | Acceptable | Red Flag |
|---|---|---|
| Unique fragments per cell | >3,000 | <1,000 |
| TSS enrichment per cell | >5 | <2 |
| Fraction in peaks | >20% | <10% |
| Nucleosomal banding | Clear mono/di/tri pattern | Absent or noisy |
| Doublet rate | <5% | >10% |
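For per-cell filtering, the numeric rows of this table can be turned into a three-way classification. The sketch below is one possible reading of the table, with assumed metric names; it leaves out nucleosomal banding (qualitative) and doublet rate (a sample-level estimate).

```python
# (acceptable_above, red_flag_below) per metric, from the scATAC-seq table above.
# Metric names are illustrative; fraction_in_peaks is expressed as 0-1.
SCATAC_THRESHOLDS = {
    "unique_fragments": (3000, 1000),
    "tss_enrichment": (5.0, 2.0),
    "fraction_in_peaks": (0.20, 0.10),
}


def classify_cell(metrics):
    """Classify one cell as 'acceptable', 'borderline', or 'red_flag'."""
    status = "acceptable"
    for name, (acceptable_above, red_flag_below) in SCATAC_THRESHOLDS.items():
        value = metrics[name]
        if value < red_flag_below:
            # Any single red flag is enough to reject the cell.
            return "red_flag"
        if value <= acceptable_above:
            status = "borderline"
    return status
```

Cells labeled `borderline` sit between the two columns of the table and are candidates for manual review or dataset-specific cutoffs.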
ENCODE Audit Flags
Apply the same audit hierarchy as bulk data:
- ERROR: Avoid unless no alternative
- NOT_COMPLIANT: Usable with caveats
- WARNING: Generally safe; document
- INTERNAL_ACTION: DCC processing notes; usually not a concern
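The hierarchy above can be applied as a simple triage step. This sketch assumes the audits for an experiment are available as a dictionary mapping each audit level to a list of audit records; that shape is an assumption and may differ from what `encode_get_experiment` returns.

```python
# Ordered from most to least severe, with the guidance listed above.
AUDIT_ADVICE = [
    ("ERROR", "avoid unless no alternative"),
    ("NOT_COMPLIANT", "usable with caveats"),
    ("WARNING", "generally safe; document"),
    ("INTERNAL_ACTION", "DCC processing note; usually not a concern"),
]


def triage_audits(audit):
    """Return (worst_level, advice) for an experiment's audit dictionary.

    `audit` maps a level name to a list of audit records (assumed shape).
    """
    for level, advice in AUDIT_ADVICE:
        if audit.get(level):
            return level, advice
    return None, "no audit flags"
```

Only the most severe level present determines the advice, so an experiment with both WARNING and NOT_COMPLIANT audits is treated as NOT_COMPLIANT.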
Track passing experiments:
encode_track_experime