Compare ENCODE Data Across Biosamples
When to Use
- User wants to compare ENCODE experiments across different tissues, cell lines, or biosamples
- User asks about "tissue comparison", "cell-type differences", "tissue-specific enhancers", or "cross-tissue"
- User needs to identify constitutive vs tissue-specific regulatory elements
- User wants to map data availability across multiple biosamples before integrative analysis
- Example queries: "compare H3K27ac between liver and pancreas", "what marks are tissue-specific?", "find constitutive promoters across all tissues"
Help the user systematically compare data availability and experiments across different biosamples to identify tissue-specific regulatory patterns, constitutive elements, and cross-tissue differences.
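As a rough illustration of mapping data availability before a comparison, the sketch below tabulates experiments per biosample and lists the assay/target combinations that every biosample has. The record layout (`biosample`, `assay`, `target` keys) is a simplified, hypothetical stand-in for search results and is not the actual ENCODE response schema.

```python
from collections import defaultdict


def availability_matrix(records):
    """Count experiments per biosample and per assay/target column.

    `records` is a list of dicts with hypothetical keys
    'biosample', 'assay', and 'target' (target may be None).
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for rec in records:
        target = rec.get("target")
        column = f"{rec['assay']}:{target}" if target else rec["assay"]
        matrix[rec["biosample"]][column] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {biosample: dict(cols) for biosample, cols in matrix.items()}


def shared_columns(matrix):
    """Return the assay/target columns present in every biosample."""
    column_sets = [set(cols) for cols in matrix.values()]
    return sorted(set.intersection(*column_sets)) if column_sets else []


# Toy records, invented for illustration only.
records = [
    {"biosample": "liver", "assay": "ChIP-seq", "target": "H3K27ac"},
    {"biosample": "liver", "assay": "ATAC-seq", "target": None},
    {"biosample": "pancreas", "assay": "ChIP-seq", "target": "H3K27ac"},
    {"biosample": "pancreas", "assay": "ChIP-seq", "target": "H3K4me3"},
]
matrix = availability_matrix(records)
comparable = shared_columns(matrix)
```

Only the columns in `comparable` can be compared directly across all listed biosamples; anything else is a coverage gap to report to the user.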
Scientific Rationale
Cross-biosample comparison is the foundation of understanding tissue-specific gene regulation. Regulatory elements, particularly enhancers, are the primary drivers of cell-type identity, whereas promoters are largely shared across tissues. Comparing the same assay across multiple biosamples reveals which regulatory elements are constitutive (shared) and which are tissue-specific (unique to one or a few cell types).
The core question: "Which regulatory features distinguish tissue A from tissue B, and which are shared?"
This requires careful matching of datasets, awareness of batch effects, and understanding of the biosample hierarchy to avoid confounding biological signal with technical variation.
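The core question can be illustrated for two peak sets with a minimal sketch that splits peaks into shared and tissue-specific groups by coordinate overlap. It assumes both sets are on the same assembly and represented as `(chrom, start, end)` half-open intervals. The function names are hypothetical, and the quadratic scan is for clarity only; real peak files would normally go through a dedicated interval tool.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_shared_specific(peaks_a, peaks_b):
    """Return (shared_a, only_a, only_b) for two peak lists.

    shared_a: peaks from A that overlap at least one peak in B
    only_a:   peaks from A with no overlap in B
    only_b:   peaks from B with no overlap in A
    """
    shared_a, only_a = [], []
    for p in peaks_a:
        (shared_a if any(overlaps(p, q) for q in peaks_b) else only_a).append(p)
    only_b = [q for q in peaks_b if not any(overlaps(q, p) for p in peaks_a)]
    return shared_a, only_a, only_b


# Toy coordinates, invented for illustration only.
liver = [("chr1", 100, 200), ("chr1", 500, 600)]
pancreas = [("chr1", 150, 250), ("chr2", 500, 600)]
shared, liver_only, pancreas_only = split_shared_specific(liver, pancreas)
```

An overlap of a single base counts as shared here; a stricter reciprocal-overlap rule may be more appropriate depending on peak widths.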
Literature Foundation
| # | Reference | Key Contribution |
|---|---|---|
| 1 | Roadmap Epigenomics Consortium 2015, Nature, DOI:10.1038/nature14248 (~5,810 cit) | Generated 111 reference epigenomes across tissues/cell types; established the framework for cross-tissue epigenomic comparison. Showed that enhancer chromatin states are the most tissue-variable elements. |
| 2 | ENCODE Phase 3 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | Expanded functional annotations to ~1.3M candidate cis-regulatory elements (cCREs) in total across human (~0.9M) and mouse (~0.3M), covering hundreds of biosamples; defined tissue-activity indices for regulatory elements. |
| 3 | Andersson et al. 2014, Nature, DOI:10.1038/nature12787 (~1,500 cit) | FANTOM5 atlas of active enhancers across 808 samples; demonstrated that only ~5% of enhancers are active across all tissues, with the majority being highly tissue-specific. |
| 4 | Heintzman et al. 2009, Nature, DOI:10.1038/nature07829 (~2,200 cit) | Showed histone modifications distinguish cell types: H3K4me1/H3K27ac at enhancers are the most discriminating tissue-specific marks, while H3K4me3 at promoters is largely shared. |
| 5 | Thurman et al. 2012, Nature, DOI:10.1038/nature11232 (~2,000 cit) | Mapped accessible chromatin across 125 cell types; demonstrated that DNase I hypersensitive sites define cell-type identity and that accessibility patterns cluster by tissue of origin. |
| 6 | Leek et al. 2010, Nat Rev Genet, DOI:10.1038/nrg2825 (~1,200 cit) | Comprehensive review of batch effects in genomic data; showed that lab, platform, and processing date can dominate biological variation if not properly controlled. |
| 7 | Forrest et al. 2014, Nature, DOI:10.1038/nature13182 (~1,100 cit) | FANTOM5 promoter-level expression atlas across 975 samples; demonstrated that promoter usage (not just gene expression) is tissue-specific and defines cell identity. |
Tissue-Specific Regulation Principles
Understanding what varies across tissues and what does not is essential before designing a comparison.
What Is Shared vs Tissue-Specific (Heintzman 2009; Andersson 2014)
| Feature | Cross-Tissue Behavior | Implication for Comparison |
|---|---|---|
| Promoters (H3K4me3) | Largely shared (~70% active in most tissues) | Poor discriminators between tissues |
| Enhancers (H3K27ac + H3K4me1) | Highly tissue-specific (~5% shared across all tissues) | Best discriminators; focus comparison here |
| Chromatin accessibility (ATAC/DNase) | Moderate tissue-specificity (~20-30% shared) | Good secondary discriminator; clusters by tissue of origin |
| Polycomb repression (H3K27me3) | Tissue-specific (marks silenced developmental genes) | Useful for identifying repressed lineage programs |
| Gene expression (RNA-seq) | Moderate tissue-specificity | Housekeeping genes shared; tissue-specific TFs are key |
| CTCF binding | Largely constitutive (~70% conserved) | Defines structural boundaries; less tissue-variable |
| DNA methylation | Bimodal; enhancers show tissue-variable methylation | Hypomethylation at active enhancers is tissue-specific |
Key Insight
H3K27ac at enhancers is the single most informative mark for distinguishing tissues (Heintzman et al. 2009, Roadmap 2015). If the user can only compare one mark across tissues, H3K27ac should be the first choice, followed by chromatin accessibility (ATAC-seq or DNase-seq).
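To make the shared versus tissue-specific distinction concrete, the following sketch labels each element by the number of tissues in which it is active. The input layout, the function name, and both thresholds (`constitutive_frac`, `specific_max`) are illustrative assumptions and do not come from ENCODE or the cited papers; the labels are only meaningful when several tissues are compared.

```python
def classify_elements(activity, tissues, constitutive_frac=0.9, specific_max=1):
    """Label elements by how many of `tissues` they are active in.

    activity: dict mapping element id -> set of tissue names with a peak.
    Thresholds are arbitrary illustrative defaults, not published cutoffs.
    """
    tissue_set = set(tissues)
    n = len(tissue_set)
    labels = {}
    for element, active_in in activity.items():
        k = len(active_in & tissue_set)
        if k == 0:
            labels[element] = "inactive"
        elif k <= specific_max:
            labels[element] = "tissue-specific"
        elif k / n >= constitutive_frac:
            labels[element] = "constitutive"
        else:
            labels[element] = "intermediate"
    return labels


# Toy activity calls, invented for illustration only.
tissues = ["liver", "pancreas", "heart", "lung"]
activity = {
    "promoter_1": {"liver", "pancreas", "heart", "lung"},
    "enhancer_1": {"liver"},
    "enhancer_2": {"liver", "pancreas"},
    "enhancer_3": set(),
}
labels = classify_elements(activity, tissues)
```

Running the same classification on H3K27ac and on H3K4me3 peak sets is one way to check whether enhancer-associated marks come out more tissue-specific than promoter-associated ones in the user's own data.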
ENCODE Biosample Hierarchy
| Level | Description | Biological Relevance | Reproducibility | Caveats |
|---|---|---|---|---|
| Tissue | Primary tissue from donor (e.g., pancreas, liver) | Highest: in vivo biology preserved | Lower: donor variation, cell-type heterogeneity | Mixed cell populations; composition varies by donor age/sex/health |
| Primary cell | Cells isolated from tissue (e.g., hepatocytes, islets) | High: enriched for cell type | Moderate: isolation stress, limited passages | Isolation method alters phenotype; culture conditions matter |
| Cell line | Immortalized cells (e.g., K562, HepG2, GM12878) | Lower: transformed phenotype | Highest: clonal, reproducible | May not represent normal tissue biology; passage number matters |
| In vitro differentiated | Cells derived from stem cells (e.g., iPSC-derived cardiomyocytes) | Moderate: model system | Moderate: protocol-dependent | Differentiation efficiency varies; often immature phenotype |
| Organoid | 3D self-organizing structures | Moderate-high: recapitulates tissue architecture | Lower: heterogeneous | Emerging data type in ENCODE; limited coverage |
Tier 1 Cell Lines (Most Comprehensive ENCODE Data)
| Cell Line | Origin | Cancer/Normal | Best For |
|---|---|---|---|
| K562 | Chronic myelogenous leukemia | Cancer | Hematopoietic chromatin, TF binding, 3D genome |
| GM12878 | Lymphoblastoid (EBV-transformed B cells) | Transformed-normal | Immune regulation, 3D genome (Rao et al. 2014 Hi-C reference) |
| H1-hESC | Human embryonic stem cells | Normal | Developmental regulation, bivalent chromatin |
These three cell lines have the most complete multi-omic profiling in ENCODE. They are excellent positive controls for verifying a comparison pipeline before applying it to the user's tissues of interest.
Biosample Comparability Rules
- Same biosample type preferred: Compare tissue-to-tissue, cell line-to-cell line
- Cross-type comparisons require caution: Cell line vs tissue introduces both biological and technical confounders
- Donor matching: When comparing tissues, match for life_stage, sex, and age when possible
- Passage number matters for cell lines: Different passages of the same cell line can diverge epigenomically
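The rules above can be sketched as a simple pre-flight check that reports mismatches between two experiments' biosample metadata. The flat dict layout and the function name are hypothetical; only `biosample_type` and `life_stage` are field names used in this document, and the rest are assumed for illustration.

```python
def comparability_warnings(exp_a, exp_b):
    """Return human-readable warnings for a pair of experiment metadata dicts.

    Each dict is a hypothetical flat summary with keys such as
    'biosample_type', 'life_stage', 'sex', and 'age'.
    """
    warnings = []
    type_a, type_b = exp_a.get("biosample_type"), exp_b.get("biosample_type")
    if type_a != type_b:
        warnings.append(
            f"biosample_type differs ({type_a} vs {type_b}): "
            "biological and technical confounders are mixed"
        )
    # Donor attributes should match where possible.
    for field in ("life_stage", "sex", "age"):
        a, b = exp_a.get(field), exp_b.get(field)
        if a is None or b is None:
            warnings.append(f"{field} missing for at least one experiment")
        elif a != b:
            warnings.append(f"{field} differs ({a} vs {b})")
    return warnings


# Toy metadata, invented for illustration only.
liver = {"biosample_type": "tissue", "life_stage": "adult", "sex": "female", "age": "53"}
hepg2 = {"biosample_type": "cell line", "life_stage": "adult", "sex": "male"}
issues = comparability_warnings(liver, hepg2)
```

An empty list means none of these particular checks failed; it does not rule out other confounders such as lab or passage number.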
Step 1: Define the Comparison Design
Clarify the comparison type with the user. Each design has different requirements:
Comparison Design Patterns
| Design | Description | Required Matching | Key Tools | Best File Types |
|---|---|---|---|---|
| Cross-tissue (same assay) | Same mark/assay in different organs | Same assay, same target, same assembly, same biosample_type | encode_search_experiments, encode_get_facets | IDR thresholded peaks, fold change over control |
| Multi-omic (same tissue) | Multiple assays in one biosample | Same biosample_term_name, same assemb |