Integrative Analysis of ENCODE Data
When to Use
- User wants to combine multiple ENCODE experiments for cross-dataset analysis
- User asks about "integrating", "combining", or "comparing" experiments
- User wants to overlay histone marks with accessibility or expression data
- User needs to plan a multi-omic analysis using ENCODE data
- User asks about peak overlap, differential binding, or signal correlation
- User wants to perform ChromHMM segmentation using ENCODE histone data
Help the user combine multiple ENCODE experiments for cross-dataset or multi-omic analysis. This skill covers the full integration workflow: from defining the question and selecting compatible experiments, through choosing the right integration strategy and tools, to validating results and documenting provenance.
Literature Foundation
| Reference | Journal | Key Contribution | DOI | Citations |
|---|---|---|---|---|
| ENCODE Phase 3 (2020) | Nature | Registry of 926,535 human candidate cis-regulatory elements; integrative analysis framework across 5,992 experiments | 10.1038/s41586-020-2493-4 | ~1,656 |
| Gorkin et al. (2020) | Nature | Integrative analysis of 3,158 mouse epigenomes; cross-tissue chromatin state annotation | 10.1038/s41586-020-2093-3 | ~301 |
| Ernst & Kellis (2012) | Nature Methods | ChromHMM: chromatin state discovery from combinatorial histone mark patterns | 10.1038/nmeth.1906 | ~2,294 |
| Nasser et al. (2021) | Nature | Activity-by-Contact (ABC) model for enhancer-gene linkage; outperforms proximity assignment | 10.1038/s41586-021-03446-x | ~468 |
| Quinlan & Hall (2010) | Bioinformatics | BEDTools: genome arithmetic for interval comparisons, intersections, and merges | 10.1093/bioinformatics/btq033 | ~10,000 |
| Ramírez et al. (2016) | Nucleic Acids Res | deepTools: signal normalization, correlation, and visualization for multi-sample genomic data | 10.1093/nar/gkw257 | ~3,000 |
| Love et al. (2014) | Genome Biology | DESeq2: differential analysis of count data with shrinkage estimation | 10.1186/s13059-014-0550-8 | ~40,000 |
| Ross-Innes et al. (2012) | Nature | DiffBind: differential binding analysis of ChIP-seq peak data across conditions | 10.1038/nature10730 | ~1,200 |
| Leek et al. (2010) | Nature Rev Genetics | Tackling batch effects: PCA-based detection, SVA/ComBat correction, experimental design | 10.1038/nrg2825 | ~1,200 |
Step 1: Define the Integration Question
Clarify with the user which type of integration they need. There are four fundamental designs:
| Integration Design | Example | Key Challenge |
|---|---|---|
| Same assay, cross-sample | H3K27ac ChIP-seq across 5 tissues | Batch effects between labs/donors |
| Multi-omic, same sample | ATAC-seq + RNA-seq + ChIP-seq in K562 | Matching file types and normalization |
| Cross-organism | Human vs mouse liver chromatin | Ortholog mapping, synteny conservation |
| Perturbation / condition | Before vs after treatment | Need matched replicates per condition |
Each design has different requirements for compatibility, normalization, and statistical framework. Establish the design before searching for data.
Questions to ask the user:
- What biological question are you trying to answer?
- Are you comparing across samples (differential) or combining across samples (cataloging)?
- How many conditions/tissues/time points?
- Do you need statistical testing or descriptive overlap?
Step 2: Find Compatible Experiments
2a. Explore Data Availability
Start with encode_get_facets to understand what data exists before committing to a design:
encode_get_facets(
assay_title="Histone ChIP-seq",
organ="pancreas"
)
This returns counts by target, biosample, lab, and other facets. Use it to verify that every arm of the intended comparison has enough experiments before committing to the design.
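The sufficiency check can be sketched in a few lines of Python. This is a minimal illustration, not part of the ENCODE tooling: the shape of the facet result (a mapping from facet value to experiment count), the helper name `arms_with_enough_data`, the threshold, and the counts are all assumptions made for the example.

```python
def arms_with_enough_data(facet_counts, arms, min_experiments=2):
    """Return {arm: (count, ok)} for each arm the integration needs.

    facet_counts: hypothetical mapping of facet value -> experiment count,
                  e.g. the "target" facet of a facets query.
    arms: facet values the planned comparison depends on.
    """
    report = {}
    for arm in arms:
        count = facet_counts.get(arm, 0)  # a missing facet value means no data
        report[arm] = (count, count >= min_experiments)
    return report


# Made-up counts for illustration only, not real ENCODE numbers.
example_target_counts = {"H3K27ac": 6, "H3K4me3": 5, "H3K9me3": 1}
report = arms_with_enough_data(
    example_target_counts, ["H3K27ac", "H3K9me3", "H3K36me3"]
)
for arm, (count, ok) in report.items():
    print(f"{arm}: {count} experiments, {'ok' if ok else 'insufficient'}")
```

Arms reported as insufficient are a signal to revisit the design (a different tissue, mark, or assay) before searching for individual experiments.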
2b. Search for Candidate Experiments
Search for experiments matching each arm of the integration:
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K27ac",
organ="pancreas",
biosample_type="tissue",
limit=100
)
For multi-omic designs, search each assay layer separately:
# Accessibility layer
encode_search_experiments(assay_title="ATAC-seq", organ="pancreas", limit=50)
# Expression layer
encode_search_experiments(assay_title="total RNA-seq", organ="pancreas", limit=50)
# Histone layer
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="pancreas", limit=50)
Present a summary table to the user showing experiments found per arm, number of replicates, labs represented, and any audit flags.
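One way to assemble that summary is sketched below. The record keys (`arm`, `accession`, `replicates`, `lab`, `audit_flags`), the placeholder accessions, and the audit flag text are hypothetical; the fields returned by encode_search_experiments may be named differently and would need to be mapped onto this shape first.

```python
from collections import defaultdict


def summarize_arms(experiments):
    """Group experiment records by arm and summarize counts, labs, and flags."""
    by_arm = defaultdict(list)
    for exp in experiments:
        by_arm[exp["arm"]].append(exp)

    rows = []
    for arm, exps in sorted(by_arm.items()):
        rows.append({
            "arm": arm,
            "experiments": len(exps),
            # The weakest experiment limits what the arm can support.
            "min_replicates": min(e["replicates"] for e in exps),
            "labs": sorted({e["lab"] for e in exps}),
            "flagged": sum(1 for e in exps if e["audit_flags"]),
        })
    return rows


def to_markdown(rows):
    """Render the summary rows as a pipe table."""
    lines = [
        "| Arm | Experiments | Min replicates | Labs | Flagged |",
        "|---|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {r['arm']} | {r['experiments']} | {r['min_replicates']} | "
            f"{', '.join(r['labs'])} | {r['flagged']} |"
        )
    return "\n".join(lines)


# Placeholder records for illustration; accessions and flags are not real.
candidates = [
    {"arm": "H3K27ac", "accession": "ENCSR_A", "replicates": 2,
     "lab": "lab-1", "audit_flags": []},
    {"arm": "H3K27ac", "accession": "ENCSR_B", "replicates": 1,
     "lab": "lab-2", "audit_flags": ["low read depth"]},
    {"arm": "ATAC-seq", "accession": "ENCSR_C", "replicates": 2,
     "lab": "lab-1", "audit_flags": []},
]
rows = summarize_arms(candidates)
print(to_markdown(rows))
```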
Step 3: Check Pairwise Compatibility
Track candidate experiments and then check compatibility:
encode_track_experiment(accession="ENCSR...")
encode_track_experiment(accession="ENCSR...")
encode_compare_experiments(
accession1="ENCSR...",
accession2="ENCSR..."
)
The compatibility check evaluates:
| Dimension | Compatible | Requires Action | Incompatible |
|---|---|---|---|
| Organism | Same species | Cross-species with ortholog mapping | N/A (always addressable) |
| Assembly | Same build (GRCh38) | Different builds (need liftOver) | Mixed within analysis without lifting |
| Assay | Same assay | Different assays (expected in multi-omic) | N/A |
| Biosample | Same term name | Different biosamples (expected in cross-sample) | Unexpected mismatch |
| Lab | Same lab | Different labs (flag for batch effects) | N/A |
| Pipeline | Same version | Different versions (flag, may need reprocessing) | Fundamentally different pipelines |
| Replicates | 2+ biological | 1 replicate (limited statistical power) | 0 replicates (unusable) |
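Critical rule: ALL coordinates entering an integration MUST be on the same genome assembly. If any experiment is only available on a different build (for example hg19), lift its files over to the common assembly (GRCh38) before combining; never mix coordinates from different builds.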
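The decision logic in the compatibility table can be approximated as follows. This sketch is independent of encode_compare_experiments and does not reproduce its output; the metadata keys (`organism`, `assembly`, `lab`, `pipeline_version`, `bio_replicates`) and the example values are assumptions chosen for illustration.

```python
COMPATIBLE = "compatible"
ACTION = "requires action"
INCOMPATIBLE = "incompatible"


def replicate_status(n_biological):
    """Map a biological replicate count to the table's three categories."""
    if n_biological >= 2:
        return COMPATIBLE
    if n_biological == 1:
        return ACTION  # usable, but with limited statistical power
    return INCOMPATIBLE


def check_pair(exp_a, exp_b):
    """Classify a pair of experiment records dimension by dimension."""
    report = {}
    # For these dimensions a mismatch is addressable (ortholog mapping,
    # liftOver, batch-effect handling, reprocessing), so it is flagged
    # rather than rejected. Fundamentally different pipelines cannot be
    # detected from a version string and still need manual review.
    for key in ("organism", "assembly", "lab", "pipeline_version"):
        report[key] = COMPATIBLE if exp_a[key] == exp_b[key] else ACTION
    # The pair is only as strong as its weaker experiment.
    weakest = min(exp_a["bio_replicates"], exp_b["bio_replicates"])
    report["replicates"] = replicate_status(weakest)
    return report


# Hypothetical metadata records, not real ENCODE experiments.
exp_a = {"organism": "Homo sapiens", "assembly": "GRCh38", "lab": "lab-A",
         "pipeline_version": "v2", "bio_replicates": 2}
exp_b = {"organism": "Homo sapiens", "assembly": "hg19", "lab": "lab-B",
         "pipeline_version": "v2", "bio_replicates": 1}

pair_report = check_pair(exp_a, exp_b)
for dimension, status in pair_report.items():
    print(f"{dimension}: {status}")
```

In this example the assembly mismatch is reported as requiring action, which corresponds to lifting the hg19 files over before any coordinates are compared.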
Step 4: Select Matched Files
For each experiment, retrieve files using encode_list_files:
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38",
preferred_default=True
)
File Matching Rules
All files within the same assay layer of an integration MUST be matched on (combinations across layers are covered by the compatibility matrix below):
- Same assembly (GRCh38 for human, mm10 for mouse)
- Same output type (e.g., all "IDR thresholded peaks" or all "fold change over control")
- Same file format (all narrowPeak, all bigWig, all TSV)
- Same pipeline version when possible (check ENCODE pipeline annotations)
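A quick way to enforce these rules before downloading anything is to compare the metadata of the selected files. The sketch below assumes each file is represented as a dict with `assembly`, `output_type`, and `file_format` keys; these key names and the placeholder accessions are illustrative and may not match the fields returned by encode_list_files.

```python
def find_mismatches(files, keys=("assembly", "output_type", "file_format")):
    """Return {key: set_of_values} for every key on which the files disagree."""
    mismatches = {}
    for key in keys:
        values = {f[key] for f in files}
        if len(values) > 1:
            mismatches[key] = values
    return mismatches


# Placeholder file records for illustration; accessions are not real.
selected_files = [
    {"accession": "ENCFF_1", "assembly": "GRCh38",
     "output_type": "IDR thresholded peaks", "file_format": "bed"},
    {"accession": "ENCFF_2", "assembly": "GRCh38",
     "output_type": "IDR thresholded peaks", "file_format": "bed"},
    {"accession": "ENCFF_3", "assembly": "hg19",
     "output_type": "IDR thresholded peaks", "file_format": "bed"},
]

mismatches = find_mismatches(selected_files)
if mismatches:
    for key, values in mismatches.items():
        print(f"Mismatch on {key}: {sorted(values)}")
else:
    print("All selected files match.")
```

An empty result means the file set is consistent on the checked keys; pipeline version can be added to `keys` when that metadata is available.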
File Type Compatibility Matrix
Not all file types can be directly integrated. This matrix shows which combinations are valid:
| File Type A | File Type B | Integration Method | Valid? |
|---|---|---|---|
| narrowPeak | narrowPeak | BEDTools intersect/merge | Yes |
| narrowPeak | broadPeak | BEDTools intersect (with caveats) | Yes, but peak resolution differs |
| narrowPeak | bigWig | Signal extraction at peak locations | Yes |
| bigWig | bigWig | deepTools multiBigwigSummary | Yes |
| bigWig | narrowPeak | Signal quantification within peaks | Yes |
| gene quant TSV | gene quant TSV | DESeq2 count matrix | Yes |
| gene quant TSV | narrowPeak | Gene-centric: peaks near expressed genes | Yes (indirect) |
| contact matrix | narrowPeak | Loops anchored at peaks | Yes (resolution-dependent) |
| narrowPeak | gene quant TSV | Enhancer-gene linkage (ABC model) | Yes (requires Hi-C) |
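To make the first row of the matrix concrete, the sketch below shows what a peak-to-peak overlap computes, in plain Python. It is meant only to illustrate the logic on toy intervals (similar in spirit to asking BEDTools intersect to report each A interval that has any overlap in B); for real peak files BEDTools itself is the appropriate tool. The function name and the example coordinates are made up.

```python
from collections import defaultdict


def overlapping_peaks(peaks_a, peaks_b, min_overlap_bp=1):
    """Return peaks from A that overlap at least one peak in B.

    Peaks are (chrom, start, end) tuples in BED convention:
    0-based start, end exclusive.
    """
    b_by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        b_by_chrom[chrom].append((start, end))

    hits = []
    for chrom, start, end in peaks_a:
        for b_start, b_end in b_by_chrom.get(chrom, []):
            # Shared bases between the two half-open intervals.
            overlap = min(end, b_end) - max(start, b_start)
            if overlap >= min_overlap_bp:
                hits.append((chrom, start, end))
                break  # report each A peak at most once
    return hits


# Toy intervals for illustration only.
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks_b = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 300, 400)]

shared = overlapping_peaks(peaks_a, peaks_b)
print(f"{len(shared)} of {len(peaks_a)} A peaks overlap a B peak: {shared}")
```

Note that the second A peak ends exactly where a B peak starts; under half-open BED coordinates these are adjacent rather than overlapping, so it is not reported.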
Cannot directly combine:
- Raw FASTQ with proces