Integrative Analysis of ENCODE Data

When to Use

User wants to combine multiple ENCODE experiments for cross-dataset analysis
User asks about "integrating", "combining", or "comparing" experiments
User wants to overlay histone marks with accessibility or expression data
User needs to plan a multi-omic analysis using ENCODE data
User asks about peak overlap, differential binding, or signal correlation
User wants to perform ChromHMM segmentation using ENCODE histone data

Help the user combine multiple ENCODE experiments for cross-dataset or multi-omic analysis. This skill covers the full integration workflow: from defining the question and selecting compatible experiments, through choosing the right integration strategy and tools, to validating results and documenting provenance.

Literature Foundation

Reference	Journal	Key Contribution	DOI	Citations
ENCODE Phase 3 (2020)	Nature	Registry of 926,535 candidate cis-regulatory elements; integrative analysis framework across 5,992 experiments	10.1038/s41586-020-2493-4	~1,656
Gorkin et al. (2020)	Nature	Integrative analysis of 3,158 mouse epigenomes; cross-tissue chromatin state annotation	10.1038/s41586-020-2093-3	~301
Ernst & Kellis (2012)	Nature Methods	ChromHMM: chromatin state discovery from combinatorial histone mark patterns	10.1038/nmeth.1906	~2,294
Nasser et al. (2021)	Nature	Activity-by-Contact (ABC) model for enhancer-gene linkage; outperforms proximity assignment	10.1038/s41586-021-03446-x	~468
Quinlan & Hall (2010)	Bioinformatics	BEDTools: genome arithmetic for interval comparisons, intersections, and merges	10.1093/bioinformatics/btq033	~10,000
Ramírez et al. (2016)	Nucleic Acids Res	deepTools2: signal normalization, correlation, and visualization for multi-sample genomic data	10.1093/nar/gkw257	~3,000
Love et al. (2014)	Genome Biology	DESeq2: differential analysis of count data with shrinkage estimation	10.1186/s13059-014-0550-8	~40,000
Ross-Innes et al. (2012)	Nature	DiffBind: differential binding analysis of ChIP-seq peak data across conditions	10.1038/nature10730	~1,200
Leek et al. (2010)	Nature Rev Genetics	Tackling batch effects: PCA-based detection, SVA/ComBat correction, experimental design	10.1038/nrg2825	~1,200

Step 1: Define the Integration Question

Clarify with the user which type of integration they need. There are four fundamental designs:

Integration Design	Example	Key Challenge
Same assay, cross-sample	H3K27ac ChIP-seq across 5 tissues	Batch effects between labs/donors
Multi-omic, same sample	ATAC-seq + RNA-seq + ChIP-seq in K562	Matching file types and normalization
Cross-organism	Human vs mouse liver chromatin	Ortholog mapping, synteny conservation
Perturbation / condition	Before vs after treatment	Need matched replicates per condition

Each design has different requirements for compatibility, normalization, and statistical framework. Establish the design before searching for data.

Questions to ask the user:

What biological question are you trying to answer?
Are you comparing across samples (differential) or combining across samples (cataloging)?
How many conditions/tissues/time points?
Do you need statistical testing or descriptive overlap?

Step 2: Find Compatible Experiments

2a. Explore Data Availability

Start with encode_get_facets to understand what data exists before committing to a design:

encode_get_facets(
    assay_title="Histone ChIP-seq",
    organ="pancreas"
)

This returns counts by target, biosample, lab, and other facets. Use it to verify that the intended comparison has sufficient data on both sides.
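The sufficiency check can be sketched in a few lines once the facet counts have been copied into plain dictionaries. The return shape of encode_get_facets is not shown in this document, so the dicts, the function name `shared_targets`, and the `min_experiments` threshold below are all hypothetical stand-ins, and the counts are made up.

```python
def shared_targets(facets_a, facets_b, min_experiments=2):
    """Return targets that have at least min_experiments on both sides, sorted by name."""
    return sorted(
        target
        for target in facets_a.keys() & facets_b.keys()
        if facets_a[target] >= min_experiments and facets_b[target] >= min_experiments
    )


# Hypothetical per-target experiment counts (not real ENCODE numbers).
pancreas = {"H3K27ac": 6, "H3K4me3": 4, "H3K9me3": 1}
liver = {"H3K27ac": 8, "H3K4me3": 1, "H3K36me3": 3}

usable = shared_targets(pancreas, liver)
print(usable)
```

Targets that appear on only one side, or fall below the threshold on either side, are dropped before any experiment search is run.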

2b. Search for Candidate Experiments

Search for experiments matching each arm of the integration:

encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    biosample_type="tissue",
    limit=100
)

For multi-omic designs, search each assay layer separately:

# Accessibility layer
encode_search_experiments(assay_title="ATAC-seq", organ="pancreas", limit=50)

# Expression layer
encode_search_experiments(assay_title="total RNA-seq", organ="pancreas", limit=50)

# Histone layer
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="pancreas", limit=50)

Present a summary table to the user showing experiments found per arm, number of replicates, labs represented, and any audit flags.
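One way to build that summary is sketched below. It assumes the search results have already been reduced to simple records. The field names (`arm`, `lab`, `replicates`, `audit_flags`) and the example values are assumptions for illustration, not the actual fields returned by encode_search_experiments.

```python
from collections import defaultdict


def summarize_arms(experiments):
    """Group experiment records by arm; count experiments, replicates, labs, and flagged experiments."""
    summary = defaultdict(
        lambda: {"experiments": 0, "replicates": 0, "labs": set(), "flagged": 0}
    )
    for exp in experiments:
        row = summary[exp["arm"]]
        row["experiments"] += 1
        row["replicates"] += exp["replicates"]
        row["labs"].add(exp["lab"])
        if exp["audit_flags"]:
            row["flagged"] += 1
    return dict(summary)


# Hypothetical records; labs and audit flag text are placeholders.
records = [
    {"arm": "ATAC-seq", "lab": "lab-A", "replicates": 2, "audit_flags": []},
    {"arm": "ATAC-seq", "lab": "lab-B", "replicates": 1, "audit_flags": ["example flag"]},
    {"arm": "H3K27ac ChIP-seq", "lab": "lab-A", "replicates": 2, "audit_flags": []},
]

table = summarize_arms(records)
print("Arm\tExperiments\tReplicates\tLabs\tFlagged")
for arm, row in table.items():
    print(f"{arm}\t{row['experiments']}\t{row['replicates']}\t{len(row['labs'])}\t{row['flagged']}")
```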

Step 3: Check Pairwise Compatibility

Track candidate experiments and then check compatibility:

encode_track_experiment(accession="ENCSR...")
encode_track_experiment(accession="ENCSR...")

encode_compare_experiments(
    accession1="ENCSR...",
    accession2="ENCSR..."
)

The compatibility check evaluates:

Dimension	Compatible	Requires Action	Incompatible
Organism	Same species	Cross-species with ortholog mapping	N/A (always addressable)
Assembly	Same build (GRCh38)	Different builds (need liftOver)	Mixed within analysis without lifting
Assay	Same assay	Different assays (expected in multi-omic)	N/A
Biosample	Same term name	Different biosamples (expected in cross-sample)	Unexpected mismatch
Lab	Same lab	Different labs (flag for batch effects)	N/A
Pipeline	Same version	Different versions (flag, may need reprocessing)	Fundamentally different pipelines
Replicates	2+ biological	1 replicate (limited statistical power)	0 replicates (unusable)

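Critical rule: ALL experiments from the same species in an integration MUST share the same genome assembly. Never mix GRCh38 and hg19 coordinates without an explicit liftOver. Cross-species designs are the one exception, and they are bridged by ortholog mapping rather than by mixing coordinates.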
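The table's logic can be approximated with a small classifier, shown here as a sketch covering only the assembly, lab, and replicate dimensions. It is not what encode_compare_experiments does internally; the metadata field names (`assembly`, `lab`, `bio_replicates`) and the function name `check_pair` are assumptions for illustration.

```python
def check_pair(a, b):
    """Classify two experiment metadata dicts as compatible, requires action, or incompatible."""
    fewest_reps = min(a["bio_replicates"], b["bio_replicates"])
    if fewest_reps == 0:
        # Zero replicates on either side makes the pair unusable.
        return "incompatible", ["no replicates"]

    issues = []
    if a["assembly"] != b["assembly"]:
        issues.append("different assemblies: liftOver needed")
    if a["lab"] != b["lab"]:
        issues.append("different labs: check for batch effects")
    if fewest_reps == 1:
        issues.append("single replicate: limited statistical power")

    status = "requires action" if issues else "compatible"
    return status, issues


# Hypothetical metadata for two experiments.
exp_a = {"assembly": "GRCh38", "lab": "lab-A", "bio_replicates": 2}
exp_b = {"assembly": "hg19", "lab": "lab-B", "bio_replicates": 1}

status, issues = check_pair(exp_a, exp_b)
print(status)
for issue in issues:
    print("-", issue)
```

Returning the list of issues alongside the status makes it easy to show the user exactly which dimension needs attention.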

Step 4: Select Matched Files

For each experiment, retrieve files using encode_list_files:

encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38",
    preferred_default=True
)

File Matching Rules

All files that are compared directly (for example, every peak file within one assay layer) MUST be matched on:

Same assembly (GRCh38 for human, mm10 for mouse)
Same output type (e.g., all "IDR thresholded peaks" or all "fold change over control")
Same file format (all narrowPeak, or all bigWig, or all TSV)
Same pipeline version when possible (check ENCODE pipeline annotations)
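These rules lend themselves to an automated check before any analysis starts. The sketch below assumes each file's metadata has been collected into a dict. The keys `assembly`, `output_type`, and `file_format` mirror the encode_list_files parameters shown above, but the record layout and the function name `mismatched_fields` are hypothetical.

```python
def mismatched_fields(files, fields=("assembly", "output_type", "file_format")):
    """Return {field: sorted distinct values} for each field that is not uniform across files."""
    mismatches = {}
    for field in fields:
        values = {f[field] for f in files}
        if len(values) > 1:
            mismatches[field] = sorted(values)
    return mismatches


# Hypothetical file records; "file-1" etc. are placeholders, not ENCODE accessions.
selected = [
    {"id": "file-1", "assembly": "GRCh38", "output_type": "IDR thresholded peaks", "file_format": "bed"},
    {"id": "file-2", "assembly": "GRCh38", "output_type": "IDR thresholded peaks", "file_format": "bed"},
    {"id": "file-3", "assembly": "hg19", "output_type": "IDR thresholded peaks", "file_format": "bed"},
]

problems = mismatched_fields(selected)
print(problems)
```

An empty result means the set is uniform on every checked field. Anything else names the field and the conflicting values to resolve first.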

File Type Compatibility Matrix

Not all file types can be directly integrated. This matrix shows which combinations are valid:

File Type A	File Type B	Integration Method	Valid?
narrowPeak	narrowPeak	BEDTools intersect/merge	Yes
narrowPeak	broadPeak	BEDTools intersect (with caveats)	Yes, but peak resolution differs
narrowPeak	bigWig	Signal extraction at peak locations	Yes
bigWig	bigWig	deepTools multiBigwigSummary	Yes
bigWig	narrowPeak	Signal quantification within peaks	Yes
gene quant TSV	gene quant TSV	DESeq2 count matrix	Yes
gene quant TSV	narrowPeak	Gene-centric: peaks near expressed genes	Yes (indirect)
contact matrix	narrowPeak	Loops anchored at peaks	Yes (resolution-dependent)
narrowPeak	gene quant TSV	Enhancer-gene linkage (ABC model)	Yes (requires Hi-C)
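To make the first row concrete, here is a pure-Python sketch of the kind of overlap that BEDTools intersect reports: peaks in one set that overlap at least one peak in another, comparable in spirit to `bedtools intersect -u`. It is a teaching aid rather than a replacement for BEDTools. The peak coordinates are invented, and intervals are treated as half-open, as in BED files.

```python
def peaks_with_overlap(peaks_a, peaks_b):
    """Return peaks from A that overlap at least one peak in B on the same chromosome."""
    # Index B by chromosome so each A peak is only compared within its own chromosome.
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, start, end in peaks_a:
        # Half-open intervals overlap when each one starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end))
    return hits


# Invented peaks as (chrom, start, end) tuples.
atac = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
h3k27ac = [("chr1", 150, 300), ("chr1", 600, 700), ("chr3", 100, 200)]

overlapping = peaks_with_overlap(atac, h3k27ac)
print(overlapping)
```

The second ATAC peak ends exactly where an H3K27ac peak begins, so under half-open coordinates the two touch but do not overlap. This quadratic scan is fine for illustration; for genome-scale peak sets, use BEDTools itself.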

Cannot directly combine:

Raw FASTQ with proces
