Aggregate Histone ChIP-seq Peaks Across Studies
When to Use
- User wants to combine histone ChIP-seq peaks across multiple ENCODE experiments for a tissue or cell type
- User asks "where is H3K27ac in pancreas?" or "build a histone mark map for liver"
- User needs a union peak set from multiple donors, labs, or replicates
- User wants to create a consensus binding map from multiple ChIP-seq datasets
- Example queries: "aggregate all H3K4me3 peaks in brain", "combine histone marks across donors", "build enhancer map from H3K27ac data"
Build a comprehensive map of histone mark binding for a tissue/cell type by merging narrowPeak files from multiple ENCODE experiments into a union peak set.
Scientific Rationale
The question: "Does my tissue have this histone mark, and at what genomic locations?"
This is a detection/cataloging question, not a differential one. Once a histone mark passes noise thresholds (ENCODE IDR, quality metrics), detection is binary — the mark is either bound or not. If detected in one donor but not another, that region is still a real binding site. Individual variation and technical differences (lab, depth, antibody lot) explain absence, not that presence is spurious.
Therefore: we want the UNION of all detections, not a consensus.
Literature Support
- ChIP-Atlas (Oki et al. 2018, EMBO Reports, 597 citations): Integrated >70,000 public ChIP-seq datasets using union of all peak calls
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Created unified chromatin state annotations by integrating all peaks across 1,128 ChIP-seq experiments
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Defined the comprehensive set of problematic genomic regions to filter from all functional genomics analyses. Essential quality step. DOI
- Perna et al. 2024 (BMC Genomics): Found top 25% signalValue peaks most consistent across different processing pipelines — use as per-sample noise filter
- ChIP-R (Newell et al. 2020, 26 citations): Rank-product method for combining peaks from multiple replicates without BAMs, works directly on narrowPeak files
- MSPC (Jalili et al. 2021, BMC Bioinformatics): Rescues weak-but-real binding sites that IDR discards by exploiting replicates to lower calling thresholds — more sensitive alternative for union-based approaches
- Hecht et al. 2023 (PLoS Comp Bio): Probability-of-Being-Signal (PBS) approach for cross-dataset comparison with differing read depths
Step 1: Find All Available Experiments
Search for all histone ChIP-seq data for the target mark and tissue:
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K4me1", # or H3K27ac, H3K4me3, H3K27me3, etc.
organ="pancreas", # user's tissue of interest
biosample_type="tissue", # or "cell line", "primary cell"
limit=100
)
Present a summary table to the user showing:
- Number of experiments found
- Labs represented
- Number of unique donors/biosamples
- Any audit flags
Use encode_get_facets first if unsure what's available:
encode_get_facets(assay_title="Histone ChIP-seq", organ="pancreas")
Step 2: Quality-Gate Each Experiment
For each experiment, check quality before including:
encode_get_experiment(accession="ENCSR...")
Include if:
- Audit status: no ERROR flags (WARNING is acceptable)
- Has IDR thresholded peaks (passed replicate concordance)
- Sequencing depth meets ENCODE standards (10M+ for narrow marks, 20M+ for broad)
Exclude if:
- ERROR audit flags
- Only pseudoreplicated peaks (no IDR = did not pass reproducibility)
- Known antibody issues (check audit details)
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download IDR Thresholded NarrowPeak Files
For each passing experiment, get the peak files:
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
File selection priority:
- IDR thresholded peaks (gold standard — passed replicate concordance)
- Optimal IDR peaks (pooled replicates — most complete set)
- Replicated peaks (alternative peak caller output)
Prefer preferred_default=True files when available.
Download all selected files:
encode_download_files(
file_accessions=["ENCFF...", "ENCFF...", ...],
download_dir="/path/to/data/narrowpeaks",
organize_by="flat"
)
Step 4: Per-Sample Noise Filtering
IMPORTANT: Filter BEFORE merging, not after.
4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
Remove artifact-prone regions (centromeres, telomeres, rDNA repeats, satellite repeats):
# Download ENCODE blocklist for GRCh38 from:
# https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
# For mm10: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/mm10-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak
4b. SignalValue Filtering (Perna et al. 2024)
Filter each sample's peaks to retain those above the 25th percentile signalValue (column 7 in narrowPeak). The top 75% of peaks by signalValue are the most reliable across processing pipelines:
# Calculate the 25th percentile of the signalValue DISTRIBUTION for this sample
# (This is a true quantile, not 25% of the range)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak
Note: This is a per-sample filter. Each experiment has different signal distributions. Do NOT apply a universal threshold across samples.
Step 5: Union Merge Across Samples
5a. Handling Broad vs Narrow Marks
Different histone marks have different peak characteristics:
| Mark Class | Examples | Peak Type | Merge Gap (-d) |
|---|---|---|---|
| Narrow/point | H3K4me3, H3K27ac, H3K4me1, H3K9ac | Sharp peaks | 0 (default, overlap only) |
| Broad/domain | H3K27me3, H3K9me3, H3K36me3 | Wide domains | 1000-5000bp |
5b. Label Peaks by Sample Before Merge
CRITICAL: To count unique SAMPLES (not overlapping peaks), tag each peak with its sample ID before concatenation:
# Tag each sample's peaks with a unique sample ID (column 4 = name field)
awk -v sid="sample1" 'BEGIN{OFS="\t"} {$4=sid; print}' sample1.qfiltered.narrowPeak > sample1.tagged.bed
awk -v sid="sample2" 'BEGIN{OFS="\t"} {$4=sid; print}' sample2.qfiltered.narrowPeak > sample2.tagged.bed
# ... repeat for all samples
# Concatenate all tagged peaks
cat sample*.tagged.bed > all_peaks.bed
# Sort by coordinate
bedtools sort -i all_peaks.bed > all_peaks.sorted.bed
5c. Union Merge Command
# Merge overlapping peaks, counting UNIQUE SAMPLES (not peaks)
# For NARROW marks (H3K4me3, H3K27ac, H3K4me1):
bedtools merge \
-i all_peaks.sorted.bed \
-c 4,7,9 \
-o count_distinct,max,max \
> union_peaks.bed
# Output columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue
# For BROAD marks (H3K27me3, H3K9me3, H3K36me3):
bedtools merge \
-i all_peaks.sorted.bed \
-d 1000 \
-c 4,7,9 \
-o count_distinct,max,max \
> union_peaks.bed
Why count_distinct matters: Without it, a sample with 3 overlapping peaks inflates the count to 3 instead of 1, making the confidence annotation wrong.
Note on broad marks: H3K27me3, H3K9me3, and H3K36me3 are typically called as broadPeak (not narrowPeak) in ENCODE. When downloading, check for output_type="replicated peaks" with file_type="bed broadPeak". BroadPeak has the same first 9 columns as narrowPeak minus the summit col