Aggregate Histone ChIP-seq Peaks Across Studies

When to Use

User wants to combine histone ChIP-seq peaks across multiple ENCODE experiments for a tissue or cell type
User asks "where is H3K27ac in pancreas?" or "build a histone mark map for liver"
User needs a union peak set from multiple donors, labs, or replicates
User wants to create a consensus binding map from multiple ChIP-seq datasets
Example queries: "aggregate all H3K4me3 peaks in brain", "combine histone marks across donors", "build enhancer map from H3K27ac data"

Build a comprehensive map of histone mark binding for a tissue/cell type by merging narrowPeak files from multiple ENCODE experiments into a union peak set.

Scientific Rationale

The question: "Does my tissue have this histone mark, and at what genomic locations?"

This is a detection/cataloging question, not a differential one. Once a histone mark passes noise thresholds (ENCODE IDR, quality metrics), detection is binary — the mark is either bound or not. If detected in one donor but not another, that region is still a real binding site. Individual variation and technical differences (lab, depth, antibody lot) explain absence, not that presence is spurious.

Therefore: we want the UNION of all detections, not a consensus.

Literature Support

ChIP-Atlas (Oki et al. 2018, EMBO Reports, 597 citations): Integrated >70,000 public ChIP-seq datasets using union of all peak calls
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Created unified chromatin state annotations by integrating all peaks across 1,128 ChIP-seq experiments
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Defined the comprehensive set of problematic genomic regions to filter from all functional genomics analyses. Essential quality step. DOI
Perna et al. 2024 (BMC Genomics): Found top 25% signalValue peaks most consistent across different processing pipelines — use as per-sample noise filter
ChIP-R (Newell et al. 2020, 26 citations): Rank-product method for combining peaks from multiple replicates without BAMs, works directly on narrowPeak files
MSPC (Jalili et al. 2021, BMC Bioinformatics): Rescues weak-but-real binding sites that IDR discards by exploiting replicates to lower calling thresholds — more sensitive alternative for union-based approaches
Hecht et al. 2023 (PLoS Comp Bio): Probability-of-Being-Signal (PBS) approach for cross-dataset comparison with differing read depths

Step 1: Find All Available Experiments

Search for all histone ChIP-seq data for the target mark and tissue:

encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K4me1",        # or H3K27ac, H3K4me3, H3K27me3, etc.
    organ="pancreas",         # user's tissue of interest
    biosample_type="tissue",  # or "cell line", "primary cell"
    limit=100
)

Present a summary table to the user showing:

Number of experiments found
Labs represented
Number of unique donors/biosamples
Any audit flags

Use encode_get_facets first if unsure what's available:

encode_get_facets(assay_title="Histone ChIP-seq", organ="pancreas")

Step 2: Quality-Gate Each Experiment

For each experiment, check quality before including:

encode_get_experiment(accession="ENCSR...")

Include if:

Audit status: no ERROR flags (WARNING is acceptable)
Has IDR thresholded peaks (passed replicate concordance)
Sequencing depth meets ENCODE standards (10M+ for narrow marks, 20M+ for broad)

Exclude if:

ERROR audit flags
Only pseudoreplicated peaks (no IDR = did not pass reproducibility)
Known antibody issues (check audit details)

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download IDR Thresholded NarrowPeak Files

For each passing experiment, get the peak files:

encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38"
)

File selection priority:

IDR thresholded peaks (gold standard — passed replicate concordance)
Optimal IDR peaks (pooled replicates — most complete set)
Replicated peaks (alternative peak caller output)

Prefer preferred_default=True files when available.

Download all selected files:

encode_download_files(
    file_accessions=["ENCFF...", "ENCFF...", ...],
    download_dir="/path/to/data/narrowpeaks",
    organize_by="flat"
)

Step 4: Per-Sample Noise Filtering

IMPORTANT: Filter BEFORE merging, not after.

4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

Remove artifact-prone regions (centromeres, telomeres, rDNA repeats, satellite repeats):

# Download ENCODE blocklist for GRCh38 from:
# https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
# For mm10: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/mm10-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak

4b. SignalValue Filtering (Perna et al. 2024)

Filter each sample's peaks to retain those above the 25th percentile signalValue (column 7 in narrowPeak). The top 75% of peaks by signalValue are the most reliable across processing pipelines:

# Calculate the 25th percentile of the signalValue DISTRIBUTION for this sample
# (This is a true quantile, not 25% of the range)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak

Note: This is a per-sample filter. Each experiment has different signal distributions. Do NOT apply a universal threshold across samples.

Step 5: Union Merge Across Samples

5a. Handling Broad vs Narrow Marks

Different histone marks have different peak characteristics:

Mark Class	Examples	Peak Type	Merge Gap (-d)
Narrow/point	H3K4me3, H3K27ac, H3K4me1, H3K9ac	Sharp peaks	0 (default, overlap only)
Broad/domain	H3K27me3, H3K9me3, H3K36me3	Wide domains	1000-5000bp

5b. Label Peaks by Sample Before Merge

CRITICAL: To count unique SAMPLES (not overlapping peaks), tag each peak with its sample ID before concatenation:

# Tag each sample's peaks with a unique sample ID (column 4 = name field)
awk -v sid="sample1" 'BEGIN{OFS="\t"} {$4=sid; print}' sample1.qfiltered.narrowPeak > sample1.tagged.bed
awk -v sid="sample2" 'BEGIN{OFS="\t"} {$4=sid; print}' sample2.qfiltered.narrowPeak > sample2.tagged.bed
# ... repeat for all samples

# Concatenate all tagged peaks
cat sample*.tagged.bed > all_peaks.bed

# Sort by coordinate
bedtools sort -i all_peaks.bed > all_peaks.sorted.bed

5c. Union Merge Command

# Merge overlapping peaks, counting UNIQUE SAMPLES (not peaks)
# For NARROW marks (H3K4me3, H3K27ac, H3K4me1):
bedtools merge \
    -i all_peaks.sorted.bed \
    -c 4,7,9 \
    -o count_distinct,max,max \
    > union_peaks.bed
# Output columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue

# For BROAD marks (H3K27me3, H3K9me3, H3K36me3):
bedtools merge \
    -i all_peaks.sorted.bed \
    -d 1000 \
    -c 4,7,9 \
    -o count_distinct,max,max \
    > union_peaks.bed

Why count_distinct matters: Without it, a sample with 3 overlapping peaks inflates the count to 3 instead of 1, making the confidence annotation wrong.

Note on broad marks: H3K27me3, H3K9me3, and H3K36me3 are typically called as broadPeak (not narrowPeak) in ENCODE. When downloading, check for output_type="replicated peaks" with file_type="bed broadPeak". BroadPeak has the same first 9 columns as narrowPeak minus the summit col

histone-aggregation

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

xlsx

mem-search

weekly-digests

how-it-works

Recibe nuevas skills de Dados e Análise todos los lunes