ENCODE ChIP-seq Pipeline
When to Use
- User wants to run a ChIP-seq processing pipeline from FASTQ to peaks and signal tracks
- User asks about "ChIP-seq pipeline", "MACS2", "peak calling", "BWA alignment for ChIP", or "IDR"
- User needs to process histone or TF ChIP-seq data following ENCODE standards
- Example queries: "process my ChIP-seq FASTQs", "run the ENCODE ChIP-seq pipeline", "call peaks from ChIP-seq with MACS2 and IDR"
Execute the ENCODE ChIP-seq processing pipeline from raw FASTQ files through peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.
Overview
The ENCODE ChIP-seq pipeline processes chromatin immunoprecipitation sequencing data through a series of well-defined stages: quality control, adapter trimming, alignment to a reference genome, filtering and deduplication, peak calling with MACS2, replicate consistency analysis via IDR, and signal track generation. Each stage is parameterized according to ENCODE standards and produces QC metrics for comprehensive quality assessment.
This pipeline handles both transcription factor (TF) ChIP-seq and histone modification ChIP-seq, calling narrow or broad peaks according to the peak type set for each sample.
Key Literature
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Landt et al. "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia" | Genome Research | 2012 | 10.1101/gr.136184.111 | ENCODE ChIP-seq standards (~4,000 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias of DNA elements in the human and mouse genomes" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Zhang et al. "Model-based Analysis of ChIP-Seq (MACS)" | Genome Biology | 2008 | 10.1186/gb-2008-9-9-r137 | Peak caller (~7,000 citations) |
| Li et al. "Measuring reproducibility of high-throughput experiments" | Annals of Applied Statistics | 2011 | 10.1214/11-AOAS466 | Replicate consistency (~1,500 citations) |
| Amemiya et al. "The ENCODE Blacklist: Identification of Problematic Regions of the Genome" | Scientific Reports | 2019 | 10.1038/s41598-019-45839-z | Artifact regions (~1,372 citations) |
| Kundaje "phantompeakqualtools" (software) | — | 2013 | — | NSC/RSC strand correlation metrics |
Pipeline Stages
FASTQ ──> FastQC / Trim Galore ──> BWA-MEM ──> Samtools Filter ──> Picard MarkDup
  │                                                                      │
  │   ┌──────────────────────────────────────────────────────────────────┘
  │   v
  │ Blacklist Filter ──> MACS2 Peak Calling ──> IDR Analysis
  │   │                                              │
  │   v                                              v
  │ Signal Tracks                           QC Report (MultiQC)
  │ (bigWig)
  v
Raw QC Report
Stage Summary
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | BWA-MEM | Trimmed FASTQ | Sorted BAM | references/02-alignment.md |
| 3. Filtering | Picard, Samtools, bedtools | Sorted BAM | Filtered BAM | references/03-filtering.md |
| 4. Peak Calling & IDR | MACS2, IDR | Filtered BAM | Peaks (narrowPeak/broadPeak) | references/04-analysis.md |
| 5. QC & Signal | deepTools, phantompeakqualtools | Filtered BAM, Peaks | bigWig, QC report | references/05-qc-metrics.md |
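The filtering stage is easiest to reason about in terms of SAM flag bits. The sketch below assumes the filter drops unmapped reads, reads with an unmapped mate, secondary alignments, QC failures, and marked duplicates, and keeps alignments with MAPQ of at least 30. That matches the convention used by the public ENCODE pipeline, but the exact options in `scripts/main.nf` are not shown in this document, so the function names, file names, and the MAPQ cutoff are assumptions. The command is printed, not executed.

```shell
# Sketch only: build (do not run) a samtools filter command.
# Flag bit values are taken from the SAM specification.

# Sum the SAM flag bits to exclude for a given layout (paired|single).
exclude_flag() {
    layout=$1
    unmapped=4; mate_unmapped=8; secondary=256; qc_fail=512; duplicate=1024
    total=$((unmapped + secondary + qc_fail + duplicate))
    if [ "$layout" = "paired" ]; then
        total=$((total + mate_unmapped))
    fi
    echo "$total"
}

# Print the filter command for a sorted, duplicate-marked BAM.
# Arguments: layout input_bam output_bam (file names are placeholders).
filter_cmd() {
    flag=$(exclude_flag "$1")
    if [ "$1" = "paired" ]; then
        # -f 2 additionally requires properly paired reads
        echo "samtools view -b -F $flag -f 2 -q 30 -o $3 $2"
    else
        echo "samtools view -b -F $flag -q 30 -o $3 $2"
    fi
}

filter_cmd paired sample.markdup.bam sample.filtered.bam
```

For paired-end data the bits sum to 1804, and for single-end data to 1796, which are the values commonly seen in ENCODE-style filter commands.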
Input Requirements
Required Files
- Treatment FASTQ: ChIP sample reads (single-end or paired-end, gzipped)
- Control FASTQ: Input/IgG control reads (matching single-end or paired-end)
- Reference genome: BWA-indexed genome (GRCh38 for human, mm10 for mouse)
Sample Sheet Format
sample_id,treatment_r1,treatment_r2,control_r1,control_r2,target,peak_type
SAMPLE1,h3k27ac_R1.fq.gz,h3k27ac_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27ac,narrow
SAMPLE2,h3k27me3_R1.fq.gz,h3k27me3_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27me3,broad
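Checking the sample sheet before launching a run catches most format mistakes early. The following is a minimal sketch, assuming the seven-column header shown above, that `treatment_r1` and `control_r1` are always required, and that the `_r2` columns may be left empty for single-end data. The function name `validate_samplesheet` is hypothetical and is not part of the pipeline.

```shell
# Sketch only: validate a sample sheet in the format shown above.
# Prints one message per problem and returns non-zero if any are found.
validate_samplesheet() {
    awk -F',' '
        /^$/ { next }   # ignore blank lines
        NR == 1 {
            if ($0 != "sample_id,treatment_r1,treatment_r2,control_r1,control_r2,target,peak_type") {
                print "line 1: unexpected header"; bad = 1
            }
            next
        }
        NF != 7 { print "line " NR ": expected 7 columns, got " NF; bad = 1; next }
        $2 == "" { print "line " NR ": treatment_r1 is empty"; bad = 1 }
        $4 == "" { print "line " NR ": control_r1 is empty"; bad = 1 }
        $7 != "narrow" && $7 != "broad" {
            print "line " NR ": peak_type must be narrow or broad"; bad = 1
        }
        seen[$1]++ { print "line " NR ": duplicate sample_id " $1; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

A call such as `validate_samplesheet samples.csv && nextflow run scripts/main.nf ...` would then only start the pipeline when the sheet passes.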
Narrow vs Broad Peak Mode Decision
| Peak Type | Targets | MACS2 Mode |
|---|---|---|
| Narrow | H3K4me3, H3K4me1, H3K27ac, H3K9ac, all TFs (including CTCF) | --qvalue 0.05 (default) |
| Broad | H3K27me3, H3K36me3, H3K9me3, H3K79me2 | --broad --broad-cutoff 0.1 |
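The table above can be turned into a small lookup that maps the `peak_type` column to MACS2 options. This is a sketch, not the pipeline's actual code: the function names are hypothetical, and `-f BAMPE` (paired-end BAM) and `-g hs` (human effective genome size) are assumptions that would change to `-f BAM` and `-g mm` for single-end or mouse data. The command is printed, not executed.

```shell
# Sketch only: map peak_type to the MACS2 options listed in the table above.
macs2_mode_flags() {
    case "$1" in
        narrow) echo "--qvalue 0.05" ;;
        broad)  echo "--broad --broad-cutoff 0.1" ;;
        *)      echo "unknown peak_type: $1" >&2; return 1 ;;
    esac
}

# Print a macs2 callpeak command.
# Arguments: sample_id peak_type treatment_bam control_bam
macs2_cmd() {
    flags=$(macs2_mode_flags "$2") || return 1
    echo "macs2 callpeak -t $3 -c $4 -f BAMPE -g hs -n $1 $flags"
}

macs2_cmd SAMPLE2 broad chip.filtered.bam input.filtered.bam
```

Rejecting unknown peak types outright, instead of silently falling back to narrow mode, avoids calling narrow peaks on a broad mark because of a typo in the sample sheet.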
QC Thresholds
These thresholds follow ENCODE standards established by Landt et al. 2012 and the ENCODE DCC quality metrics documentation.
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Read depth (usable fragments per replicate) | ≥20M (TF, narrow histone marks), ≥45M (broad histone marks) | Read depth | ENCODE |
| Mapping rate | >80% | Alignment | ENCODE |
| NRF (non-redundant fraction) | ≥0.8 | Library complexity | ENCODE |
| PBC1 (PCR bottleneck coeff 1) | ≥0.8 | Library complexity | ENCODE |
| PBC2 (PCR bottleneck coeff 2) | ≥3 | Library complexity | ENCODE |
| NSC (normalized strand cross-correlation coefficient) | >1.05 | Enrichment | phantompeakqualtools |
| RSC (relative strand cross-correlation coefficient) | >0.8 | Enrichment | phantompeakqualtools |
| FRiP (fraction reads in peaks) | ≥1% | Peak quality | Landt 2012 |
| IDR optimal peaks | >20,000 (TF) | Reproducibility | ENCODE |
| Duplication rate | <30% | Library complexity | ENCODE |
| Mitochondrial fraction | <5% | Sample quality | ENCODE |
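The three library complexity metrics in the table are simple ratios of read counts per genomic position. As a worked sketch, the function below computes them from a text file with one line per uniquely mapped read in the form `chrom start strand`. That input format and the function name are assumptions for illustration; in practice the positions would be derived from the filtered BAM (for example with `bedtools bamtobed`).

```shell
# Sketch only: compute NRF, PBC1 and PBC2 from "chrom start strand" lines.
#   NRF  = distinct positions / total reads
#   PBC1 = positions with exactly 1 read / distinct positions
#   PBC2 = positions with exactly 1 read / positions with exactly 2 reads
library_complexity() {
    awk '
        NF >= 3 { count[$1 FS $2 FS $3]++; total++ }
        END {
            if (total == 0) { print "no reads"; exit 1 }
            for (pos in count) {
                distinct++
                if (count[pos] == 1) m1++
                if (count[pos] == 2) m2++
            }
            printf "NRF=%.3f PBC1=%.3f ", distinct / total, m1 / distinct
            if (m2 > 0) printf "PBC2=%.3f\n", m1 / m2
            else        printf "PBC2=NA\n"
        }
    ' "$1"
}
```

For example, 8 reads spread over 5 positions with per-position counts 1, 1, 1, 2 and 3 give NRF = 5/8 = 0.625, PBC1 = 3/5 = 0.600 and PBC2 = 3/1 = 3.000, so this toy library would fail the NRF and PBC1 thresholds and sit exactly on the PBC2 threshold.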
Interpreting QC: Traffic Light System
| Color | Meaning | Action |
|---|---|---|
| Green | All metrics pass | Proceed to analysis |
| Yellow | 1-2 metrics marginal | Review library prep; sample may still be usable |
| Red | Multiple failures | Do not use; repeat the experiment |
Important: No single metric is sufficient. Interpret QC collectively. A sample with borderline NRF but excellent FRiP may still be usable.
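One way to apply the traffic light table mechanically is to count how many metrics miss their threshold. The sketch below does this for seven of the metrics in the QC table. The mapping of 0 misses to green, 1-2 to yellow, and 3 or more to red is an assumed reading of the table, not an ENCODE rule, and it treats "marginal" and "failed" alike, so it should only be used as a first-pass flag before the collective review described above. The function name `qc_light` is hypothetical.

```shell
# Sketch only: classify a sample by the number of missed QC thresholds.
# Arguments (fractions, not percentages, for mapping rate and FRiP):
#   mapping_rate NRF PBC1 PBC2 NSC RSC FRiP
qc_light() {
    awk -v mapr="$1" -v nrf="$2" -v pbc1="$3" -v pbc2="$4" \
        -v nsc="$5" -v rsc="$6" -v frip="$7" '
    BEGIN {
        fails = 0
        if (mapr <= 0.80) fails++   # mapping rate must exceed 80%
        if (nrf  <  0.8)  fails++
        if (pbc1 <  0.8)  fails++
        if (pbc2 <  3)    fails++
        if (nsc  <= 1.05) fails++
        if (rsc  <= 0.8)  fails++
        if (frip <  0.01) fails++   # FRiP of at least 1%
        if (fails == 0)      print "green"
        else if (fails <= 2) print "yellow"
        else                 print "red"
    }'
}

qc_light 0.95 0.75 0.93 12 1.2 1.1 0.05
```

The example call has a borderline NRF but passes everything else, which is the kind of sample the note above describes as potentially still usable.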
Execution
Quick Start (Local Docker)
nextflow run scripts/main.nf \
    -profile local \
    --reads 'fastq/*_R{1,2}.fq.gz' \
    --control 'fastq/input_R{1,2}.fq.gz' \
    --genome GRCh38 \
    --peak_type narrow \
    --outdir results/
SLURM HPC
nextflow run scripts/main.nf \
    -profile slurm \
    --reads 'fastq/*_R{1,2}.fq.gz' \
    --control 'fastq/input_R{1,2}.fq.gz' \
    --genome GRCh38 \
    --peak_type narrow \
    --outdir results/
Google Cloud
nextflow run scripts/main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
    --control 'gs://bucket/fastq/input_R{1,2}.fq.gz' \
    --genome GRCh38 \
    --outdir 'gs://bucket/results/'
AWS Batch
nextflow run scripts/main.nf \
    -profile aws \
    --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
    --control 's3://bucket/fastq/input_R{1,2}.fq.gz' \
    --genome GRCh38 \
    --outdir 's3://bucket/results/'
Cloud Cost Estimates
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-standard-8 | ~$2-5 | 2-4 hours | Preemptible recommended |
| AWS | m5.2xlarge | ~$2-5 | 2-4 hours | Spot instances recommended |
| Local | 8 cores, 32GB | $0 | 3-6 hours | Docker required |
| SLURM | 8 cores, 32GB | Varies | 2-4 hours | Singularity recommended |
Output Directory Structure
results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  aligned/                  # Sorted BAM files
  filtered/                 # Filtered, deduplicated BAM
  peaks/
    narrow/                 # narrowPeak files (TFs, narrow histone marks)
    broad/                  # broadPeak files (broad histone marks)
    idr/                    # IDR-filtered reproducible peaks
  signal/
    fold_change/            # Fold change over control (bigWig)
    pvalue/                 # Signal p-value tracks (bigWig)
  qc/
    phantompeakqualtools/   # NSC/RSC strand correlation
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs
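After a run it can be useful to confirm that the expected directories exist before moving on to downstream analysis. The following is a minimal sketch under two assumptions: that `narrow/`, `broad/` and `idr/` sit under `peaks/` (with the analogous nesting for `signal/` and `qc/`), and that the pipeline creates every directory even when it stays empty. If a run only produces one peak type, the list should be trimmed accordingly. The function name `check_outputs` is hypothetical.

```shell
# Sketch only: report which expected output subdirectories are missing.
# Returns success only when all of them are present.
check_outputs() {
    outdir=$1
    missing=0
    for sub in fastqc trimmed aligned filtered \
               peaks/narrow peaks/broad peaks/idr \
               signal/fold_change signal/pvalue \
               qc/phantompeakqualtools qc/multiqc logs; do
        if [ ! -d "$outdir/$sub" ]; then
            echo "missing: $sub"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

A call such as `check_outputs results/` prints one line per missing directory and can be used as a guard in a wrapper script.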
Common Pitfalls
1. Missing Input Control
ChIP-seq requires a matched input (or IgG) control for accurate peak calling. Without it, MACS2 will call peaks against a uniform background model