ENCODE ChIP-seq Pipeline

When to Use

User wants to run a ChIP-seq processing pipeline from FASTQ to peaks and signal tracks
User asks about "ChIP-seq pipeline", "MACS2", "peak calling", "BWA alignment for ChIP", or "IDR"
User needs to process histone or TF ChIP-seq data following ENCODE standards
Example queries: "process my ChIP-seq FASTQs", "run the ENCODE ChIP-seq pipeline", "call peaks from ChIP-seq with MACS2 and IDR"

Execute the ENCODE ChIP-seq processing pipeline from raw FASTQ files through peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.

Overview

The ENCODE ChIP-seq pipeline processes chromatin immunoprecipitation sequencing data through a series of well-defined stages: quality control, adapter trimming, alignment to a reference genome, filtering and deduplication, peak calling with MACS2, replicate consistency analysis via IDR, and signal track generation. Each stage is parameterized according to ENCODE standards and produces QC metrics for comprehensive quality assessment.

This pipeline handles both transcription factor (TF) ChIP-seq and histone modification ChIP-seq, automatically selecting narrow or broad peak calling modes as appropriate.

Key Literature

Reference	Journal	Year	DOI	Relevance
Landt et al. "ChIP-seq guidelines and practices"	Genome Research	2012	10.1101/gr.136184.111	ENCODE ChIP-seq standards (~4,000 citations)
ENCODE Project Consortium "Expanded encyclopaedias"	Nature	2020	10.1038/s41586-020-2493-4	ENCODE Phase 3 standards
Zhang et al. "Model-based Analysis of ChIP-Seq (MACS)"	Genome Biology	2008	10.1186/gb-2008-9-9-r137	Peak caller (~7,000 citations)
Li et al. "Measuring reproducibility (IDR)"	Annals of Applied Statistics	2011	10.1214/11-AOAS466	Replicate consistency (~1,500 citations)
Amemiya et al. "ENCODE Blacklist"	Scientific Reports	2019	10.1038/s41598-019-45839-z	Artifact regions (~1,372 citations)
Ramachandran et al. "phantompeakqualtools"	—	2013	—	NSC/RSC strand correlation metrics

Pipeline Stages

FASTQ ──> FastQC / Trim Galore ──> BWA-MEM ──> Samtools Filter ──> Picard MarkDup
  │                                                                       │
  │           ┌───────────────────────────────────────────────────────────┘
  │           v
  │     Blacklist Filter ──> MACS2 Peak Calling ──> IDR Analysis
  │                                │                     │
  │                                v                     v
  │                         Signal Tracks          QC Report (MultiQC)
  │                          (bigWig)
  v
 Raw QC Report

Stage Summary

Stage	Tool	Input	Output	Reference
1. QC & Trimming	FastQC, Trim Galore	Raw FASTQ	Trimmed FASTQ	references/01-qc-trimming.md
2. Alignment	BWA-MEM	Trimmed FASTQ	Sorted BAM	references/02-alignment.md
3. Filtering	Picard, Samtools, bedtools	Sorted BAM	Filtered BAM	references/03-filtering.md
4. Peak Calling & IDR	MACS2, IDR	Filtered BAM	Peaks (narrowPeak/broadPeak)	references/04-analysis.md
5. QC & Signal	deeptools, phantompeakqualtools	Filtered BAM, Peaks	bigWig, QC report	references/05-qc-metrics.md

Input Requirements

Required Files

Treatment FASTQ: ChIP sample reads (single-end or paired-end, gzipped)
Control FASTQ: Input/IgG control reads (matching single-end or paired-end)
Reference genome: BWA-indexed genome (GRCh38 for human, mm10 for mouse)
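
A "BWA-indexed genome" means the FASTA sits next to the index files written by `bwa index`. The following is a minimal sketch of a pre-flight check, assuming a classic BWA index (suffixes `.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The helper name `missing_bwa_index_files` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Suffixes written by `bwa index` alongside the FASTA file
# (assumption: classic BWA index, not bwa-mem2).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta_path):
    """Return the names of BWA index files that are absent for this FASTA."""
    fasta = Path(fasta_path)
    return [
        fasta.name + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(str(fasta) + suffix).exists()
    ]
```

An empty list means the index looks complete. Anything else names the files to regenerate before starting a run.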

Sample Sheet Format

sample_id,treatment_r1,treatment_r2,control_r1,control_r2,target,peak_type
SAMPLE1,chip_R1.fq.gz,chip_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27ac,narrow
SAMPLE2,chip_R1.fq.gz,chip_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27me3,broad
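
A sample sheet in this layout can be validated before launching a run. The sketch below is illustrative only: it assumes single-end samples leave the `*_r2` columns empty, which the format above does not state, and `parse_sample_sheet` is a hypothetical helper rather than part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "sample_id", "treatment_r1", "treatment_r2",
    "control_r1", "control_r2", "target", "peak_type",
]


def parse_sample_sheet(text):
    """Parse sample sheet text and raise ValueError on the first invalid row."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["peak_type"] not in ("narrow", "broad"):
            raise ValueError(f"line {line_no}: peak_type must be narrow or broad")
        if not row["treatment_r1"] or not row["control_r1"]:
            raise ValueError(f"line {line_no}: treatment and control R1 are required")
        # Assumption: an empty R2 column marks a single-end library.
        # Treatment and control must then agree on single-end vs paired-end.
        if bool(row["treatment_r2"]) != bool(row["control_r2"]):
            raise ValueError(f"line {line_no}: treatment/control layout mismatch")
        rows.append(row)
    return rows
```

Rejecting rows without a control file up front also guards against the missing-input-control pitfall.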

Narrow vs Broad Peak Mode Decision

Peak Type	Targets	MACS2 Mode
Narrow	H3K4me3, H3K4me1, H3K27ac, H3K9ac, all TFs (including CTCF)	`--qvalue 0.05` (default)
Broad	H3K27me3, H3K36me3, H3K9me3, H3K79me2	`--broad --broad-cutoff 0.1`
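
The table can be read as a lookup from target to MACS2 options. The sketch below builds a `macs2 callpeak` argument list on that basis. It is an illustration, not the pipeline's actual invocation: the function names, the `hs` genome-size default, and the choice to treat every unlisted target as narrow are assumptions.

```python
# Targets listed as "Broad" in the table; anything else is treated as narrow
# (assumption made for this sketch).
BROAD_TARGETS = {"H3K27me3", "H3K36me3", "H3K9me3", "H3K79me2"}


def peak_type_for(target):
    """Return 'broad' or 'narrow' for a ChIP target name."""
    return "broad" if target in BROAD_TARGETS else "narrow"


def macs2_callpeak_args(treatment_bam, control_bam, name, target,
                        paired_end=True, genome_size="hs"):
    """Build an argument list for `macs2 callpeak` (hypothetical helper)."""
    args = [
        "macs2", "callpeak",
        "-t", treatment_bam,
        "-c", control_bam,
        "-f", "BAMPE" if paired_end else "BAM",
        "-g", genome_size,
        "-n", name,
    ]
    if peak_type_for(target) == "broad":
        args += ["--broad", "--broad-cutoff", "0.1"]
    else:
        args += ["--qvalue", "0.05"]
    return args
```

Returning a list rather than a string lets the result be passed straight to `subprocess.run` without shell quoting.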

QC Thresholds

These thresholds follow ENCODE standards established by Landt et al. 2012 and the ENCODE DCC quality metrics documentation.

Metric	Threshold	Category	Source
Total sequenced reads	≥20M (TF), ≥45M (histone)	Read depth	Landt 2012
Mapping rate	>80%	Alignment	ENCODE
NRF (non-redundant fraction)	≥0.8	Library complexity	ENCODE
PBC1 (PCR bottleneck coeff 1)	≥0.8	Library complexity	ENCODE
PBC2 (PCR bottleneck coeff 2)	≥3	Library complexity	ENCODE
NSC (normalized strand cross-correlation coefficient)	>1.05	Enrichment	phantompeakqualtools
RSC (relative strand cross-correlation coefficient)	>0.8	Enrichment	phantompeakqualtools
FRiP (fraction reads in peaks)	≥1%	Peak quality	Landt 2012
IDR optimal peaks	>20,000 (TF)	Reproducibility	ENCODE
Duplication rate	<30%	Library complexity	ENCODE
Mitochondrial fraction	<5%	Sample quality	ENCODE
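
The three library-complexity metrics are ratios over read-position counts. The sketch below uses the usual ENCODE definitions: NRF is distinct positions over total reads, PBC1 is M1 over distinct positions, and PBC2 is M1 over M2, where M1 and M2 are the numbers of positions covered by exactly one and exactly two reads. Representing each read as a `(chrom, position, strand)` tuple is an assumption made for illustration. A real run would derive these counts from the filtered BAM.

```python
from collections import Counter


def library_complexity(read_positions):
    """Compute NRF, PBC1 and PBC2 from an iterable of hashable read positions."""
    counts = Counter(read_positions)
    total = sum(counts.values())   # all reads
    if total == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)         # positions with at least one read
    m1 = sum(1 for n in counts.values() if n == 1)  # exactly one read
    m2 = sum(1 for n in counts.values() if n == 2)  # exactly two reads
    return {
        "NRF": distinct / total,
        "PBC1": m1 / distinct,
        # PBC2 is undefined when no position has exactly two reads;
        # this sketch reports infinity in that case.
        "PBC2": m1 / m2 if m2 else float("inf"),
    }
```

For five reads where one position is hit twice, this gives NRF 0.8, PBC1 0.75 and PBC2 3.0, so NRF and PBC2 sit exactly on their thresholds while PBC1 falls below its own.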

Interpreting QC: Traffic Light System

Color	Meaning	Action
Green	All metrics pass	Proceed to analysis
Yellow	1-2 metrics marginal	Review library prep; may be usable
Red	Multiple failures	Do not use; repeat the experiment

Important: No single metric is sufficient. Interpret QC collectively. A sample with borderline NRF but excellent FRiP may still be usable.
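
One way to apply the traffic-light table programmatically is sketched below. It simplifies the scheme: the text does not define "marginal", so the sketch assumes that zero failed thresholds is green, one or two is yellow, and three or more is red. The metric keys are hypothetical names, fractions are expressed as 0-1 values, and read depth and IDR peak count are omitted because their thresholds depend on the target type.

```python
import operator

# (comparison, limit) pairs taken from the QC thresholds table.
THRESHOLDS = {
    "mapping_rate": (operator.gt, 0.80),
    "nrf": (operator.ge, 0.8),
    "pbc1": (operator.ge, 0.8),
    "pbc2": (operator.ge, 3),
    "nsc": (operator.gt, 1.05),
    "rsc": (operator.gt, 0.8),
    "frip": (operator.ge, 0.01),
    "duplication_rate": (operator.lt, 0.30),
    "mito_fraction": (operator.lt, 0.05),
}


def failed_metrics(metrics):
    """Return the names of supplied metrics that miss their threshold."""
    return [
        name
        for name, (passes, limit) in THRESHOLDS.items()
        if name in metrics and not passes(metrics[name], limit)
    ]


def traffic_light(metrics):
    """Map the number of failed metrics to green / yellow / red."""
    n_failed = len(failed_metrics(metrics))
    if n_failed == 0:
        return "green"
    if n_failed <= 2:
        return "yellow"  # assumption: 1-2 failures count as "marginal"
    return "red"
```

The color is a summary, not a verdict. Reporting `failed_metrics` alongside it keeps the collective interpretation described above possible.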

Execution

Quick Start (Local Docker)

nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --control 'fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --peak_type narrow \
  --outdir results/

SLURM HPC

nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --control 'fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --peak_type narrow \
  --outdir results/

Google Cloud

nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --control 'gs://bucket/fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'

AWS Batch

nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --control 's3://bucket/fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'
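
The four invocations above differ only in profile, paths, and whether `--peak_type` is passed. The sketch below assembles the same command line from Python, for example to launch runs from a wrapper script. `nextflow_command` is a hypothetical helper and uses only the options shown in the commands above.

```python
import shlex


def nextflow_command(profile, reads, control, outdir,
                     genome="GRCh38", peak_type=None):
    """Build the `nextflow run` argument list for one execution profile."""
    cmd = [
        "nextflow", "run", "scripts/main.nf",
        "-profile", profile,
        "--reads", reads,
        "--control", control,
        "--genome", genome,
    ]
    if peak_type is not None:
        cmd += ["--peak_type", peak_type]
    cmd += ["--outdir", outdir]
    return cmd


if __name__ == "__main__":
    cmd = nextflow_command(
        "slurm",
        "fastq/*_R{1,2}.fq.gz",
        "fastq/input_R{1,2}.fq.gz",
        "results/",
        peak_type="narrow",
    )
    # shlex.join quotes the glob patterns so the shell does not expand them.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell expansion of the glob patterns entirely, which matches the single-quoting used in the commands above.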

Cloud Cost Estimates

Platform	Instance	Cost/Sample	Time/Sample	Notes
GCP	n1-standard-8	~$2-5	2-4 hours	Preemptible recommended
AWS	m5.2xlarge	~$2-5	2-4 hours	Spot instances recommended
Local	8 cores, 32GB	$0	3-6 hours	Docker required
SLURM	8 cores, 32GB	Varies	2-4 hours	Singularity recommended

Output Directory Structure

results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  aligned/                  # Sorted BAM files
  filtered/                 # Filtered, deduplicated BAM
  peaks/
    narrow/                 # narrowPeak files (TF, active histone marks)
    broad/                  # broadPeak files (repressive marks)
    idr/                    # IDR-filtered reproducible peaks
  signal/
    fold_change/            # Fold change over control (bigWig)
    pvalue/                 # Signal p-value tracks (bigWig)
  qc/
    phantompeakqualtools/   # NSC/RSC strand correlation
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs
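
After a run, the layout above can be checked with a few lines of Python. This is a minimal sketch for a local output directory. `missing_output_dirs` is a hypothetical helper, and a run that calls only narrow peaks may legitimately lack `peaks/broad/` (and vice versa).

```python
from pathlib import Path

# Subdirectories taken from the output directory structure above.
EXPECTED_DIRS = [
    "fastqc", "trimmed", "aligned", "filtered",
    "peaks/narrow", "peaks/broad", "peaks/idr",
    "signal/fold_change", "signal/pvalue",
    "qc/phantompeakqualtools", "qc/multiqc",
    "logs",
]


def missing_output_dirs(outdir):
    """Return the expected subdirectories that do not exist under outdir."""
    root = Path(outdir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```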

Common Pitfalls

1. Missing Input Control

ChIP-seq requires a matched input (or IgG) control for accurate peak calling. Without it, MACS2 will call peaks against a uniform background model
