ENCODE CUT&RUN Pipeline: FASTQ to Peaks and Signal Tracks
When to Use
- User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
- User asks about "CUT&RUN pipeline", "CUT&Tag", "SEACR", "spike-in normalization", or "targeted chromatin"
- User needs to process CUT&RUN/CUT&Tag data with spike-in calibration and SEACR peak calling
- Example queries: "process my CUT&RUN FASTQs", "run SEACR on CUT&Tag data", "normalize CUT&RUN with spike-in controls"
Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling, producing peak calls with SEACR and spike-in normalized signal tracks.
Pipeline Overview
FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
          |                                     |               |
Bowtie2 align (spike-in)               Spike-in normalize  Signal tracks
          |
Scale factor calculation
ENCODE Repository
- GitHub: ENCODE-DCC/cutandrun-pipeline
- Container: encodedcc/cutandrun-pipeline
- This skill: Nextflow DSL2 reimplementation for portability
Core Tools and Versions
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Bowtie2 | 2.5.3 | Alignment (genome + spike-in) | Langmead & Salzberg 2012 |
| SEACR | 1.3 | Peak calling (CUT&RUN-specific) | Meers et al. 2019 |
| MACS2 | 2.2.9.1 | Alternative peak caller | Zhang et al. 2008 |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| bedtools | 2.31.0 | Genomic arithmetic | Quinlan & Hall 2010 |
| deepTools | 3.5.4 | Signal track generation | Ramirez et al. 2016 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Key Literature
- Skene & Henikoff 2017 - "An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations) DOI: 10.7554/eLife.21856
- Meers et al. 2019 - "Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations) DOI: 10.1186/s13072-019-0287-4
- Kaya-Okur et al. 2019 - "CUT&Tag for efficient epigenomic profiling of small samples and single cells" (Nature Communications, ~1,200 citations) DOI: 10.1038/s41467-019-09982-5
- Nordin et al. 2023 - "The CUT&RUN suspect list of problematic regions" (Genome Biology) DOI: 10.1186/s13059-023-02960-3
- Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Execution
Quick Start (Local)
nextflow run main.nf \
-profile local \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bowtie2_index '/ref/bowtie2_index/genome' \
--spikein_index '/ref/bowtie2_ecoli/ecoli' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--blacklist '/ref/hg38-blacklist.v2.bed' \
--outdir results/ \
-resume
SLURM HPC
nextflow run main.nf \
-profile slurm \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bowtie2_index '/ref/bowtie2_index/genome' \
--spikein_index '/ref/bowtie2_ecoli/ecoli' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--blacklist '/ref/hg38-blacklist.v2.bed' \
--outdir results/ \
-resume
Cloud (GCP / AWS)
nextflow run main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
--bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
--spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
--chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
--blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
--outdir 'gs://bucket/results/' \
-resume
Resource Requirements
| Step | CPUs | RAM | Time (per sample) |
|---|---|---|---|
| Bowtie2 align (genome) | 8 | 8 GB | 30-60 min |
| Bowtie2 align (spike-in) | 4 | 4 GB | 10-20 min |
| Filter/dedup | 4 | 8 GB | 15-30 min |
| SEACR peaks | 2 | 4 GB | 10-20 min |
| Signal tracks | 4 | 8 GB | 15-30 min |
| Total (peak CPUs/RAM, cumulative time) | 8 | 8 GB | 1.5-3 hours |
Pipeline Parameters
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern matching paired FASTQ files |
| --bowtie2_index | required | Bowtie2 genome index prefix |
| --spikein_index | required | Bowtie2 E. coli spike-in index prefix |
| --chrom_sizes | required | Chromosome sizes file |
| --blacklist | required | ENCODE blacklist BED file |
| --outdir | ./results | Output directory |
| --seacr_mode | stringent | SEACR mode: stringent or relaxed |
| --seacr_norm | norm | SEACR normalization: norm or non |
| --control | null | IgG control BAM (if available) |
| --peak_caller | seacr | Peak caller: seacr, macs2, or both |
| --skip_spikein | false | Skip spike-in normalization |
Output Files
results/
  fastqc/                              # Raw read quality
  alignment/
    {sample}.filtered.bam              # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt        # Spike-in read counts
    {sample}.scale_factor.txt          # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed       # SEACR stringent peaks
    {sample}.seacr.relaxed.bed         # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak    # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw             # Spike-in normalized signal
    {sample}.fragments.bed             # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html
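As a quick sanity check after a run, the layout above can be turned into a list of expected per-sample files. The sketch below is only illustrative: `expected_outputs` and `missing_outputs` are hypothetical helper names, and the peak files are left out on the assumption that which ones exist depends on `--peak_caller` and `--seacr_mode`. The demo runs against a temporary directory rather than a real results folder.

```python
import tempfile
from pathlib import Path

# Per-sample files taken from the output tree above (peak files excluded,
# since which ones are written is assumed to depend on pipeline options).
PER_SAMPLE_FILES = [
    "alignment/{sample}.filtered.bam",
    "alignment/{sample}.filtered.bam.bai",
    "spikein/{sample}.spikein_counts.txt",
    "spikein/{sample}.scale_factor.txt",
    "signal/{sample}.normalized.bw",
    "signal/{sample}.fragments.bed",
    "qc/{sample}.flagstat.txt",
    "qc/{sample}.fragment_sizes.txt",
    "qc/{sample}.frip.txt",
]


def expected_outputs(outdir, sample):
    """Return the expected per-sample output paths under outdir."""
    return [Path(outdir) / rel.format(sample=sample) for rel in PER_SAMPLE_FILES]


def missing_outputs(outdir, sample):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(outdir, sample) if not p.exists()]


# Demo: create only the filtered BAM, then list what is still missing.
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "alignment" / "sampleA.filtered.bam"
    bam.parent.mkdir(parents=True)
    bam.touch()
    missing = missing_outputs(tmp, "sampleA")

for path in missing:
    print("missing:", path.name)
```

In practice the same call would point at the real output directory, for example `missing_outputs("results", "sampleA")`.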
QC Thresholds
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Mapping rate (genome) | >80% | 60-80% | <60% |
| Spike-in reads | 1-10% of total | 0.1-1% or 10-30% | <0.1% or >30% |
| Duplication rate | <20% | 20-40% | >40% |
| FRiP (peaks) | >10% | 5-10% | <5% |
| Peak count | >5,000 | 1,000-5,000 | <1,000 |
| Fragment size | Nucleosomal pattern | Irregular | No pattern |
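The numeric thresholds in this table can be applied mechanically. The following is a minimal sketch, not part of the pipeline: the function names and the metric keys (`mapping_rate`, `spikein_percent`, and so on) are hypothetical, percentages are assumed to be expressed on a 0-100 scale, and values that fall exactly on a boundary are treated as warnings (or as pass for the spike-in range), which is an assumption the table does not settle. Fragment size is omitted because it is a qualitative check.

```python
def higher_is_better(value, pass_above, fail_below):
    """Pass above one cutoff, fail below the other, warn in between."""
    if value > pass_above:
        return "pass"
    if value < fail_below:
        return "fail"
    return "warning"


def lower_is_better(value, pass_below, fail_above):
    """Pass below one cutoff, fail above the other, warn in between."""
    if value < pass_below:
        return "pass"
    if value > fail_above:
        return "fail"
    return "warning"


def spikein_status(percent):
    """Spike-in share of total reads: 1-10% pass, 0.1-1% or 10-30% warning."""
    if 1 <= percent <= 10:
        return "pass"
    if 0.1 <= percent <= 30:
        return "warning"
    return "fail"


def qc_report(metrics):
    """Classify each numeric metric from the QC thresholds table."""
    return {
        "mapping_rate": higher_is_better(metrics["mapping_rate"], 80, 60),
        "spikein_percent": spikein_status(metrics["spikein_percent"]),
        "duplication_rate": lower_is_better(metrics["duplication_rate"], 20, 40),
        "frip": higher_is_better(metrics["frip"], 10, 5),
        "peak_count": higher_is_better(metrics["peak_count"], 5000, 1000),
    }


# Made-up example values, for illustration only.
example = {
    "mapping_rate": 92.5,
    "spikein_percent": 0.4,
    "duplication_rate": 45.0,
    "frip": 7.0,
    "peak_count": 12000,
}
report = qc_report(example)
for metric, status in report.items():
    print(f"{metric}: {status}")
```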
Fragment Size Distribution
CUT&RUN produces a characteristic nucleosomal ladder:
- <120 bp: Sub-nucleosomal (TF binding)
- ~150 bp: Mononucleosomal (histone marks)
- ~300 bp: Dinucleosomal
- Absence of nucleosomal pattern suggests protocol issues
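One way to inspect the ladder numerically is to bin fragment lengths into the classes listed above. The sketch below is illustrative only. The source gives just approximate centers (~150 bp, ~300 bp), so the cutoffs used here (120, 250 and 450 bp) are assumptions, as are the function names. The input is a plain list of fragment lengths in bp, however they were obtained.

```python
from collections import Counter

CLASSES = ("sub-nucleosomal", "mononucleosomal", "dinucleosomal", "larger")


def fragment_class(length, mono_min=120, di_min=250, di_max=450):
    """Assign a fragment length (bp) to a size class.

    Only the 120 bp boundary comes from the text; di_min and di_max
    are assumed cutoffs chosen to bracket ~150 bp and ~300 bp.
    """
    if length < mono_min:
        return "sub-nucleosomal"
    if length < di_min:
        return "mononucleosomal"
    if length < di_max:
        return "dinucleosomal"
    return "larger"


def class_fractions(lengths):
    """Return the fraction of fragments falling in each size class."""
    if not lengths:
        raise ValueError("no fragment lengths given")
    counts = Counter(fragment_class(n) for n in lengths)
    return {name: counts[name] / len(lengths) for name in CLASSES}


# Made-up lengths, for illustration only.
lengths = [60, 95, 110, 148, 152, 160, 300, 310]
fractions = class_fractions(lengths)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

A sample dominated by the "larger" class, or with no visible enrichment near the mono- and dinucleosomal sizes, would be a candidate for the protocol issues mentioned above.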
Spike-in Normalization
Spike-in normalization is CRITICAL for CUT&RUN quantitative comparison.
How It Works
- E. coli DNA is carried over from pA-MNase/pA-Tn5 production
- Each sample has a different amount of spike-in reads
- Samples with more target cleavage have fewer spike-in reads (proportionally)
- Scale factor = minimum spike-in reads across samples / spike-in reads in the sample, so the sample with the fewest spike-in reads gets a factor of 1.0
Scale Factor Calculation
Sample A: 200,000 spike-in reads -> scale = 100,000 / 200,000 = 0.5
Sample B: 400,000 spike-in reads -> scale = 100,000 / 400,000 = 0.25
Sample C: 100,000 spike-in reads -> scale = 1.0 (minimum)
Higher spike-in counts = less target enrichment = lower scale factor.
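The calculation can be sketched in a few lines of Python. This applies the formula from "How It Works" literally (minimum spike-in count divided by each sample's count), so the sample with the fewest spike-in reads receives 1.0. The function name `spikein_scale_factors` is hypothetical and the code is not taken from the pipeline.

```python
def spikein_scale_factors(spikein_reads):
    """Return min(spike-in reads) / spike-in reads for every sample.

    spikein_reads maps sample name -> number of reads aligned to the
    spike-in genome.
    """
    if not spikein_reads:
        raise ValueError("no samples given")
    if any(n <= 0 for n in spikein_reads.values()):
        raise ValueError("every sample needs a positive spike-in read count")
    reference = min(spikein_reads.values())
    return {sample: reference / n for sample, n in spikein_reads.items()}


counts = {"A": 200_000, "B": 400_000, "C": 100_000}
factors = spikein_scale_factors(counts)
for sample, factor in factors.items():
    print(f"Sample {sample}: scale = {factor:.2f}")
```

Samples with zero spike-in reads are rejected rather than given an infinite factor; such samples would need to be handled separately (for example with `--skip_spikein`).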
SEACR vs MACS2
| Feature | SEACR | MACS2 |
|---|---|---|
| Designed for | CUT&RUN/CUT&Tag | ChIP-seq |
| Background model | Sparse enrichment | Dynamic Poisson |
| Control required | Optional (IgG) | Recommended |
| Low background | Handles well | May overcall |
| Stringent mode | Very conservative | Via q-value |
| ENCODE recommendation | Primary for CUT&RUN | Alternative |
SEACR is specifically designed for the sparse, low-background signal profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.
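For orientation, here is a hedged sketch of how a standalone SEACR call might be assembled. It assumes SEACR's positional usage (target bedgraph, then a control bedgraph or a numeric threshold, then norm/non, then relaxed/stringent, then an output prefix) and the script name `SEACR_1.3.sh`; both should be checked against the README of the installed version. The helper name, the default threshold of 0.01 and the file paths are hypothetical, and this is not necessarily how the pipeline invokes SEACR.

```python
def seacr_command(target_bedgraph, output_prefix, control_bedgraph=None,
                  threshold=0.01, norm="norm", mode="stringent",
                  script="SEACR_1.3.sh"):
    """Build an argument list for a SEACR run (assumed positional usage)."""
    if norm not in ("norm", "non"):
        raise ValueError("norm must be 'norm' or 'non'")
    if mode not in ("stringent", "relaxed"):
        raise ValueError("mode must be 'stringent' or 'relaxed'")
    if control_bedgraph is None:
        # Without an IgG control, a numeric threshold takes the control's
        # place; there is no control track to normalize, so use 'non'.
        if not 0 < threshold < 1:
            raise ValueError("threshold must be between 0 and 1")
        background = str(threshold)
        norm = "non"
    else:
        background = str(control_bedgraph)
    return ["bash", script, str(target_bedgraph), background, norm, mode,
            str(output_prefix)]


# Hypothetical file names, for illustration only.
with_control = seacr_command("sample.bedgraph", "peaks/sample.seacr",
                             control_bedgraph="igg.bedgraph")
without_control = seacr_command("sample.bedgraph", "peaks/sample.seacr")
print(" ".join(with_control))
print(" ".join(without_control))
```

The returned list can be passed to `subprocess.run(..., check=True)` once the script path and input files are confirmed to exist.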
Critical Pitfalls
Spike-in Calibration is CRITICAL
Without spike-in normalization, quantitative comparisons between samples are unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments, and spike-in reads provide the internal calibration standard.
IgG Control vs No-Antibody Control
- IgG control: Non-specific antibody, captures background binding
- No-antibody: No antibody, c