ENCODE CUT&RUN Pipeline: FASTQ to Peaks and Signal Tracks

When to Use

User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
User asks about "CUT&RUN pipeline", "CUT&Tag", "SEACR", "spike-in normalization", or "targeted chromatin"
User needs to process CUT&RUN/CUT&Tag data with spike-in calibration and SEACR peak calling
Example queries: "process my CUT&RUN FASTQs", "run SEACR on CUT&Tag data", "normalize CUT&RUN with spike-in controls"

Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling, producing peak calls with SEACR and spike-in normalized signal tracks.

Pipeline Overview

FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
                     |                          |              |
              Bowtie2 align (spike-in)   Spike-in normalize  Signal tracks
                     |
              Scale factor calculation

ENCODE Repository

GitHub: ENCODE-DCC/cutandrun-pipeline
Container: encodedcc/cutandrun-pipeline
This skill: Nextflow DSL2 reimplementation for portability

Core Tools and Versions

Tool	Version	Purpose	Citation
Bowtie2	2.5.3	Alignment (genome + spike-in)	Langmead & Salzberg 2012
SEACR	1.3	Peak calling (CUT&RUN-specific)	Meers et al. 2019
MACS2	2.2.9.1	Alternative peak caller	Zhang et al. 2008
Picard	3.1.1	Duplicate marking	Broad Institute
samtools	1.19	BAM operations	Li et al. 2009
bedtools	2.31.0	Genomic arithmetic	Quinlan & Hall 2010
deepTools	3.5.4	Signal track generation	Ramirez et al. 2016
FastQC	0.12.1	Read quality	Andrews (Babraham)
MultiQC	1.21	Aggregated QC	Ewels et al. 2016

Key Literature

Skene & Henikoff 2017 - "An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations) DOI: 10.7554/eLife.21856
Meers et al. 2019 - "Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations) DOI: 10.1186/s13072-019-0287-4
Kaya-Okur et al. 2019 - "CUT&Tag for efficient epigenomic profiling of small samples and single cells" (Nature Communications, ~1,200 citations) DOI: 10.1038/s41467-019-09982-5
Nordin et al. 2023 - "The CUT&RUN suspect list of problematic regions" (Genome Biology) DOI: 10.1186/s13059-023-02960-3
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z

Execution

Quick Start (Local)

nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

SLURM HPC

nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

Cloud (GCP / AWS)

nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
    --spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
    --outdir 'gs://bucket/results/' \
    -resume

Resource Requirements

Step	CPUs	RAM	Time (per sample)
Bowtie2 align (genome)	8	8 GB	30-60 min
Bowtie2 align (spike-in)	4	4 GB	10-20 min
Filter/dedup	4	8 GB	15-30 min
SEACR peaks	2	4 GB	10-20 min
Signal tracks	4	8 GB	15-30 min
Total (peak CPUs/RAM, end-to-end time)	8	8 GB	1.5-3 hours

Pipeline Parameters

Parameter	Default	Description
`--reads`	required	Glob pattern to paired FASTQ files
`--bowtie2_index`	required	Bowtie2 genome index prefix
`--spikein_index`	required	Bowtie2 E. coli spike-in index prefix
`--chrom_sizes`	required	Chromosome sizes file
`--blacklist`	required	ENCODE blacklist BED file
`--outdir`	`./results`	Output directory
`--seacr_mode`	`stringent`	SEACR mode: `stringent` or `relaxed`
`--seacr_norm`	`norm`	SEACR normalization: `norm` or `non`
`--control`	`null`	IgG control BAM (if available)
`--peak_caller`	`seacr`	Peak caller: `seacr` or `macs2` or `both`
`--skip_spikein`	`false`	Skip spike-in normalization

Output Files

results/
  fastqc/                           # Raw read quality
  alignment/
    {sample}.filtered.bam           # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt     # Spike-in read counts
    {sample}.scale_factor.txt       # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed    # SEACR stringent peaks
    {sample}.seacr.relaxed.bed      # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw          # Spike-in normalized signal
    {sample}.fragments.bed          # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html

QC Thresholds

Metric	Pass	Warning	Fail
Mapping rate (genome)	>80%	60-80%	<60%
Spike-in reads	1-10% of total	0.1-1% or 10-30%	<0.1% or >30%
Duplication rate	<20%	20-40%	>40%
FRiP (peaks)	>10%	5-10%	<5%
Peak count	>5,000	1,000-5,000	<1,000
Fragment size	Nucleosomal pattern	Irregular	No pattern
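
The numeric thresholds in the table above can be applied programmatically. The following is a minimal sketch, not the pipeline's own QC code: the metric key names (`mapping_rate_pct`, `spikein_pct`, `duplication_pct`, `frip_pct`, `peak_count`) are hypothetical, and values that fall exactly on a boundary (for example a mapping rate of exactly 80%) are assumed to count as a warning, since the table does not say which side they belong to. The fragment-size row is qualitative and is not covered here.

```python
from typing import Dict


def higher_is_better(value: float, pass_above: float, fail_below: float) -> str:
    """Classify a metric where larger values are better (e.g. mapping rate)."""
    if value > pass_above:
        return "pass"
    if value < fail_below:
        return "fail"
    return "warning"  # assumption: boundary values are treated as warnings


def lower_is_better(value: float, pass_below: float, fail_above: float) -> str:
    """Classify a metric where smaller values are better (e.g. duplication)."""
    if value < pass_below:
        return "pass"
    if value > fail_above:
        return "fail"
    return "warning"


def spikein_status(pct: float) -> str:
    """Spike-in share of total reads: 1-10% pass, 0.1-1% or 10-30% warning."""
    if 1.0 <= pct <= 10.0:
        return "pass"
    if 0.1 <= pct <= 30.0:
        return "warning"
    return "fail"


def qc_report(metrics: Dict[str, float]) -> Dict[str, str]:
    """Map each metric (hypothetical key names) to pass / warning / fail."""
    return {
        "mapping_rate_pct": higher_is_better(metrics["mapping_rate_pct"], 80, 60),
        "spikein_pct": spikein_status(metrics["spikein_pct"]),
        "duplication_pct": lower_is_better(metrics["duplication_pct"], 20, 40),
        "frip_pct": higher_is_better(metrics["frip_pct"], 10, 5),
        "peak_count": higher_is_better(metrics["peak_count"], 5000, 1000),
    }


# Made-up values for illustration only.
example = {
    "mapping_rate_pct": 92.5,
    "spikein_pct": 0.4,
    "duplication_pct": 45.0,
    "frip_pct": 7.0,
    "peak_count": 12000,
}
report = qc_report(example)
```

In this made-up example the sample passes on mapping rate and peak count, gets a warning for spike-in share and FRiP, and fails on duplication rate.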

Fragment Size Distribution

CUT&RUN produces a characteristic nucleosomal ladder:

<120 bp: Sub-nucleosomal (TF binding)
~150 bp: Mononucleosomal (histone marks)
~300 bp: Dinucleosomal
Absence of nucleosomal pattern suggests protocol issues
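
One way to check for the ladder described above is to bin fragment lengths into these classes and look at the fraction in each. This is a rough sketch under stated assumptions: only the 120 bp cutoff comes from the list above, while the 250 bp and 450 bp boundaries between the mono-, di-, and larger classes are illustrative choices around the ~150 bp and ~300 bp modes, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Only the 120 bp cutoff is taken from the text; the other two are assumptions.
SUB_NUCLEOSOMAL_MAX = 120
MONO_NUCLEOSOMAL_MAX = 250
DI_NUCLEOSOMAL_MAX = 450

CLASSES = ("sub-nucleosomal", "mononucleosomal", "dinucleosomal", "other")


def fragment_class(length_bp: int) -> str:
    """Assign a fragment length (bp) to a nucleosomal size class."""
    if length_bp < SUB_NUCLEOSOMAL_MAX:
        return "sub-nucleosomal"
    if length_bp < MONO_NUCLEOSOMAL_MAX:
        return "mononucleosomal"
    if length_bp < DI_NUCLEOSOMAL_MAX:
        return "dinucleosomal"
    return "other"


def class_fractions(lengths: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of fragments that falls into each size class."""
    counts = Counter(fragment_class(n) for n in lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragment lengths given")
    return {name: counts[name] / total for name in CLASSES}


# Made-up fragment lengths for illustration only.
lengths = [80, 95, 150, 160, 170, 310, 330, 600]
fractions = class_fractions(lengths)
```

A sample dominated by the "other" class, or with no visible separation between classes, would be a candidate for the "Irregular" or "No pattern" rows of the QC table.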

Spike-in Normalization

Spike-in normalization is CRITICAL for quantitative comparison between CUT&RUN samples.

How It Works

E. coli DNA is carried over from pA-MNase/pA-Tn5 production
Each sample has a different amount of spike-in reads
Samples with more target cleavage have fewer spike-in reads (proportionally)
Scale factor = minimum spike-in reads across samples / spike-in reads in the sample (equivalently, 1 / (spike-in reads / minimum spike-in reads across samples))

Scale Factor Calculation

Sample A: 200,000 spike-in reads -> scale = 100,000 / 200,000 = 0.5
Sample B: 400,000 spike-in reads -> scale = 100,000 / 400,000 = 0.25
Sample C: 100,000 spike-in reads -> scale = 1.0 (minimum)

Higher spike-in counts = less target enrichment = lower scale factor.
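
The rule above (minimum spike-in count across samples divided by each sample's spike-in count) can be sketched in a few lines. This is an illustration rather than the pipeline's actual implementation; the function name, sample names, and read counts are hypothetical.

```python
from typing import Dict


def spikein_scale_factors(spikein_reads: Dict[str, int]) -> Dict[str, float]:
    """Scale factor per sample = min spike-in count / sample spike-in count."""
    if not spikein_reads:
        raise ValueError("no samples given")
    if any(n <= 0 for n in spikein_reads.values()):
        raise ValueError("every sample needs a positive spike-in read count")
    reference = min(spikein_reads.values())
    return {sample: reference / n for sample, n in spikein_reads.items()}


# Hypothetical spike-in read counts for illustration only.
counts = {"igg": 50_000, "h3k27me3_rep1": 100_000, "h3k27me3_rep2": 200_000}
factors = spikein_scale_factors(counts)
```

With these counts the sample with the fewest spike-in reads gets a factor of 1.0, and every other sample is scaled down in proportion to its spike-in count (0.5 and 0.25 here).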

SEACR vs MACS2

Feature	SEACR	MACS2
Designed for	CUT&RUN/CUT&Tag	ChIP-seq
Background model	Sparse enrichment	Dynamic Poisson
Control required	Optional (IgG)	Recommended
Low background	Handles well	May overcall
Stringent mode	Very conservative	Via q-value
ENCODE recommendation	Primary for CUT&RUN	Alternative

SEACR is specifically designed for the sparse, low-background signal profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.
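
To make the SEACR options concrete, the sketch below assembles a SEACR command line from the choices exposed by `--seacr_mode`, `--seacr_norm`, and `--control`. It follows the positional argument order documented for the SEACR shell script (target bedGraph, control bedGraph or numeric threshold, `norm`/`non`, `relaxed`/`stringent`, output prefix). How this pipeline actually invokes SEACR is not shown in this document, so the function name, the script path, the file names, and the 0.01 fallback threshold are all assumptions. Note that SEACR takes bedGraph input, so a control BAM would first have to be converted.

```python
from typing import List, Optional


def build_seacr_command(
    target_bedgraph: str,
    output_prefix: str,
    control_bedgraph: Optional[str] = None,
    top_fraction: float = 0.01,      # assumption: used only without a control
    norm: str = "norm",
    mode: str = "stringent",
    script: str = "SEACR_1.3.sh",    # assumption: script name/path on this system
) -> List[str]:
    """Build the argument list for a SEACR run (does not execute it)."""
    if norm not in ("norm", "non"):
        raise ValueError("norm must be 'norm' or 'non'")
    if mode not in ("stringent", "relaxed"):
        raise ValueError("mode must be 'stringent' or 'relaxed'")
    if control_bedgraph is None and not 0 < top_fraction < 1:
        raise ValueError("top_fraction must be between 0 and 1")
    # Second positional argument: control bedGraph, or a numeric threshold
    # (fraction of top regions to keep) when no control is available.
    second = control_bedgraph if control_bedgraph is not None else str(top_fraction)
    return ["bash", script, target_bedgraph, second, norm, mode, output_prefix]


# Hypothetical file names for illustration only.
with_control = build_seacr_command(
    "sample.fragments.bedgraph", "sample.seacr", control_bedgraph="igg.fragments.bedgraph"
)
without_control = build_seacr_command(
    "sample.fragments.bedgraph", "sample.seacr", mode="relaxed"
)
```

The returned list can be passed to `subprocess.run` once the script path and input files are confirmed for the local installation.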

Critical Pitfalls

Spike-in Calibration is CRITICAL

Without spike-in normalization, quantitative comparisons between samples are unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments, and spike-in reads provide the internal calibration standard.

IgG Control vs No-Antibody Control

IgG control: Non-specific antibody, captures background binding
No-antibody: No antibody, c
