ENCODE ATAC-seq Pipeline
When to Use
- User wants to run an ATAC-seq processing pipeline from FASTQ to peaks and signal tracks
- User asks about "ATAC-seq pipeline", "Tn5 shift", "chromatin accessibility pipeline", or "Bowtie2 for ATAC"
- User needs to process ATAC-seq data with proper Tn5 insertion site correction
- Example queries: "process my ATAC-seq FASTQs", "run ENCODE ATAC-seq pipeline", "call accessibility peaks from ATAC-seq"
Execute the ENCODE ATAC-seq processing pipeline from raw FASTQ files through Tn5 offset correction, peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.
Overview
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses the Tn5 transposase to probe open chromatin regions. The ENCODE pipeline processes ATAC-seq data through quality control, alignment with Bowtie2, Tn5 insertion site correction (+4/-5 bp offset), mitochondrial read removal, nucleosome-free fragment selection, peak calling with MACS2, and IDR-based replicate consistency analysis.
Key differences from ChIP-seq: alignment with Bowtie2 (well suited to short fragments), Tn5 transposase shift correction, aggressive filtering of mitochondrial reads (which can account for 30-80% of all reads), the nucleosomal fragment size distribution as a QC metric, and the TSS enrichment score as the primary quality indicator.
Key Literature
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Buenrostro et al. "Transposition of native chromatin (ATAC-seq)" | Nature Methods | 2013 | 10.1038/nmeth.2688 | Original ATAC-seq method (~5,000 citations) |
| Corces et al. "An improved ATAC-seq protocol" | Nature Methods | 2017 | 10.1038/nmeth.4396 | Omni-ATAC improvements (~2,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Amemiya et al. "ENCODE Blacklist" | Scientific Reports | 2019 | 10.1038/s41598-019-45839-z | Artifact regions (~1,372 citations) |
| Langmead & Salzberg "Fast gapped-read alignment with Bowtie 2" | Nature Methods | 2012 | 10.1038/nmeth.1923 | Aligner (~30,000 citations) |
| Yan et al. "From reads to insight: ATAC-seq analysis" | Genome Biology | 2020 | 10.1186/s13059-020-1929-3 | Analysis best practices |
Pipeline Stages
FASTQ ──> FastQC / Trim Galore ──> Bowtie2 ──> Mito Removal + Tn5 Shift
  │                                                       │
  │       ┌───────────────────────────────────────────────┘
  │       v
  │  Picard MarkDup ──> Blacklist Filter ──> Size Selection
  │                                                │
  │           ┌─────────────────┬──────────────────┘
  │           v                 v
  │     NFR Fragments    Mono-Nucleosome
  │           │
  │           v
  │  MACS2 Peak Calling ──> IDR Analysis
  │           │                  │
  │           v                  v
  │     Signal Tracks        QC Report (MultiQC + ataqv)
  v
Raw QC Report
Stage Summary
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | Bowtie2 | Trimmed FASTQ | Sorted BAM | references/02-alignment.md |
| 3. Tn5 Shift & Filtering | Samtools, bedtools, Picard | Sorted BAM | Shifted, filtered BAM | references/03-tn5-filtering.md |
| 4. Peak Calling & IDR | MACS2, IDR | Filtered BAM | Peaks (narrowPeak) | references/04-peak-calling.md |
| 5. QC & Signal | deeptools, ataqv, MultiQC | Filtered BAM, Peaks | bigWig, QC report | references/05-qc-metrics.md |
Input Requirements
Required Files
- ATAC-seq FASTQ: Paired-end reads (strongly recommended; single-end supported)
- Reference genome: Bowtie2-indexed genome (GRCh38 for human, mm10 for mouse)
Sample Sheet Format
sample_id,read1,read2,replicate
SAMPLE1_rep1,atac_rep1_R1.fq.gz,atac_rep1_R2.fq.gz,1
SAMPLE1_rep2,atac_rep2_R1.fq.gz,atac_rep2_R2.fq.gz,2
No input control needed: Unlike ChIP-seq, ATAC-seq does not require a separate input or IgG control. MACS2 calls peaks against a local background model.
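As a rough illustration of the sample sheet format above, the following Python sketch parses and validates a sheet before a run. It is not part of the pipeline itself: the function name `parse_sample_sheet` is hypothetical, and treating an empty `read2` field as a single-end sample is an assumption, since the sheet format does not say how single-end input is declared.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "read1", "read2", "replicate"]


def parse_sample_sheet(text):
    """Parse sample-sheet CSV text into a list of validated row dicts."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):
        # Short rows yield None for absent fields, so normalise to "".
        sample_id = (row["sample_id"] or "").strip()
        read1 = (row["read1"] or "").strip()
        read2 = (row["read2"] or "").strip()
        if not sample_id or not read1:
            raise ValueError(f"line {line_no}: sample_id and read1 are required")
        if sample_id in seen_ids:
            raise ValueError(f"line {line_no}: duplicate sample_id {sample_id!r}")
        seen_ids.add(sample_id)
        rows.append({
            "sample_id": sample_id,
            "read1": read1,
            # ASSUMPTION: an empty read2 field marks a single-end sample.
            "read2": read2 or None,
            "paired_end": bool(read2),
            "replicate": int((row["replicate"] or "").strip()),
        })
    return rows
```

Calling `parse_sample_sheet(open("samples.csv").read())` on a sheet like the one above would return one dict per replicate; a non-integer `replicate` value raises `ValueError`.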
Tn5 Transposase Offset Correction
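Tn5 binds DNA as a dimer and inserts its two sequencing adapters 9 bp apart, which creates a 9-bp target-site duplication. To center the 5' end of each read on the middle of the insertion site, alignments are shifted by strand: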
- Forward strand (+): shift +4 bp
- Reverse strand (-): shift -5 bp
This correction is essential for accurate footprinting and motif analysis.
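The offsets above can be sketched in Python on BED-style intervals. This is a minimal illustration, not the pipeline's own code: the function names are hypothetical, and it assumes the common convention of moving only the 5' end of each read (the start of a + strand read, the end of a - strand read).

```python
PLUS_SHIFT = 4    # bp added to the 5' end of + strand reads
MINUS_SHIFT = -5  # bp added to the 5' end of - strand reads


def shift_read(start, end, strand):
    """Shift the 5' end of one aligned read given as a BED-style interval."""
    if strand == "+":
        start += PLUS_SHIFT   # 5' end of a + strand read is its start
    elif strand == "-":
        end += MINUS_SHIFT    # 5' end of a - strand read is its end
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    if start < 0 or start >= end:
        raise ValueError("read too short to shift")
    return start, end


def shift_fragment(start, end):
    """Shift both ends of a paired-end fragment (leftmost mate on +, rightmost on -)."""
    start += PLUS_SHIFT
    end += MINUS_SHIFT
    if start >= end:
        raise ValueError("fragment too short to shift")
    return start, end
```

For example, `shift_read(100, 150, "+")` gives `(104, 150)` and `shift_fragment(100, 300)` gives `(104, 295)`. In practice the same adjustment is usually applied to BAM or BED files by a dedicated tool rather than by hand.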
Fragment Size Distribution
ATAC-seq produces a characteristic nucleosomal ladder pattern:
| Fragment Class | Size Range | Biological Meaning |
|---|---|---|
| Nucleosome-free (NFR) | <150 bp | Open chromatin / TF binding |
| Mono-nucleosome | 150-300 bp | Single nucleosome wrapping |
| Di-nucleosome | 300-500 bp | Two nucleosomes |
| Tri-nucleosome | 500-700 bp | Three nucleosomes |
For peak calling, use nucleosome-free reads (<150 bp) only.
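A small Python sketch of the size classes in the table above follows. The names are hypothetical, and because the table's ranges share their endpoints, the sketch assumes half-open bins (for example, a 300 bp fragment is counted as di-nucleosome); fragments of 700 bp or more fall into an extra "other" class that is not in the table.

```python
from collections import Counter

# ASSUMPTION: half-open bins [lower, upper) built from the table's ranges.
SIZE_CLASSES = [
    ("nucleosome_free", 0, 150),
    ("mono_nucleosome", 150, 300),
    ("di_nucleosome", 300, 500),
    ("tri_nucleosome", 500, 700),
]


def classify_fragment(length):
    """Return the size class name for a fragment length in bp."""
    if length <= 0:
        raise ValueError("fragment length must be positive")
    for name, lower, upper in SIZE_CLASSES:
        if lower <= length < upper:
            return name
    return "other"


def size_class_fractions(lengths):
    """Return the fraction of fragments in each size class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragment lengths given")
    return {name: count / total for name, count in counts.items()}
```

The `nucleosome_free` fraction returned by `size_class_fractions` corresponds to the NFR fraction used as a QC metric, and filtering on `classify_fragment(length) == "nucleosome_free"` selects the fragments used for peak calling.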
QC Thresholds
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Total sequenced reads | >=50M (recommended) | Read depth | ENCODE |
| Mapping rate | >80% | Alignment | ENCODE |
| Mitochondrial fraction | <20% (ideal <5%) | Sample quality | ENCODE |
| NRF (non-redundant fraction) | >=0.8 | Library complexity | ENCODE |
| PBC1 | >=0.8 | Library complexity | ENCODE |
| TSS enrichment score | >=5 | Signal quality | ENCODE |
| FRiP | >=0.3 | Peak quality | ENCODE |
| NFR fraction | >0.4 of fragments <150 bp | Fragment distribution | Buenrostro 2013 |
| IDR optimal peaks | >50,000 | Reproducibility | ENCODE |
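To make the library complexity metrics concrete, the sketch below computes NRF and PBC1 from a list of read positions and checks a set of metrics against the thresholds in the table. It follows the standard ENCODE definitions (NRF is distinct positions divided by total reads; PBC1 is positions covered by exactly one read divided by distinct positions). The function names and metric keys are hypothetical, and in practice these values come from the pipeline's QC outputs rather than from in-memory lists.

```python
import operator
from collections import Counter

_OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt}

# Thresholds copied from the QC table; the key names are illustrative.
THRESHOLDS = {
    "total_reads": (">=", 50_000_000),
    "mapping_rate": (">", 0.80),
    "mito_fraction": ("<", 0.20),
    "NRF": (">=", 0.8),
    "PBC1": (">=", 0.8),
    "tss_enrichment": (">=", 5),
    "FRiP": (">=", 0.3),
    "nfr_fraction": (">", 0.4),
    "idr_optimal_peaks": (">", 50_000),
}


def library_complexity(positions):
    """Compute NRF and PBC1 from hashable read positions, e.g. (chrom, start, strand)."""
    counts = Counter(positions)
    if not counts:
        raise ValueError("no reads given")
    total = sum(counts.values())                            # all reads
    distinct = len(counts)                                  # distinct positions
    singletons = sum(1 for n in counts.values() if n == 1)  # positions seen once
    return {"NRF": distinct / total, "PBC1": singletons / distinct}


def check_qc(metrics):
    """Return {metric: passed} for every supplied metric that has a threshold."""
    results = {}
    for name, value in metrics.items():
        if name in THRESHOLDS:
            op, cutoff = THRESHOLDS[name]
            results[name] = _OPS[op](value, cutoff)
    return results
```

A typical use would be `check_qc({**library_complexity(positions), "mito_fraction": 0.05})`, which reports each metric as passing or failing without deciding whether the sample as a whole should be rejected.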
TSS Enrichment Score
The TSS enrichment score measures the fold enrichment of ATAC-seq signal at transcription start sites compared to flanking regions. It is generally treated as the single most informative QC metric for ATAC-seq. Scores are interpreted as follows:
| Score | Quality | Interpretation |
|---|---|---|
| >=7 | Excellent | High signal-to-noise |
| 5 to <7 | Good | Acceptable for most analyses |
| 3 to <5 | Marginal | Review other metrics carefully |
| <3 | Poor | Likely failed; consider repeating the experiment |
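The calculation can be sketched as follows, assuming an aggregate per-base signal profile centered on the TSS has already been built (for example, summed read depth over a window of 2,000 bp on either side of every annotated TSS). The sketch follows the ENCODE description of the metric: normalize by the mean depth in the 100 bp at each end of the window, then take the value at the center. The function names are hypothetical, and tools such as ataqv may differ in smoothing and in the exact windows used.

```python
def tss_enrichment(profile, flank=100):
    """Return the fold enrichment at the window center over the mean flank signal.

    profile: aggregate per-base signal across a window centered on the TSS.
    flank:   number of positions at each end used as background.
    """
    n = len(profile)
    if n < 2 * flank + 1:
        raise ValueError("profile is shorter than the two flanks plus a center")
    flanks = list(profile[:flank]) + list(profile[-flank:])
    background = sum(flanks) / len(flanks)
    if background <= 0:
        raise ValueError("flank signal is zero; enrichment is undefined")
    return profile[n // 2] / background


def tss_quality(score):
    """Map a TSS enrichment score to the quality tiers in the table above."""
    if score >= 7:
        return "Excellent"
    if score >= 5:
        return "Good"
    if score >= 3:
        return "Marginal"
    return "Poor"
```

With a toy profile `[1, 1, 2, 8, 2, 1, 1]` and `flank=2`, the background is 1.0 and the center value is 8, so the score is 8.0 and `tss_quality` reports "Excellent".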
Execution
Quick Start (Local Docker)
nextflow run scripts/main.nf \
-profile local \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
SLURM HPC
nextflow run scripts/main.nf \
-profile slurm \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
Google Cloud
nextflow run scripts/main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 'gs://bucket/results/'
AWS Batch
nextflow run scripts/main.nf \
-profile aws \
--reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 's3://bucket/results/'
Cloud Cost Estimates
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-standard-8 | ~$2-4 | 2-3 hours | Preemptible recommended |
| AWS | m5.2xlarge | ~$2-4 | 2-3 hours | Spot instances recommended |
| Local | 8 cores, 32GB | $0 | 3-5 hours | Docker required |
| SLURM | 8 cores, 32GB | Varies | 2-3 hours | Singularity recommended |
Output Directory Structure
results/
  fastqc/            # Raw and trimmed QC reports
  trimmed/           # Trimmed FASTQ files
  aligned/           # Sorted BAM files (pre-filtering)
  filtered/
    shifted/         # Tn5-corrected BAM files
    nfr/             # Nucleosome-free fragments (<150 bp)
    mononuc/         # Mono-nucleosome fragments (150-300 bp)
  peaks/
    narrow/          # MACS2 narrowPeak files
    idr/             # IDR-filtered reproducible peaks
  signal/            # bigWig signal tracks
  qc/
    tss_enrichment