ENCODE Pipeline Guide and Custom Workflow Generation
When to Use
- User wants to understand ENCODE uniform analysis pipelines or run them on their own data
- User asks about "ENCODE pipeline", "Nextflow", "WDL", "processing standards", or "pipeline requirements"
- User needs to generate a custom Nextflow/WDL workflow based on ENCODE pipeline specifications
- User wants to know compute requirements (CPU, GPU, memory, storage) for running pipelines
- Example queries: "how do I run the ENCODE ChIP-seq pipeline?", "what are the compute requirements for Hi-C processing?", "generate a Nextflow pipeline for my ATAC-seq data"
Understand ENCODE pipelines, generate user-specific workflows in Nextflow/WDL, and manage compute resources for local, HPC, and cloud execution.
ENCODE Uniform Analysis Pipelines
ENCODE uses standardized pipelines for each assay type, ensuring reproducibility across all datasets. All pipelines are:
- Open source: GitHub (github.com/ENCODE-DCC)
- Containerized: Docker and Singularity images
- Written in WDL: Workflow Description Language (Cromwell execution engine)
- Portable: Local, HPC (SLURM, SGE, PBS), or cloud (Google Cloud, AWS, Azure)
Pipeline Repository Map
| Assay | GitHub Repository | Primary Tools | Container |
|---|---|---|---|
| ChIP-seq | ENCODE-DCC/chip-seq-pipeline2 | BWA, MACS2, IDR | encodedcc/chip-seq-pipeline:v2.2.1 |
| ATAC-seq | ENCODE-DCC/atac-seq-pipeline | Bowtie2, MACS2, IDR | encodedcc/atac-seq-pipeline:v2.2.0 |
| RNA-seq | ENCODE-DCC/rna-seq-pipeline | STAR, RSEM | encodedcc/rna-seq-pipeline:v1.2.0 |
| DNase-seq | ENCODE-DCC/dnase-seq-pipeline | BWA, Hotspot2 | encodedcc/dnase-seq-pipeline |
| WGBS | ENCODE-DCC/dna-me-pipeline | Bismark/bwa-meth, MethylDackel | encodedcc/dna-me-pipeline |
| Hi-C | ENCODE-DCC/hic-pipeline | BWA, Juicer, HiCCUPS | encodedcc/hic-pipeline |
| scRNA-seq | ENCODE-DCC/scrna-seq-pipeline | STARsolo, Cellranger | — |
| scATAC-seq | ENCODE-DCC/scatac-seq-pipeline | Chromap, SnapATAC2 | — |
| CUT&RUN | ENCODE-DCC/cutandrun-pipeline | Bowtie2, SEACR/MACS2 | — |
Literature Foundation
| Reference | Year | Relevance | Citations |
|---|---|---|---|
| Di Tommaso et al. "Nextflow enables reproducible computational workflows" | 2017 | Nextflow workflow manager | ~2,800 |
| Ewels et al. "The nf-core framework for community-curated bioinformatics pipelines" | 2020 | nf-core community pipelines | ~1,900 |
| Kurtzer et al. "Singularity: Scientific containers for mobility of compute" | 2017 | Singularity containers for HPC | ~2,500 |
| Merkel "Docker: lightweight Linux containers for consistent development and deployment" | 2014 | Docker containerization | ~3,000 |
| ENCODE Project Consortium "Expanded encyclopaedias of DNA elements in the human and mouse genomes" | 2020 | ENCODE Phase 3 standards | ~1,200 |
| Grüning et al. "Bioconda: sustainable and comprehensive software distribution for the life sciences" | 2018 | Bioconda packaging ecosystem | ~1,400 |
Pipeline Output Types by Assay
ChIP-seq Pipeline
| Output Type | Format | Description | Use For |
|---|---|---|---|
| alignments | bam | Filtered, deduplicated | Reprocessing, visualization |
| signal of unique reads | bigWig | Unique read signal | Genome browser |
| fold change over control | bigWig | Normalized signal | Comparative visualization |
| IDR thresholded peaks | bed narrowPeak | Reproducible peaks | Peak analysis (gold standard) |
| pseudoreplicated peaks | bed narrowPeak | Single-replicate peaks | When only 1 replicate |
| optimal IDR peaks | bed narrowPeak | Pooled replicate peaks | Most complete peak set |
ATAC-seq Pipeline
| Output Type | Format | Description | Use For |
|---|---|---|---|
| alignments | bam | No-mito, deduplicated | Reprocessing |
| signal of unique reads | bigWig | Signal track | Genome browser |
| IDR thresholded peaks | bed narrowPeak | Reproducible peaks | Accessibility analysis |
| pseudoreplicated peaks | bed narrowPeak | Single-replicate | Backup peaks |
RNA-seq Pipeline
| Output Type | Format | Description | Use For |
|---|---|---|---|
| alignments | bam | STAR-aligned | Visualization, reprocessing |
| gene quantifications | tsv | Gene-level counts (RSEM) | Differential expression |
| transcript quantifications | tsv | Transcript-level counts | Isoform analysis |
| signal of unique reads | bigWig | Strand-specific signal | Genome browser |
WGBS Pipeline
| Output Type | Format | Description | Use For |
|---|---|---|---|
| alignments | bam | Bisulfite-converted | Reprocessing |
| methylation state at CpG | bed bedMethyl | Per-CpG levels | Methylation analysis |
Hi-C Pipeline
| Output Type | Format | Description | Use For |
|---|---|---|---|
| contact matrix | hic | Interaction frequencies | TAD/compartment calling |
| chromatin interactions | bedpe | Called loops | Loop analysis |
Choosing the Right Output Files
Decision Table
| Analysis Goal | File Type | Output Type | Priority |
|---|---|---|---|
| Visualization | bigWig | fold change over control (ChIP) / signal of unique reads (others) | preferred_default=True |
| Peak overlap | bed narrowPeak | IDR thresholded peaks | Highest confidence |
| Quantitative | tsv / bed | gene quantifications / methylation state | Pipeline defaults |
| Custom processing | fastq | reads | When the ENCODE pipeline does not fit the analysis |
encode_list_files(experiment_accession="ENCSR...", preferred_default=True)
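The decision table can also be encoded as a small lookup, which may help when a script has to pick an output type before calling encode_list_files. This is only an illustrative sketch: choose_output and FileChoice are hypothetical names that do not belong to any ENCODE tool, and the goal strings are simply the table's row labels in lower case.

```python
from typing import NamedTuple


class FileChoice(NamedTuple):
    """File format and ENCODE output type to request."""
    file_format: str
    output_type: str


def choose_output(goal: str, assay: str) -> FileChoice:
    """Map an analysis goal and assay to a row of the decision table.

    Hypothetical helper: it mirrors the decision table and nothing more.
    """
    goal = goal.strip().lower()
    if goal == "visualization":
        # ChIP-seq has a control-normalized track; other assays use raw signal.
        if assay == "ChIP-seq":
            return FileChoice("bigWig", "fold change over control")
        return FileChoice("bigWig", "signal of unique reads")
    if goal == "peak overlap":
        return FileChoice("bed narrowPeak", "IDR thresholded peaks")
    if goal == "quantitative":
        if assay == "RNA-seq":
            return FileChoice("tsv", "gene quantifications")
        if assay == "WGBS":
            return FileChoice("bed bedMethyl", "methylation state at CpG")
        raise ValueError(f"no quantitative output listed for assay {assay!r}")
    if goal == "custom processing":
        return FileChoice("fastq", "reads")
    raise ValueError(f"unknown analysis goal: {goal!r}")


if __name__ == "__main__":
    print(choose_output("visualization", "ATAC-seq"))
```

Raising ValueError for combinations the table does not list keeps the helper from guessing an output type for an assay it knows nothing about.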
Step 1: Assess User Compute Resources
Before generating any pipeline, check available resources:
System Check Commands
# CPU cores
nproc # Linux
sysctl -n hw.ncpu # macOS
# Memory
free -h # Linux
sysctl -n hw.memsize | awk '{print $1/1024/1024/1024 " GB"}' # macOS
# Disk space
df -h /path/to/data/
# GPU (if applicable)
nvidia-smi # NVIDIA GPU
# Note: Most ENCODE pipelines do NOT require GPU
# Docker availability
docker --version
docker info | grep "Total Memory"
# Singularity (for HPC)
singularity --version
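When the checks need to run from a script rather than by hand, part of the list above can be gathered with the Python standard library. The sketch below is a rough equivalent, not a replacement: probe_system is a hypothetical name, it only tests whether each executable is on PATH (it does not confirm that the Docker daemon is running), and it skips memory because the standard library has no portable call for it.

```python
import os
import shutil


def probe_system(data_path: str = ".") -> dict:
    """Collect CPU, free disk, and tool availability into one dict."""
    usage = shutil.disk_usage(data_path)  # named tuple: total, used, free (bytes)
    return {
        "cpus": os.cpu_count(),  # can be None if the count is undeterminable
        "disk_free_gb": round(usage.free / 1e9, 1),
        # shutil.which returns None when the executable is not on PATH.
        "docker": shutil.which("docker") is not None,
        "singularity": shutil.which("singularity") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in probe_system().items():
        print(f"{key}: {value}")
```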
Minimum Resource Requirements by Pipeline
| Pipeline | Min CPU | Min RAM | Min Disk | GPU | Time Estimate (per sample) |
|---|---|---|---|---|---|
| ChIP-seq | 4 cores | 16 GB | 50 GB | No | 2–4 hours |
| ATAC-seq | 4 cores | 16 GB | 50 GB | No | 2–4 hours |
| RNA-seq | 8 cores | 32 GB | 100 GB | No | 4–8 hours (including index build) |
| WGBS | 8 cores | 48 GB | 200 GB | No | 12–24 hours |
| Hi-C | 8 cores | 64 GB | 200 GB | No | 8–16 hours |
| scRNA-seq | 8 cores | 64 GB | 100 GB | No | 4–8 hours |
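A quick way to apply these minimums is to compare them against the numbers gathered from the system checks. The following is a minimal sketch under that assumption: MIN_REQUIREMENTS copies the CPU, RAM, and disk columns of the table, and find_shortfalls is a hypothetical helper name.

```python
from typing import List

# (min CPU cores, min RAM in GB, min disk in GB), copied from the table.
MIN_REQUIREMENTS = {
    "ChIP-seq": (4, 16, 50),
    "ATAC-seq": (4, 16, 50),
    "RNA-seq": (8, 32, 100),
    "WGBS": (8, 48, 200),
    "Hi-C": (8, 64, 200),
    "scRNA-seq": (8, 64, 100),
}


def find_shortfalls(pipeline: str, cpus: int, ram_gb: float, disk_gb: float) -> List[str]:
    """Return one message per resource below the minimum (empty list if OK)."""
    min_cpus, min_ram, min_disk = MIN_REQUIREMENTS[pipeline]
    shortfalls = []
    if cpus < min_cpus:
        shortfalls.append(f"CPU: have {cpus} cores, need {min_cpus}")
    if ram_gb < min_ram:
        shortfalls.append(f"RAM: have {ram_gb} GB, need {min_ram} GB")
    if disk_gb < min_disk:
        shortfalls.append(f"Disk: have {disk_gb} GB, need {min_disk} GB")
    return shortfalls


if __name__ == "__main__":
    # Example: a 4-core, 32 GB laptop with 100 GB free, checked against Hi-C.
    for message in find_shortfalls("Hi-C", cpus=4, ram_gb=32, disk_gb=100):
        print(message)
```

An empty list means the machine meets the table's minimums; it does not guarantee the run will be fast, since the minimums are lower bounds.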
Resource Scaling
- CPU: Alignment steps are parallelizable; doubling cores roughly halves alignment time until disk I/O or memory bandwidth becomes the limit
- RAM: Genome index loading is the bottleneck; STAR requires ~32 GB for the human genome
- Disk: FASTQ, BAM, and intermediate files together can exceed 100 GB per sample
- Network: ENCODE downloads typically run at ~50–200 MB/s on a fast connection; plan for transfer time
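The network and CPU figures above translate into rough planning numbers with simple arithmetic. The sketch below assumes a sustained transfer rate, decimal units (1 GB = 1000 MB), and perfectly linear core scaling, which real runs will not quite reach; both function names are hypothetical.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move size_gb at a sustained rate (1 GB = 1000 MB)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000 / rate_mb_per_s / 60


def ideal_alignment_hours(baseline_hours: float, baseline_cores: int, cores: int) -> float:
    """Best-case alignment time, assuming perfectly linear scaling with cores."""
    if cores <= 0 or baseline_cores <= 0:
        raise ValueError("core counts must be positive")
    return baseline_hours * baseline_cores / cores


if __name__ == "__main__":
    for rate in (50, 200):
        print(f"100 GB at {rate} MB/s: {transfer_minutes(100, rate):.0f} min")
    print(f"4 h on 4 cores, ideal on 8 cores: {ideal_alignment_hours(4, 4, 8):.1f} h")
```

For a 100 GB sample this gives roughly 33 minutes at 50 MB/s and 8 minutes at 200 MB/s, so transfer time is usually small next to the per-sample processing times but adds up across many samples.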
Step 2: Generate Custom Nextflow Workflows
When the user needs to run ENCODE-style processing, generate Nextflow workflows that mirror ENCODE pipeline logic.
Why Nextflow Over WDL
- Broader adoption: Nextflow is used by nf-core and is widely deployed on HPC clusters and cloud platforms
- Native container support: Docker, Singularity, Podman
- Cloud integration: Native executors for AWS Batch, Google Cloud Batch, and Azure Batch
- Resource management: Built-in CPU/memory/time limits per process
- Resume capability: With the -resume flag, failed runs restart from the last successfully cached step
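The resource-management point can be made concrete by rendering per-process limits into a nextflow.config file. This is a hedged sketch: render_nextflow_config is a hypothetical helper, the config it emits uses only the standard process directives (cpus, memory, time, container) plus docker.enabled, and the example values are taken from the ChIP-seq rows of the tables in this guide.

```python
def render_nextflow_config(cpus: int, memory_gb: int, time_hours: int, container: str) -> str:
    """Render a minimal nextflow.config with default limits for every process."""
    lines = [
        "process {",
        f"    cpus = {cpus}",
        f"    memory = '{memory_gb} GB'",
        f"    time = '{time_hours}h'",
        f"    container = '{container}'",
        "}",
        "docker.enabled = true",
        "",  # trailing newline at end of file
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Values below come from the ChIP-seq minimums and container tag in this guide.
    config_text = render_nextflow_config(
        cpus=4,
        memory_gb=16,
        time_hours=4,
        container="encodedcc/chip-seq-pipeline:v2.2.1",
    )
    print(config_text)
```

Settings in the process scope act as defaults for every process; individual processes can still override them with their own directives.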
Nextflow Pipeline Template
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// Pipeline parameters
params.reads = null // Input FASTQ path
params.genome =