ENCODE Pipeline Guide and Custom Workflow Generation

When to Use

User wants to understand ENCODE uniform analysis pipelines or run them on their own data
User asks about "ENCODE pipeline", "Nextflow", "WDL", "processing standards", or "pipeline requirements"
User needs to generate a custom Nextflow/WDL workflow based on ENCODE pipeline specifications
User wants to know compute requirements (CPU, GPU, memory, storage) for running pipelines
Example queries: "how do I run the ENCODE ChIP-seq pipeline?", "what are the compute requirements for Hi-C processing?", "generate a Nextflow pipeline for my ATAC-seq data"

Understand ENCODE pipelines, generate user-specific workflows in Nextflow/WDL, and manage compute resources for local, HPC, and cloud execution.

ENCODE Uniform Analysis Pipelines

ENCODE uses standardized pipelines for each assay type, ensuring reproducibility across all datasets. All pipelines are:

Open source: GitHub (github.com/ENCODE-DCC)
Containerized: Docker and Singularity images
Written in WDL: Workflow Description Language (Cromwell execution engine)
Portable: Local, HPC (SLURM, SGE, PBS), or cloud (Google Cloud, AWS, Azure)

Pipeline Repository Map

Assay	GitHub Repository	Primary Tools	Container
ChIP-seq	`ENCODE-DCC/chip-seq-pipeline2`	BWA, MACS2, IDR	`encodedcc/chip-seq-pipeline:v2.2.1`
ATAC-seq	`ENCODE-DCC/atac-seq-pipeline`	Bowtie2, MACS2, IDR	`encodedcc/atac-seq-pipeline:v2.2.0`
RNA-seq	`ENCODE-DCC/rna-seq-pipeline`	STAR, RSEM	`encodedcc/rna-seq-pipeline:v1.2.0`
DNase-seq	`ENCODE-DCC/dnase-seq-pipeline`	BWA, Hotspot2	`encodedcc/dnase-seq-pipeline`
WGBS	`ENCODE-DCC/dna-me-pipeline`	Bismark/bwa-meth, MethylDackel	`encodedcc/dna-me-pipeline`
Hi-C	`ENCODE-DCC/hic-pipeline`	BWA, Juicer, HiCCUPS	`encodedcc/hic-pipeline`
scRNA-seq	`ENCODE-DCC/scrna-seq-pipeline`	STARsolo, Cellranger	—
scATAC-seq	`ENCODE-DCC/scatac-seq-pipeline`	Chromap, SnapATAC2	—
CUT&RUN	`ENCODE-DCC/cutandrun-pipeline`	Bowtie2, SEACR/MACS2	—
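
The repository map above can be captured as a small lookup, which is handy when a workflow generator has to resolve an assay name to its source repository. The following is a minimal sketch: the dictionary mirrors the table, while the names `PIPELINE_REPOS` and `clone_url` are illustrative, and the URL form assumes the repositories are hosted under github.com/ENCODE-DCC as stated earlier.

```python
# Assay -> GitHub repository, copied from the Pipeline Repository Map table.
PIPELINE_REPOS = {
    "ChIP-seq": "ENCODE-DCC/chip-seq-pipeline2",
    "ATAC-seq": "ENCODE-DCC/atac-seq-pipeline",
    "RNA-seq": "ENCODE-DCC/rna-seq-pipeline",
    "DNase-seq": "ENCODE-DCC/dnase-seq-pipeline",
    "WGBS": "ENCODE-DCC/dna-me-pipeline",
    "Hi-C": "ENCODE-DCC/hic-pipeline",
    "scRNA-seq": "ENCODE-DCC/scrna-seq-pipeline",
    "scATAC-seq": "ENCODE-DCC/scatac-seq-pipeline",
    "CUT&RUN": "ENCODE-DCC/cutandrun-pipeline",
}


def clone_url(assay: str) -> str:
    """Return an HTTPS clone URL for an assay (hypothetical helper)."""
    try:
        repo = PIPELINE_REPOS[assay]
    except KeyError:
        raise ValueError(f"No ENCODE pipeline listed for assay: {assay!r}") from None
    return f"https://github.com/{repo}.git"
```

Calling `clone_url("ATAC-seq")` yields a string that can be passed to `git clone`; an unlisted assay raises `ValueError` rather than guessing a repository.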

Literature Foundation

Reference	Year	Relevance	Citations
Di Tommaso et al. "Nextflow enables reproducible computational workflows"	2017	Nextflow workflow manager	~2,800
Ewels et al. "The nf-core framework for community-curated bioinformatics pipelines"	2020	nf-core community pipelines	~1,900
Kurtzer et al. "Singularity: Scientific containers for mobility of compute"	2017	Singularity containers for HPC	~2,500
Merkel "Docker: lightweight Linux containers for consistent development and deployment"	2014	Docker containerization	~3,000
ENCODE Project Consortium "Expanded encyclopaedias of DNA elements in the human and mouse genomes"	2020	ENCODE Phase 3 standards	~1,200
Gruening et al. "Bioconda: sustainable and comprehensive software distribution for the life sciences"	2018	Bioconda packaging ecosystem	~1,400

Pipeline Output Types by Assay

ChIP-seq Pipeline

Output Type	Format	Description	Use For
alignments	bam	Filtered, deduplicated	Reprocessing, visualization
signal of unique reads	bigWig	Unique read signal	Genome browser
fold change over control	bigWig	Normalized signal	Comparative visualization
IDR thresholded peaks	bed narrowPeak	Reproducible peaks	Peak analysis (gold standard)
pseudoreplicated peaks	bed narrowPeak	Single-replicate peaks	When only one replicate is available
optimal IDR peaks	bed narrowPeak	Pooled replicate peaks	Most complete peak set

ATAC-seq Pipeline

Output Type	Format	Description	Use For
alignments	bam	No-mito, deduplicated	Reprocessing
signal of unique reads	bigWig	Signal track	Genome browser
IDR thresholded peaks	bed narrowPeak	Reproducible peaks	Accessibility analysis
pseudoreplicated peaks	bed narrowPeak	Single-replicate	Backup peaks

RNA-seq Pipeline

Output Type	Format	Description	Use For
alignments	bam	STAR-aligned	Visualization, reprocessing
gene quantifications	tsv	Gene-level counts (RSEM)	Differential expression
transcript quantifications	tsv	Transcript-level counts	Isoform analysis
signal of unique reads	bigWig	Strand-specific signal	Genome browser

WGBS Pipeline

Output Type	Format	Description	Use For
alignments	bam	Bisulfite-converted	Reprocessing
methylation state at CpG	bed bedMethyl	Per-CpG levels	Methylation analysis

Hi-C Pipeline

Output Type	Format	Description	Use For
contact matrix	hic	Interaction frequencies	TAD/compartment calling
chromatin interactions	bedpe	Called loops	Loop analysis

Choosing the Right Output Files

Decision Table

Analysis Goal	File Type	Output Type	Priority
Visualization	bigWig	fold change over control (ChIP) / signal of unique reads (others)	Default (files flagged preferred_default=True)
Peak overlap	bed narrowPeak	IDR thresholded peaks	Highest confidence
Quantitative	tsv / bed	gene quantifications / methylation state	Pipeline defaults
Custom processing	fastq	reads	When the ENCODE pipeline doesn't fit the analysis

encode_list_files(experiment_accession="ENCSR...", preferred_default=True)
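
The decision table can also be expressed as a lookup from analysis goal to file type and output type. This is only a sketch under stated assumptions: `choose_output` is a hypothetical helper, not part of any ENCODE tooling, and the assay split for the quantitative case (tsv for RNA-seq, bed for WGBS) is taken from the per-assay output tables earlier in this guide.

```python
def choose_output(goal: str, assay: str) -> dict:
    """Map an analysis goal and assay to the file/output type in the decision table."""
    goal = goal.lower()
    if goal == "visualization":
        # ChIP-seq uses the control-normalized track; other assays use raw signal.
        output_type = (
            "fold change over control"
            if assay == "ChIP-seq"
            else "signal of unique reads"
        )
        return {"file_type": "bigWig", "output_type": output_type}
    if goal == "peak overlap":
        return {"file_type": "bed narrowPeak", "output_type": "IDR thresholded peaks"}
    if goal == "quantitative":
        if assay == "RNA-seq":
            return {"file_type": "tsv", "output_type": "gene quantifications"}
        if assay == "WGBS":
            return {"file_type": "bed", "output_type": "methylation state at CpG"}
        raise ValueError(f"No quantitative output listed for assay: {assay!r}")
    if goal == "custom processing":
        return {"file_type": "fastq", "output_type": "reads"}
    raise ValueError(f"Unknown analysis goal: {goal!r}")
```

Goals or assays outside the table raise `ValueError`, so a caller has to handle unsupported combinations explicitly instead of receiving a silent default.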

Step 1: Assess User Compute Resources

Before generating any pipeline, check available resources:

System Check Commands

# CPU cores
nproc                              # Linux
sysctl -n hw.ncpu                  # macOS

# Memory
free -h                            # Linux
sysctl -n hw.memsize | awk '{print $1/1024/1024/1024 " GB"}'  # macOS

# Disk space
df -h /path/to/data/

# GPU (if applicable)
nvidia-smi                         # NVIDIA GPU
# Note: Most ENCODE pipelines do NOT require GPU

# Docker availability
docker --version
docker info | grep "Total Memory"

# Singularity (for HPC)
singularity --version
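
The same checks can be gathered in one pass from Python, which is convenient when the results feed a workflow generator. This is a rough sketch using only the standard library; `probe_system` is an illustrative name, RAM detection relies on `os.sysconf` (a Unix-only call, so it reports `None` elsewhere), and the tool checks only confirm that a binary is on `PATH`, not that it works.

```python
import os
import shutil


def probe_system(data_path: str = ".") -> dict:
    """Collect CPU count, RAM, free disk, and container/GPU tool availability."""
    try:
        # Physical memory = page size * number of physical pages (Unix only).
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = round(ram_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # sysconf is missing or does not expose these names

    return {
        "cpus": os.cpu_count(),
        "ram_gb": ram_gb,
        "disk_free_gb": round(shutil.disk_usage(data_path).free / 1024**3, 1),
        # shutil.which only checks that the executable is on PATH.
        "docker": shutil.which("docker") is not None,
        "singularity": shutil.which("singularity") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }
```

Pass the directory that will hold the data as `data_path` so the free-space figure refers to the right filesystem.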

Minimum Resource Requirements by Pipeline

Pipeline	Min CPU	Min RAM	Min Disk	GPU	Time Estimate (per sample)
ChIP-seq	4 cores	16 GB	50 GB	No	2–4 hours
ATAC-seq	4 cores	16 GB	50 GB	No	2–4 hours
RNA-seq	8 cores	32 GB	100 GB	No	4–8 hours (including index build)
WGBS	8 cores	48 GB	200 GB	No	12–24 hours
Hi-C	8 cores	64 GB	200 GB	No	8–16 hours
scRNA-seq	8 cores	64 GB	100 GB	No	4–8 hours
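
To turn the table above into a pre-flight check, the minimums can be compared against whatever the target machine reports. The sketch below copies the values from the table; `MIN_REQUIREMENTS` and `shortfalls` are hypothetical names, and the caller is assumed to supply the measured CPU, RAM, and disk figures.

```python
# Minimums copied from the table above: (CPU cores, RAM in GB, disk in GB).
MIN_REQUIREMENTS = {
    "ChIP-seq": (4, 16, 50),
    "ATAC-seq": (4, 16, 50),
    "RNA-seq": (8, 32, 100),
    "WGBS": (8, 48, 200),
    "Hi-C": (8, 64, 200),
    "scRNA-seq": (8, 64, 100),
}


def shortfalls(pipeline: str, cpus: int, ram_gb: float, disk_gb: float) -> list:
    """Return human-readable messages for every resource below the minimum."""
    if pipeline not in MIN_REQUIREMENTS:
        raise ValueError(f"No minimum requirements listed for: {pipeline!r}")
    min_cpu, min_ram, min_disk = MIN_REQUIREMENTS[pipeline]
    problems = []
    if cpus < min_cpu:
        problems.append(f"CPU: {cpus} cores < {min_cpu} required")
    if ram_gb < min_ram:
        problems.append(f"RAM: {ram_gb} GB < {min_ram} GB required")
    if disk_gb < min_disk:
        problems.append(f"Disk: {disk_gb} GB < {min_disk} GB required")
    return problems
```

An empty list means the machine meets the listed minimums for that pipeline; these are floors, so runs with large inputs may still need more.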

Resource Scaling

CPU: Alignment steps are parallelizable; doubling cores roughly halves alignment time until disk I/O becomes the limit
RAM: Genome index loading is the bottleneck; STAR requires ~32 GB for the human genome
Disk: FASTQ + BAM + intermediate files can exceed 100 GB per sample
Network: ENCODE downloads at ~50–200 MB/s; plan for transfer time
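
The network figure translates directly into a planning estimate. The sketch below assumes the ~50–200 MB/s range quoted above and treats 1 GB as 1000 MB; `transfer_time_minutes` is an illustrative helper, and real throughput depends on the local connection.

```python
def transfer_time_minutes(size_gb: float,
                          low_mb_s: float = 50.0,
                          high_mb_s: float = 200.0) -> tuple:
    """Return (best_case, worst_case) download time in minutes.

    Assumes 1 GB = 1000 MB and a sustained rate between low_mb_s and high_mb_s.
    """
    if size_gb < 0 or low_mb_s <= 0 or high_mb_s < low_mb_s:
        raise ValueError("size must be >= 0 and rates must satisfy 0 < low <= high")
    size_mb = size_gb * 1000
    best_case = size_mb / high_mb_s / 60   # fastest rate -> shortest time
    worst_case = size_mb / low_mb_s / 60   # slowest rate -> longest time
    return best_case, worst_case
```

Under these assumptions a 100 GB download takes roughly 8 to 33 minutes, so a multi-sample experiment can spend hours on transfer before any processing starts.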

Step 2: Generate Custom Nextflow Workflows

When the user needs to run ENCODE-style processing, generate Nextflow workflows that mirror ENCODE pipeline logic.

Why Nextflow Over WDL

Broader adoption: Nextflow underpins nf-core and is widely deployed on HPC clusters and cloud platforms
Native container support: Docker, Singularity, Podman
Cloud integration: Built-in executors for AWS Batch, Google Cloud Life Sciences, and Azure Batch
Resource management: Built-in CPU/memory/time limits per process
Resume capability: Failed runs restart from the last successful step when relaunched with -resume

Nextflow Pipeline Template

#!/usr/bin/env nextflow
nextflow.enable.dsl=2

// Pipeline parameters
params.reads         = null          // Input FASTQ path
params.genome        =
