ENCODE RNA-seq Pipeline

When to Use

User wants to run an RNA-seq processing pipeline from FASTQ to gene quantification
User asks about "RNA-seq pipeline", "STAR alignment", "RSEM", "gene expression quantification", or "Kallisto"
User needs to process bulk RNA-seq data with ENCODE-standard 2-pass STAR alignment
Example queries: "process my RNA-seq FASTQs", "quantify gene expression from RNA-seq", "run STAR and RSEM on my data"

Execute the ENCODE RNA-seq processing pipeline from raw FASTQ files through splice-aware alignment, gene/transcript quantification, and strand-specific signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.

Overview

RNA-seq measures transcriptome-wide gene expression by sequencing cDNA derived from cellular RNA. The ENCODE pipeline processes RNA-seq data through quality control, splice-aware alignment with STAR (2-pass mode), gene and transcript quantification with RSEM, optional fast pseudoalignment with Kallisto, and generation of strand-specific signal tracks as bigWig files.

Key design decisions: STAR 2-pass mode for maximum splice junction sensitivity; RSEM for gene- and isoform-level quantification that accounts for multi-mapped reads; a stranded dUTP library protocol (reverse-stranded) as the ENCODE standard; and paired-end sequencing with a minimum of 30 million aligned reads per replicate.
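To make the 2-pass alignment decision concrete, here is a minimal Python sketch that assembles a STAR command line. The function name, file names, and defaults are hypothetical, and only a small subset of STAR options is shown; the parameters actually set in scripts/main.nf may differ.

```python
from typing import List, Optional


def star_two_pass_cmd(
    genome_dir: str,
    read1: str,
    read2: Optional[str] = None,
    prefix: str = "sample_",
    threads: int = 8,
) -> List[str]:
    """Assemble an argv list for a 2-pass STAR run (illustrative subset of options)."""
    reads = [read1] + ([read2] if read2 else [])  # single-end runs pass one file
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--readFilesCommand", "zcat",                  # inputs are gzip-compressed FASTQ
        "--twopassMode", "Basic",                      # second pass reuses junctions found in the first
        "--outSAMtype", "BAM", "SortedByCoordinate",   # coordinate-sorted genome BAM
        "--quantMode", "TranscriptomeSAM",             # also write a transcriptome BAM for RSEM
        "--outFileNamePrefix", prefix,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; STAR itself is not required here.
    print(" ".join(star_two_pass_cmd("star_index", "rna_R1.fq.gz", "rna_R2.fq.gz")))
```

The list can be handed to `subprocess.run(cmd, check=True)` once STAR and an index are available.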

Key Literature

Reference	Journal	Year	DOI	Relevance
Dobin et al. "STAR: ultrafast universal RNA-seq aligner"	Bioinformatics	2013	10.1093/bioinformatics/bts635	Splice-aware aligner (~12,000 citations)
Li & Dewey "RSEM: accurate transcript quantification from RNA-Seq data"	BMC Bioinformatics	2011	10.1186/1471-2105-12-323	Gene/transcript quantification (~6,000 citations)
Bray et al. "Near-optimal probabilistic RNA-seq quantification"	Nature Biotechnology	2016	10.1038/nbt.3519	Fast pseudoalignment (~4,000 citations)
Wang et al. "RSeQC: quality control of RNA-seq experiments"	Bioinformatics	2012	10.1093/bioinformatics/bts356	RNA-seq QC suite (~3,500 citations)
ENCODE Project Consortium "Expanded encyclopaedias"	Nature	2020	10.1038/s41586-020-2493-4	ENCODE Phase 3 standards
Frankish et al. "GENCODE 2021"	Nucleic Acids Research	2021	10.1093/nar/gkaa1087	Gene annotation reference

Pipeline Stages

FASTQ ──> FastQC / Trim Galore ──> STAR (2-pass) ──> Genome BAM + Transcriptome BAM
  |                                                        |              |
  |                  ┌─────────────────────────────────────┘              |
  |                  v                                                    v
  |         Signal Track Generation                              RSEM Quantification
  |          (strand-specific bigWig)                        (gene + transcript + isoform)
  |                  |                                                    |
  |                  v                                                    v
  |          Plus strand bigWig                                genes.results (TPM/FPKM)
  |          Minus strand bigWig                               isoforms.results
  |                                                                       |
  |         ┌───────────────────────────────────────────────────────────┘
  |         v
  |   Kallisto (optional fast pseudoalignment)
  |         |
  |         v
  └──> RSeQC + MultiQC ──> Aggregated QC Report

Stage Summary

Stage	Tool	Input	Output	Reference
1. QC & Trimming	FastQC, Trim Galore	Raw FASTQ	Trimmed FASTQ	references/01-qc-trimming.md
2. Alignment	STAR (2-pass)	Trimmed FASTQ	Genome BAM + Transcriptome BAM	references/02-star-alignment.md
3. Quantification	RSEM, Kallisto	Transcriptome BAM / FASTQ	Gene/transcript counts, TPM, FPKM	references/03-quantification.md
4. Signal Tracks	bedGraphToBigWig	STAR bedGraph	Strand-specific bigWig	references/04-signal-tracks.md
5. QC Metrics	RSeQC, MultiQC	BAM, counts	Strandedness, coverage, saturation	references/05-qc-metrics.md
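As an illustration of stage 4, the sketch below builds the two-step command chain for strand-specific signal tracks: STAR regenerates bedGraph files from the genome BAM, then bedGraphToBigWig converts each strand. The function name, output directory, and bigWig file names are hypothetical. Which of `str1`/`str2` corresponds to the plus or minus strand depends on library strandedness and should be checked against the STAR manual.

```python
from typing import List


def signal_track_cmds(
    genome_bam: str, chrom_sizes: str, out_dir: str = "signal"
) -> List[List[str]]:
    """Return argv lists: one STAR bedGraph step, then one bigWig conversion per strand."""
    prefix = f"{out_dir}/"
    star = [
        "STAR",
        "--runMode", "inputAlignmentsFromBAM",  # reuse an existing coordinate-sorted BAM
        "--inputBAMfile", genome_bam,
        "--outWigType", "bedGraph",
        "--outWigStrand", "Stranded",           # write separate files per strand
        "--outFileNamePrefix", prefix,
    ]
    cmds = [star]
    for strand in ("str1", "str2"):
        # STAR names stranded unique-read signal files Signal.Unique.strN.out.bg
        bedgraph = f"{prefix}Signal.Unique.{strand}.out.bg"
        bigwig = f"{prefix}unique.{strand}.bw"  # hypothetical output name
        # bedGraphToBigWig expects input sorted by chromosome, then start position
        cmds.append(["bedGraphToBigWig", bedgraph, chrom_sizes, bigwig])
    return cmds


if __name__ == "__main__":
    for cmd in signal_track_cmds("sample_Aligned.sortedByCoord.out.bam", "chrom.sizes"):
        print(" ".join(cmd))
```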

Input Requirements

Required Files

RNA-seq FASTQ: Paired-end reads (ENCODE standard; single-end supported)
Reference genome: STAR-indexed genome with gene annotation (GRCh38 + GENCODE for human)
Gene annotation: GENCODE GTF (v38+ for human, vM27+ for mouse)

Sample Sheet Format

sample_id,read1,read2,replicate,strandedness
SAMPLE1_rep1,rna_rep1_R1.fq.gz,rna_rep1_R2.fq.gz,1,reverse
SAMPLE1_rep2,rna_rep2_R1.fq.gz,rna_rep2_R2.fq.gz,2,reverse
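A sample sheet in this format is easy to get subtly wrong, for example by listing the same FASTQ file for two replicates. The following is a minimal validation sketch using only the Python standard library. The function name and the accepted strandedness vocabulary (`forward`, `reverse`, `none`, `unknown`) are assumptions; the pipeline's own validation may be stricter or different.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ["sample_id", "read1", "read2", "replicate", "strandedness"]
# Assumed vocabulary; the sample sheet above only shows "reverse".
ALLOWED_STRANDEDNESS = {"forward", "reverse", "none", "unknown"}


def validate_sample_sheet(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sheet looks sane."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    errors: List[str] = []
    seen_ids = set()
    fastq_first_seen: Dict[str, int] = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        cells = {col: (row.get(col) or "").strip() for col in REQUIRED_COLUMNS}
        if cells["sample_id"] in seen_ids:
            errors.append(f"line {line_no}: duplicate sample_id {cells['sample_id']}")
        seen_ids.add(cells["sample_id"])
        if not cells["read1"]:
            errors.append(f"line {line_no}: read1 is empty")
        if not cells["replicate"].isdigit() or int(cells["replicate"]) < 1:
            errors.append(f"line {line_no}: replicate must be a positive integer")
        if cells["strandedness"] not in ALLOWED_STRANDEDNESS:
            errors.append(f"line {line_no}: unexpected strandedness {cells['strandedness']!r}")
        for col in ("read1", "read2"):
            path = cells[col]
            if not path:
                continue  # read2 may be empty for single-end libraries
            if path in fastq_first_seen:
                errors.append(
                    f"line {line_no}: {path} already used on line {fastq_first_seen[path]}"
                )
            else:
                fastq_first_seen[path] = line_no
    return errors
```

Running the check before launching Nextflow catches copy-paste mistakes early, before any compute is spent on alignment.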

Strandedness: ENCODE uses dUTP-based stranded libraries. The resulting reads are reverse stranded (read 2 matches the sense strand). If unknown, the pipeline will auto-detect strandedness using RSeQC infer_experiment.py.
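The auto-detection step can be pictured as parsing the two "Fraction of reads explained by" lines that `infer_experiment.py` prints and comparing them. The sketch below is an illustration, not the pipeline's actual logic: the function name is hypothetical, and the 0.9 cut-off is an assumption chosen to match the >90% strandedness agreement threshold listed under QC Thresholds.

```python
import re
from typing import Dict

_FRACTION = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')


def classify_strandedness(report: str, threshold: float = 0.9) -> str:
    """Classify infer_experiment.py text output as 'forward', 'reverse', or 'none'."""
    fractions: Dict[str, float] = {}
    for pattern, value in _FRACTION.findall(report):
        # "1++,1--,2+-,2-+" (paired-end) or "++,--" (single-end): read 1 on the gene's strand
        key = "forward" if pattern.startswith(("1++", "++")) else "reverse"
        fractions[key] = float(value)
    if set(fractions) != {"forward", "reverse"}:
        raise ValueError("could not find both strand fractions in the report")
    explained = fractions["forward"] + fractions["reverse"]
    if explained == 0:
        raise ValueError("no reads could be assigned to either orientation")
    # Compare each orientation's share of the reads that could be assigned at all.
    for label in ("forward", "reverse"):
        if fractions[label] / explained >= threshold:
            return label
    return "none"


if __name__ == "__main__":
    example = (
        "This is PairEnd Data\n"
        "Fraction of reads failed to determine: 0.0200\n"
        'Fraction of reads explained by "1++,1--,2+-,2-+": 0.0300\n'
        'Fraction of reads explained by "1+-,1-+,2++,2--": 0.9500\n'
    )
    print(classify_strandedness(example))
```

A dUTP library should classify as `reverse`; a result of `none` on a library expected to be stranded is worth investigating before quantification.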

Library Strandedness

Protocol	Strandedness	RSEM flag	Kallisto flag	Common Usage
dUTP (ENCODE standard)	Reverse	`--strandedness reverse`	`--rf-stranded`	Most ENCODE RNA-seq
SMARTer / SMART-Seq2	Unstranded	`--strandedness none`	(default)	Single-cell, low-input
Illumina TruSeq Stranded	Reverse	`--strandedness reverse`	`--rf-stranded`	Standard bulk RNA-seq
Directional ligation	Forward	`--strandedness forward`	`--fr-stranded`	Some legacy protocols
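The table above translates directly into a lookup. This short sketch maps a sample's strandedness to the RSEM and Kallisto flags listed in the table. The function and dictionary names are hypothetical, and the RSEM command shown in the usage section is an illustrative subset of options with made-up file names.

```python
from typing import Dict, List

# Flags transcribed from the Library Strandedness table.
STRAND_FLAGS: Dict[str, Dict[str, List[str]]] = {
    "reverse": {"rsem": ["--strandedness", "reverse"], "kallisto": ["--rf-stranded"]},
    "forward": {"rsem": ["--strandedness", "forward"], "kallisto": ["--fr-stranded"]},
    "none": {"rsem": ["--strandedness", "none"], "kallisto": []},  # Kallisto default
}


def strand_flags(strandedness: str, tool: str) -> List[str]:
    """Return the command-line flags for a tool ('rsem' or 'kallisto')."""
    try:
        return list(STRAND_FLAGS[strandedness][tool])
    except KeyError:
        raise ValueError(f"unsupported combination: {strandedness!r}, {tool!r}") from None


if __name__ == "__main__":
    rsem_cmd = [
        "rsem-calculate-expression",
        "--bam", "--paired-end",
        *strand_flags("reverse", "rsem"),
        "-p", "8",
        "sample_Aligned.toTranscriptome.out.bam",  # transcriptome BAM from STAR
        "rsem_index",                               # hypothetical RSEM reference prefix
        "sample",                                   # output prefix
    ]
    print(" ".join(rsem_cmd))
```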

QC Thresholds

Metric	Threshold	Category	Source
Total sequenced reads	>=30M PE reads	Read depth	ENCODE
Uniquely mapped reads	>=70% of total	Alignment	ENCODE
Multi-mapped reads	<10%	Alignment	ENCODE
rRNA rate	<10%	Sample quality	ENCODE
Strandedness agreement	>90%	Library prep	RSeQC
Exonic rate	>60%	Mapping quality	RSeQC
Gene body coverage	Relatively uniform (5'/3' bias <1.5)	RNA integrity	RSeQC
Duplication rate	<60%	Library complexity	Picard
Detected genes (TPM>1)	>12,000 (human)	Sensitivity	ENCODE
Saturation	Approaching plateau at sequencing depth	Depth sufficiency	RSeQC
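The numeric thresholds above can be checked mechanically once metrics have been collected. The sketch below assumes the metrics arrive as a flat dictionary; the key names are hypothetical and would need to be mapped from the actual STAR, RSeQC, Picard, and RSEM outputs. The saturation criterion is qualitative and is left out.

```python
import operator
from typing import Dict, List

# (metric key, comparison, limit) transcribed from the QC Thresholds table.
THRESHOLDS = [
    ("total_reads_millions", ">=", 30.0),
    ("uniquely_mapped_pct", ">=", 70.0),
    ("multi_mapped_pct", "<", 10.0),
    ("rrna_pct", "<", 10.0),
    ("strandedness_agreement_pct", ">", 90.0),
    ("exonic_pct", ">", 60.0),
    ("gene_body_bias", "<", 1.5),
    ("duplication_pct", "<", 60.0),
    ("detected_genes", ">", 12000),  # human; TPM > 1
]
_OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt}


def failed_metrics(metrics: Dict[str, float]) -> List[str]:
    """Return one message per metric that misses its threshold."""
    failures = []
    for key, op, limit in THRESHOLDS:
        if key not in metrics:
            continue  # metric not measured for this sample; nothing to check
        if not _OPS[op](metrics[key], limit):
            failures.append(f"{key}={metrics[key]} (expected {op} {limit})")
    return failures


if __name__ == "__main__":
    sample = {"total_reads_millions": 42.0, "uniquely_mapped_pct": 85.0, "rrna_pct": 15.0}
    for message in failed_metrics(sample):
        print("FAIL:", message)
```

Missing metrics are skipped rather than treated as failures, which is a design choice for this sketch; a stricter check could require every key to be present.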

Read Depth Guidelines

Application	Minimum Reads (PE)	Recommended	Notes
Gene-level expression	20M	30M	30M matches the ENCODE per-replicate standard
Transcript-level expression	40M	60M	Isoform resolution requires more depth
Differential expression	20M per sample	30M per sample	3+ biological replicates per condition
Novel junction discovery	60M	100M+	STAR 2-pass mode benefits from depth
Fusion detection	50M	80M+	Chimeric reads are rare

Execution

Quick Start (Local Docker)

nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

SLURM HPC

nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

Google Cloud

nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'

AWS Batch

nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'

Cloud Cost Estimates

Platform	Instance	Cost/Sample	Time/Sample	Notes
GCP	n1-highmem-8	~$3-6	2-4 hours	STAR index loading dominates; preemptible recommended
AWS	r5.2xlarge	~$3-6	2-4 hours	r-series for STAR memory; spot recommended
Local	8 cores, 32GB	$0	3-6 hours	Docker required; STAR needs 32GB+ RAM
SLURM	8 cores, 32GB	Varies	2-4 hours	Singularity recommended
