ENCODE RNA-seq Pipeline
When to Use
- User wants to run an RNA-seq processing pipeline from FASTQ to gene quantification
- User asks about "RNA-seq pipeline", "STAR alignment", "RSEM", "gene expression quantification", or "Kallisto"
- User needs to process bulk RNA-seq data with ENCODE-standard 2-pass STAR alignment
- Example queries: "process my RNA-seq FASTQs", "quantify gene expression from RNA-seq", "run STAR and RSEM on my data"
Execute the ENCODE RNA-seq processing pipeline from raw FASTQ files through splice-aware
alignment, gene/transcript quantification, and strand-specific signal track generation.
This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform
analysis standards.
Overview
RNA-seq measures transcriptome-wide gene expression by sequencing cDNA derived from
cellular RNA. The ENCODE pipeline processes RNA-seq data through quality control,
splice-aware alignment with STAR (2-pass mode), gene and transcript quantification
with RSEM, optional fast pseudoalignment with Kallisto, and generation of strand-specific
signal tracks as bigWig files.
Key design decisions: STAR 2-pass mode for maximum splice junction sensitivity; RSEM
for gene- and isoform-level quantification that accounts for multi-mapped reads; a
stranded library protocol (dUTP, rf-stranded) as the ENCODE standard; and paired-end
sequencing with a minimum of 30 million aligned reads per replicate.
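As a rough illustration of the alignment step described above, the sketch below wraps a STAR 2-pass call that also emits the transcriptome BAM needed by RSEM. The function name `run_star_2pass`, the argument order, and the default thread count are assumptions made for this example; the pipeline's actual STAR invocation may set further parameters that are not shown here, and the sketch covers paired-end input only.

```bash
# Sketch only: run_star_2pass and its arguments are hypothetical.
run_star_2pass() {
  local genome_dir="$1" read1="$2" read2="$3" prefix="$4" threads="${5:-8}"
  # --twopassMode Basic: the first pass collects splice junctions, the second
  #   pass realigns the reads with those junctions inserted into the index.
  # --quantMode TranscriptomeSAM: additionally writes alignments in
  #   transcript coordinates (Aligned.toTranscriptome.out.bam) for RSEM.
  STAR \
    --runThreadN "$threads" \
    --genomeDir "$genome_dir" \
    --readFilesIn "$read1" "$read2" \
    --readFilesCommand zcat \
    --twopassMode Basic \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode TranscriptomeSAM \
    --outFileNamePrefix "$prefix"
}
```

A call such as `run_star_2pass /ref/star_index s1_R1.fq.gz s1_R2.fq.gz results/s1_` would write both BAM files under the `results/s1_` prefix.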
Key Literature
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Dobin et al. "STAR: ultrafast universal RNA-seq aligner" | Bioinformatics | 2013 | 10.1093/bioinformatics/bts635 | Splice-aware aligner (~12,000 citations) |
| Li & Dewey "RSEM: accurate transcript quantification from RNA-Seq data" | BMC Bioinformatics | 2011 | 10.1186/1471-2105-12-323 | Gene/transcript quantification (~6,000 citations) |
| Bray et al. "Near-optimal probabilistic RNA-seq quantification" | Nature Biotechnology | 2016 | 10.1038/nbt.3519 | Fast pseudoalignment (~4,000 citations) |
| Wang et al. "RSeQC: quality control of RNA-seq experiments" | Bioinformatics | 2012 | 10.1093/bioinformatics/bts356 | RNA-seq QC suite (~3,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias of DNA elements in the human and mouse genomes" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Frankish et al. "GENCODE 2021" | Nucleic Acids Research | 2021 | 10.1093/nar/gkaa1087 | Gene annotation reference |
Pipeline Stages
FASTQ ──> FastQC / Trim Galore ──> STAR (2-pass) ──> Genome BAM + Transcriptome BAM
Genome BAM ──> Signal Track Generation (strand-specific bigWig) ──> Plus strand bigWig, Minus strand bigWig
Transcriptome BAM ──> RSEM Quantification (gene + transcript + isoform) ──> genes.results (TPM/FPKM), isoforms.results
FASTQ ──> Kallisto (optional fast pseudoalignment)
BAM + counts ──> RSeQC + MultiQC ──> Aggregated QC Report
Stage Summary
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | STAR (2-pass) | Trimmed FASTQ | Genome BAM + Transcriptome BAM | references/02-star-alignment.md |
| 3. Quantification | RSEM, Kallisto | Transcriptome BAM / FASTQ | Gene/transcript counts, TPM, FPKM | references/03-quantification.md |
| 4. Signal Tracks | bedGraphToBigWig | STAR bedGraph | Strand-specific bigWig | references/04-signal-tracks.md |
| 5. QC Metrics | RSeQC, MultiQC | BAM, counts | Strandedness, coverage, saturation | references/05-qc-metrics.md |
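The signal-track stage (stage 4) can be sketched as two steps: STAR writes strand-specific bedGraph files from the sorted genome BAM, and `bedGraphToBigWig` converts each one. The function name `make_signal_tracks`, the choice of unique-read tracks only, and the output naming are assumptions for illustration. Which of `str1`/`str2` corresponds to the plus or minus strand depends on the library protocol, so the sketch does not rename them.

```bash
# Sketch only: make_signal_tracks and the output names are hypothetical.
make_signal_tracks() {
  local bam="$1" chrom_sizes="$2" prefix="$3" strand bg
  # Step 1: STAR re-reads the coordinate-sorted BAM and writes
  # <prefix>Signal.Unique.str1.out.bg and <prefix>Signal.Unique.str2.out.bg
  # (plus the UniqueMultiple variants, which are ignored here).
  STAR \
    --runMode inputAlignmentsFromBAM \
    --inputBAMfile "$bam" \
    --outWigType bedGraph \
    --outWigStrand Stranded \
    --outWigNorm RPM \
    --outFileNamePrefix "$prefix" || return 1
  # Step 2: bedGraphToBigWig expects input sorted by chromosome, then start.
  for strand in str1 str2; do
    bg="${prefix}Signal.Unique.${strand}.out.bg"
    LC_ALL=C sort -k1,1 -k2,2n "$bg" > "${bg%.bg}.sorted.bg" || return 1
    bedGraphToBigWig "${bg%.bg}.sorted.bg" "$chrom_sizes" "${bg%.out.bg}.bw" || return 1
  done
}
```

For example, `make_signal_tracks s1_Aligned.sortedByCoord.out.bam chrom.sizes tracks/s1_` would leave `tracks/s1_Signal.Unique.str1.bw` and `tracks/s1_Signal.Unique.str2.bw`.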
Input Requirements
Required Files
- RNA-seq FASTQ: Paired-end reads (ENCODE standard; single-end supported)
- Reference genome: STAR-indexed genome with gene annotation (GRCh38 + GENCODE for human)
- Gene annotation: GENCODE GTF (v38+ for human, vM27+ for mouse)
Sample Sheet Format
sample_id,read1,read2,replicate,strandedness
SAMPLE1_rep1,rna_rep1_R1.fq.gz,rna_rep1_R2.fq.gz,1,reverse
SAMPLE1_rep2,rna_rep2_R1.fq.gz,rna_rep2_R2.fq.gz,2,reverse
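A sample sheet in this format is easy to sanity-check before launching a run. The sketch below is one possible pre-flight check, not part of the pipeline itself: the function name `validate_samplesheet`, the accepted strandedness values (`forward`, `reverse`, `unstranded`), and the rule that a FASTQ path may appear only once are all assumptions chosen for illustration.

```bash
# Sketch only: validate_samplesheet and its rules are hypothetical.
validate_samplesheet() {
  awk -F',' '
    NR == 1 {
      if ($0 != "sample_id,read1,read2,replicate,strandedness") {
        print "line 1: unexpected header"; bad = 1
      }
      next
    }
    NF != 5 { print "line " NR ": expected 5 fields, found " NF; bad = 1; next }
    $5 !~ /^(forward|reverse|unstranded)$/ {
      print "line " NR ": unknown strandedness " $5; bad = 1
    }
    {
      # The same FASTQ path listed twice usually points to a copy-paste slip.
      n = split($2 "," $3, files, ",")
      for (i = 1; i <= n; i++) {
        if (files[i] == "") continue          # read2 may be empty for single-end
        if (files[i] in seen) {
          print "line " NR ": " files[i] " already used on line " seen[files[i]]
          bad = 1
        }
        seen[files[i]] = NR
      }
    }
    END { exit bad }' "$1"
}
```

The function prints one message per problem and returns a non-zero status if any row fails, so it can gate a launch script: `validate_samplesheet samples.csv && nextflow run ...`.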
Strandedness: ENCODE uses dUTP-based stranded libraries. The resulting reads are
reverse stranded (read 2 matches the sense strand). If unknown, the pipeline will
auto-detect strandedness using RSeQC infer_experiment.py.
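To make the auto-detection step concrete, the sketch below classifies a paired-end library from the text that `infer_experiment.py` prints. In that output, the fraction reported for `"1++,1--,2+-,2-+"` counts pairs whose read 1 lies on the gene's strand (forward), and `"1+-,1-+,2++,2--"` counts pairs whose read 2 does (reverse). The function name `classify_strandedness`, the 0.9 cutoff, and the 0.2 balance margin are assumptions for illustration and may differ from what the pipeline uses.

```bash
# Sketch only: classify_strandedness and both cutoffs are hypothetical.
# Reads infer_experiment.py output (paired-end format) on stdin.
classify_strandedness() {
  awk -v cutoff="${1:-0.9}" -v balance="${2:-0.2}" '
    /1\+\+,1--,2\+-,2-\+/ { fwd = $NF }   # read 1 on the sense strand
    /1\+-,1-\+,2\+\+,2--/ { rev = $NF }   # read 2 on the sense strand
    END {
      if (fwd == "" && rev == "") { print "ambiguous"; exit }
      diff = fwd - rev
      if (diff < 0) diff = -diff
      if (fwd >= cutoff)        print "forward"
      else if (rev >= cutoff)   print "reverse"
      else if (diff <= balance) print "unstranded"
      else                      print "ambiguous"
    }'
}
```

Typical use would be `infer_experiment.py -r genes.bed -i sample.bam | classify_strandedness`; an `ambiguous` result is a cue to inspect the library rather than to guess.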
Library Strandedness
| Protocol | Strandedness | RSEM flag | Kallisto flag | Common Usage |
|---|---|---|---|---|
| dUTP (ENCODE standard) | Reverse | --strandedness reverse | --rf-stranded | Most ENCODE RNA-seq |
| SMARTer / SMART-Seq2 | Unstranded | --strandedness none | (default) | Single-cell, low-input |
| Illumina TruSeq Stranded | Reverse | --strandedness reverse | --rf-stranded | Standard bulk RNA-seq |
| Directional ligation | Forward | --strandedness forward | --fr-stranded | Some legacy protocols |
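The table above amounts to a small lookup from library strandedness to tool flags. The sketch below encodes it as a shell function; the name `strand_flag` and the lowercase strandedness labels are assumptions, while the flags themselves are copied from the table.

```bash
# Sketch only: strand_flag is a hypothetical helper mirroring the table above.
strand_flag() {
  local tool="$1" strandedness="$2"
  case "$tool:$strandedness" in
    rsem:reverse)        echo "--strandedness reverse" ;;
    rsem:forward)        echo "--strandedness forward" ;;
    rsem:unstranded)     echo "--strandedness none" ;;
    kallisto:reverse)    echo "--rf-stranded" ;;
    kallisto:forward)    echo "--fr-stranded" ;;
    kallisto:unstranded) echo "" ;;   # no flag: kallisto treats reads as unstranded
    *) echo "unsupported: $tool $strandedness" >&2; return 1 ;;
  esac
}

# Example wiring (not executed here); all file names are placeholders:
#   rsem-calculate-expression --paired-end --alignments -p 8 \
#     $(strand_flag rsem reverse) \
#     sample.Aligned.toTranscriptome.out.bam rsem_index/GRCh38 sample
#   kallisto quant -i kallisto.idx -o kallisto_out -t 8 \
#     $(strand_flag kallisto reverse) sample_R1.fq.gz sample_R2.fq.gz
```

Keeping the mapping in one place means the RSEM and Kallisto runs cannot silently disagree about strandedness for the same sample.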
QC Thresholds
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Total sequenced reads | >=30M PE reads | Read depth | ENCODE |
| Uniquely mapped reads | >=70% of total | Alignment | ENCODE |
| Multi-mapped reads | <10% | Alignment | ENCODE |
| rRNA rate | <10% | Sample quality | ENCODE |
| Strandedness agreement | >90% | Library prep | RSeQC |
| Exonic rate | >60% | Mapping quality | RSeQC |
| Gene body coverage | Relatively uniform (5'/3' bias <1.5) | RNA integrity | RSeQC |
| Duplication rate | <60% | Library complexity | Picard |
| Detected genes (TPM>1) | >12,000 (human) | Sensitivity | ENCODE |
| Saturation | Approaching plateau at sequencing depth | Depth sufficiency | RSeQC |
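The numeric thresholds above can be applied mechanically once the metrics have been collected. The sketch below reads `metric<TAB>value` lines and reports PASS or FAIL per metric. The function name `check_qc` and the metric keys (for example `uniquely_mapped_pct`) are hypothetical; only the limits come from the table. Saturation is omitted because it is judged from a curve rather than from a single number.

```bash
# Sketch only: check_qc and the metric keys are hypothetical.
# stdin: "metric<TAB>value" lines; percentages are on a 0-100 scale.
check_qc() {
  awk -F'\t' '
    function rule(metric, operator, limit) { op[metric] = operator; lim[metric] = limit }
    BEGIN {
      rule("total_reads",         ">=", 30000000)
      rule("uniquely_mapped_pct", ">=", 70)
      rule("multi_mapped_pct",    "<",  10)
      rule("rrna_pct",            "<",  10)
      rule("strandedness_pct",    ">",  90)
      rule("exonic_pct",          ">",  60)
      rule("bias_5p_3p",          "<",  1.5)
      rule("duplication_pct",     "<",  60)
      rule("detected_genes",      ">",  12000)   # human
    }
    ($1 in op) {
      value = $2 + 0
      ok = (op[$1] == ">=" && value >= lim[$1]) ||
           (op[$1] == ">"  && value >  lim[$1]) ||
           (op[$1] == "<"  && value <  lim[$1])
      printf "%s\t%s\t%s %s\t%s\n", $1, $2, op[$1], lim[$1], (ok ? "PASS" : "FAIL")
      if (!ok) failed = 1
    }
    END { exit failed }'
}
```

Metrics with unknown keys are ignored, and the function returns a non-zero status if any known metric fails, which makes it usable as a gate in a wrapper script.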
Read Depth Guidelines
| Application | Minimum Reads (PE) | Recommended | Notes |
|---|---|---|---|
| Gene-level expression | 20M | 30M | ENCODE minimum |
| Transcript-level expression | 40M | 60M | Isoform resolution requires more depth |
| Differential expression | 20M per sample | 30M per sample | 3+ biological replicates per condition |
| Novel junction discovery | 60M | 100M+ | STAR 2-pass mode benefits from depth |
| Fusion detection | 50M | 80M+ | Chimeric reads are rare |
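To check a library against these guidelines, count the records in the R1 FASTQ and compare the result with the row for the intended application. The sketch below assumes that the depths in the table refer to read pairs (so R1 alone is counted); the function names and the short application labels are hypothetical.

```bash
# Sketch only: function names and application labels are hypothetical.
count_fastq_reads() {
  # One FASTQ record spans four lines, so records = lines / 4.
  gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

depth_status() {
  local reads="$1" application="$2" min rec
  case "$application" in
    gene|differential) min=20000000; rec=30000000 ;;
    transcript)        min=40000000; rec=60000000 ;;
    junction)          min=60000000; rec=100000000 ;;
    fusion)            min=50000000; rec=80000000 ;;
    *) echo "unknown application: $application" >&2; return 1 ;;
  esac
  if [ "$reads" -lt "$min" ]; then
    echo "below_minimum"
  elif [ "$reads" -lt "$rec" ]; then
    echo "minimum_met"
  else
    echo "recommended_met"
  fi
}
```

For instance, `depth_status "$(count_fastq_reads sample_R1.fq.gz)" transcript` reports whether a sample is deep enough for isoform-level analysis.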
Execution
Quick Start (Local Docker)
nextflow run scripts/main.nf \
-profile local \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
SLURM HPC
nextflow run scripts/main.nf \
-profile slurm \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
Google Cloud
nextflow run scripts/main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 'gs://bucket/results/'
AWS Batch
nextflow run scripts/main.nf \
-profile aws \
--reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 's3://bucket/results/'
Cloud Cost Estimates
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-highmem-8 | ~$3-6 | 2-4 hours | STAR index loading dominates; preemptible recommended |
| AWS | r5.2xlarge | ~$3-6 | 2-4 hours | r-series for STAR memory; spot recommended |
| Local | 8 cores, 32GB | $0 | 3-6 hours | Docker required; STAR needs 32GB+ RAM |
| SLURM | 8 cores, 32GB | Varies | 2-4 hours | Singularity recommended |