Bulk RNA-seq
Overview
This skill orchestrates a complete, defensible bulk RNA-seq differential-expression study, from raw sequencing reads to enriched pathways and figures. It is a router, not a reimplementation: most stages already have dedicated skills in this repo, and this skill connects them in the right order, fills the one real gap (raw reads → a gene-level counts matrix), and enforces the design and QC decisions that determine whether the final result is trustworthy.
"Defensible" means three things, applied throughout:
- Reproducible — pinned pipeline/tool versions, containers where possible, recorded parameters, fixed random seeds.
- Quality-gated — QC is inspected and acted on before, during, and after quantification, not skipped.
- Statistically sound — adequate replication, a design that matches the biology, counts handled correctly, and FDR-controlled testing.
The pipeline is: FastQC/trim → align/quant (STAR/Salmon) → counts → DE (pydeseq2) → enrichment (pathway-enrichment) → figures.
When to Use This Skill
Use this skill when the user wants to:
- Go from FASTQ files (or a sequencing run) to differentially expressed genes and pathways.
- Run or configure
nf-core/rnaseq, or align/quantify with STAR, Salmon, or featureCounts. - Turn Salmon/STAR/featureCounts output into a counts matrix ready for DESeq2/PyDESeq2.
- Design or sanity-check a bulk RNA-seq experiment (replicates, batch, strandedness) before committing compute.
- Scope an end-to-end RNA-seq analysis and decide which tools and skills to chain.
This is bulk RNA-seq (samples = biological specimens). For single-cell/nuclei data use scanpy; for the DE statistics alone use pydeseq2; for enrichment alone use pathway-enrichment.
The Pipeline at a Glance
flowchart TD
fastq["Raw FASTQ + samplesheet"] --> qc["FastQC + MultiQC"]
qc --> trim["Trim: fastp / Trim Galore"]
trim --> align["Align + quant: STAR and/or Salmon"]
align --> counts["Gene-level counts matrix"]
counts --> de["Differential expression"]
de --> enrich["Pathway / GSEA enrichment"]
de --> fig["Figures"]
enrich --> fig
nfcore["nf-core/rnaseq via nextflow skill"] -.->|"path A"| align
manual["Standalone recipes (this skill)"] -.->|"path B"| align
bridge["build_counts_matrix.py (this skill)"] -.-> counts
pydeseq2skill["pydeseq2 skill"] -.-> de
pwskill["pathway-enrichment skill"] -.-> enrich
vizskill["scientific-visualization skill"] -.-> fig
Two Upstream Paths — Pick One
The reads → counts stage can be run two ways. They produce equivalent gene counts; choose by context, then stay on that path.
Use Path A — nf-core/rnaseq when… | Use Path B — standalone tools when… |
|---|---|
| You want the field-standard, audited, citable pipeline with one command | You have a few samples and want to learn/inspect each step |
| Many samples, or you'll scale to HPC/cloud | No Nextflow/containers available, or a constrained environment |
| Reproducibility and a full MultiQC report matter most | You need a non-standard step the pipeline doesn't expose |
→ Drive it through the nextflow skill | → Follow references/upstream-manual.md |
When unsure, prefer Path A: nf-core/rnaseq already wires together FastQC → trimming → STAR/Salmon → quantification → tximport → MultiQC with sensible, reviewed defaults, which is the most defensible option. Path B exists for transparency and constrained setups.
Both paths converge on a gene-level counts matrix, after which the workflow is identical.
Setup
# This skill's glue (bridge + handoffs) — Python
uv pip install pytximport pandas
# Downstream skills install their own deps:
# pydeseq2 skill -> uv pip install pydeseq2
# pathway-enrichment skill -> uv pip install gseapy gprofiler-official
# Path A (nf-core): only Nextflow + a container engine are needed — see the `nextflow` skill.
# Path B (standalone tools): install via bioconda. Pin versions for reproducibility.
conda create -n rnaseq -c bioconda -c conda-forge \
fastqc fastp trim-galore "star=2.7.11b" "salmon=1.10.3" subread multiqc
Record the exact versions you use (pipeline revision, tool versions, reference genome + annotation release) — they belong in the methods section and make the analysis reproducible.
Quick Start
Path A — nf-core/rnaseq (recommended)
# 0. Validate the samplesheet first (catches the most common failures early)
python scripts/validate_samplesheet.py --samplesheet samplesheet.csv
# 1. Smoke-test the environment with tiny bundled data
nextflow run nf-core/rnaseq -r 3.26.0 -profile test,docker --outdir test_results
# 2. Real run: pin the revision, pick an aligner, pass a samplesheet + reference
nextflow run nf-core/rnaseq -r 3.26.0 \
-profile docker \
--input samplesheet.csv \
--genome GRCh38 \
--aligner star_salmon \
--outdir results \
-resume
nf-core/rnaseq runs tximport internally, so gene counts come out already merged — no bridge script needed. Use results/star_salmon/salmon.merged.gene_counts_length_scaled.tsv for DE. Samplesheet format, aligner choice, and outputs: references/upstream-nfcore.md. For engine/HPC/cloud/container detail, use the nextflow skill.
Path B — standalone STAR/Salmon (abbreviated)
fastqc -o qc/ reads/*.fastq.gz # 1. QC raw reads
fastp -i s1_R1.fq.gz -I s1_R2.fq.gz \
-o s1_R1.trim.fq.gz -O s1_R2.trim.fq.gz \
--thread 4 -j s1.fastp.json # 2. Trim adapters/low-quality
salmon quant -i salmon_index -l A \
-1 s1_R1.trim.fq.gz -2 s1_R2.trim.fq.gz \
--gcBias --seqBias -p 8 -o quant/s1 # 3. Quantify (per sample)
Full recipes (FastQC, fastp/Trim Galore, STAR index+align+--quantMode GeneCounts, Salmon decoy-aware index, featureCounts, strandedness): references/upstream-manual.md.
Counts → DE → enrichment (both paths)
# Path B only: assemble a gene x sample counts matrix + metadata template for PyDESeq2
python scripts/build_counts_matrix.py --from salmon \
--quant-dir quant/ --tx2gene tx2gene.tsv --output-dir counts/
# Then hand off (see the dedicated skills):
# pydeseq2: counts.csv + metadata.csv -> DE table (log2FC, padj, stat)
# pathway-enrichment: rank by `stat` (GSEA) or padj+|LFC| hit list (ORA)
# scientific-visualization / matplotlib: volcano, MA, heatmap, PCA, enrichment dotplot
Stage-by-Stage Workflow
Work top to bottom. Each stage names the skill or file that owns the detail. Don't skip the design/QC stages — they are where bulk RNA-seq studies most often go wrong.
- Design & sample sheet. Confirm ≥3 biological replicates per group, identify batch/confounders, and choose the comparison(s). Build the samplesheet and validate it with
scripts/validate_samplesheet.py. Rationale and rules:references/design-and-qc.md. - Raw-read QC. FastQC per file; aggregate with MultiQC. Check per-base quality, adapter content, duplication, and over-representation. Thresholds:
references/design-and-qc.md. - Trimming. Remove adapters and low-quality tails (via
fastporTrim Galore). Re-run FastQC to confirm. Recipes:references/upstream-manual.md(Path A does this for you). - Align / quantify. STAR (genome alignment +
--quantMode GeneCounts) and/or Salmon (transcript quasi-mapping, decoy-aware). Determine strandedness — it is easy to get wrong and silently halves your counts. Detail:references/upstream-manual.md; pipeline params:references/upstream-nfcore.md. - Build the counts matrix. Turn quant output into a gene × sample integer matrix and a metadata template (
scripts/build_counts_matrix.py). The estimated-count and gene-ID-mapping nuances live inreferences/counts-and-handoff.md. - **Differential expression →
pydeseq2