STAR — Spliced RNA-seq Aligner
Overview
STAR (Spliced Transcripts Alignment to a Reference) aligns RNA-seq reads to a genome in a splice-aware manner, identifying novel and annotated splice junctions in a single pass. It generates coordinate-sorted BAM files compatible with samtools, IGV, deeptools, and GATK. STAR's 2-pass mode re-aligns reads using junctions discovered in the first pass, improving sensitivity for novel splice sites. With --quantMode GeneCounts, STAR simultaneously produces gene-level read count tables without requiring a separate featureCounts or HTSeq step.
When to Use
- Aligning bulk RNA-seq reads to a reference genome when downstream tools require a BAM file (variant calling, visualization, deeptools)
- Running ENCODE-compliant RNA-seq pipelines that mandate genome alignment
- Discovering novel splice junctions and alternative splicing events in the dataset
- Generating gene count tables alongside BAM alignment in a single step with --quantMode GeneCounts
- Processing long reads or reads with high mismatch rates by tuning --outFilterMismatchNmax
- Use Salmon instead when you only need transcript/gene quantification and do not need a BAM file; Salmon is 20-50× faster
Prerequisites
- Software: STAR ≥ 2.7.0 (conda or compiled binary)
- Reference files: genome FASTA + GTF annotation (same assembly)
- RAM: 30–32 GB for human/mouse genome index; 8–16 GB for smaller genomes
- Disk: ~25 GB for human genome index, ~5–10 GB per sample BAM
Check before installing: The tool may already be available in the current environment (e.g., inside a pixi/conda env). Run command -v STAR first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run STAR rather than bare STAR.
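The same check can be scripted when STAR is launched from Python. This is a minimal sketch: the function name star_command is hypothetical, and it assumes a pixi project can be recognised by a pixi.toml file in the project directory.

```python
import shutil
from pathlib import Path


def star_command(project_dir="."):
    """Return the argv prefix for invoking STAR, or None if STAR is not found."""
    project = Path(project_dir)
    # Assumption: a pixi project is identified by a pixi.toml in project_dir.
    if (project / "pixi.toml").exists() and shutil.which("pixi"):
        return ["pixi", "run", "STAR"]
    # Equivalent of `command -v STAR`: look the binary up on PATH.
    if shutil.which("STAR"):
        return ["STAR"]
    return None
```

A return value of None means the install commands below are still needed.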
# Install with conda (recommended)
conda install -c bioconda star
# Verify
STAR --version
# STAR_2.7.11a
# Or compile from source
git clone https://github.com/alexdobin/STAR
cd STAR/source && make STAR
Quick Start
# 1. Generate genome index (~30 min, run once)
STAR --runMode genomeGenerate \
--runThreadN 8 \
--genomeDir genome/star_index \
--genomeFastaFiles genome/GRCh38.fa \
--sjdbGTFfile genome/gencode.v47.gtf \
--sjdbOverhang 100 # ReadLength - 1
# 2. Align paired-end reads (~10-20 min)
mkdir -p results/sample  # make sure the output directory exists
STAR --runThreadN 8 \
--genomeDir genome/star_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample/
# 3. Index the BAM
samtools index results/sample/Aligned.sortedByCoord.out.bam
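The --sjdbOverhang value in step 1 follows the ReadLength - 1 rule, so 100 suits 101 bp reads. If the read length is not known in advance, it can be estimated from the FASTQ itself. The sketch below assumes standard 4-line FASTQ records; the helper names are hypothetical.

```python
import gzip


def max_read_length(fastq_path, n_reads=10000):
    """Return the longest read among the first n_reads records of a FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    longest = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= 4 * n_reads:
                break
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                longest = max(longest, len(line.strip()))
    return longest


def sjdb_overhang(read_length):
    """Apply the ReadLength - 1 rule used for --sjdbOverhang."""
    return max(read_length - 1, 1)
```

For trimmed libraries with variable read lengths, using the maximum observed length is the usual choice.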
Workflow
Step 1: Prepare Reference Files
Download a genome FASTA and matching GTF annotation (same assembly version).
# Download GRCh38 genome and GENCODE annotation
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v47.primary_assembly.annotation.gtf.gz
mkdir -p genome/star_index
echo "Genome and GTF ready."
ls -lh GRCh38.primary_assembly.genome.fa gencode.v47.primary_assembly.annotation.gtf
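A frequent symptom of mismatched assemblies is that the GTF uses sequence names (for example 1 instead of chr1) that do not occur in the FASTA. The following sketch compares the two sets of names on uncompressed files; the function names are hypothetical, and matching names alone do not prove that coordinates agree.

```python
def fasta_contigs(fasta_path):
    """Collect sequence names from FASTA header lines (text up to the first space)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_contigs(gtf_path):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def contigs_missing_from_fasta(fasta_path, gtf_path):
    """Return GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_contigs(gtf_path) - fasta_contigs(fasta_path))
```

An empty list is the expected result for a GENCODE FASTA and GTF from the same release.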
Step 2: Generate Genome Index
Build the STAR genome index — required once per genome/read-length combination.
# Standard human genome index (requires ~32 GB RAM)
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir genome/star_index/ \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v47.primary_assembly.annotation.gtf \
--sjdbOverhang 100
# For small genomes, scale --genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1);
# for E. coli (~4.6 Mb) this gives about 10
# STAR --runMode genomeGenerate \
# --genomeSAindexNbases 10 \
# --genomeDir genome/ecoli_index/ ...
echo "Index complete: $(ls genome/star_index/ | wc -l) files"
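For small genomes, the STAR manual gives the scaling rule for --genomeSAindexNbases as min(14, log2(GenomeLength)/2 - 1). The sketch below evaluates it. Rounding to the nearest integer is an assumption, chosen because it reproduces the manual's own examples (9 for a 1 Mb genome, 7 for a 100 kb genome); the function name is hypothetical.

```python
import math


def sa_index_nbases(genome_length):
    """Suggest --genomeSAindexNbases as min(14, log2(GenomeLength)/2 - 1)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    raw = math.log2(genome_length) / 2 - 1
    # Assumption: round to the nearest integer and never go below 1.
    return max(1, min(14, round(raw)))


print(sa_index_nbases(4_600_000))      # E. coli-sized genome
print(sa_index_nbases(3_100_000_000))  # human-sized genome, capped at 14
```

The genome length is the total number of bases in the FASTA, which can be read from a samtools faidx index.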
Step 3: Align RNA-seq Reads
Align single-end or paired-end FASTQ files to the indexed genome.
# Single-end alignment
mkdir -p results/sample1  # make sure the output directory exists
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS NM MD \
--outFileNamePrefix results/sample1/
# Paired-end alignment
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS NM MD \
--outFileNamePrefix results/sample1/
echo "BAM: results/sample1/Aligned.sortedByCoord.out.bam"
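When many samples are aligned from a script, the single-end and paired-end commands above differ only in the number of files given to --readFilesIn. The sketch below builds the argument list using only the flags shown in this step. The function name is hypothetical, and passing --readFilesCommand zcat only for .gz input is a convenience added here.

```python
def build_star_align_cmd(reads, genome_dir, out_prefix, threads=8):
    """Build a STAR alignment command for one (single-end) or two (paired-end) FASTQ files."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one FASTQ (single-end) or two (paired-end)")
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outSAMattributes", "NH", "HI", "AS", "NM", "MD",
        "--outFileNamePrefix", out_prefix,
    ]
    # zcat is only needed when the input is gzip-compressed.
    if all(path.endswith(".gz") for path in reads):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which raises an error if STAR exits with a non-zero status.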
Step 4: Run 2-Pass Alignment for Improved Sensitivity
Two-pass mode collects splice junctions from the first pass and uses them as annotation for the second pass.
# First pass: collect splice junctions
mkdir -p pass1/sample1 results/sample1  # make sure the output directories exist
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype None \
--outFileNamePrefix pass1/sample1/
# Second pass — realign with all junctions from pass 1
SJ_FILES=$(ls pass1/*/SJ.out.tab | tr '\n' ' ')
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--sjdbFileChrStartEnd $SJ_FILES \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample1/
# Alternative: single-command 2-pass
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--twopassMode Basic \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample1/
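In the manual two-pass variant, the first-pass SJ.out.tab files are sometimes filtered before being passed to --sjdbFileChrStartEnd, so that weakly supported junctions are not inserted into the second pass. The sketch below keeps unannotated junctions (column 6 equal to 0) with enough uniquely mapped reads (column 7) in at least one sample, and leaves each kept row in its original SJ.out.tab layout. The function name and the min_unique threshold of 3 are illustrative assumptions, not STAR defaults.

```python
def filter_junctions(sj_paths, min_unique=3):
    """Merge SJ.out.tab files, keeping novel junctions with >= min_unique unique reads."""
    kept = {}
    for path in sj_paths:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 9:
                    continue  # not a complete SJ.out.tab row
                chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[3]
                annotated = fields[5] == "1"   # column 6: 1 = already in the GTF
                unique_reads = int(fields[6])  # column 7: uniquely mapped reads crossing the junction
                if annotated or unique_reads < min_unique:
                    continue
                # Keep the first qualifying row seen for each junction.
                kept.setdefault((chrom, start, end, strand), line.rstrip("\n"))
    return [kept[key] for key in sorted(kept)]
```

Writing the returned rows to a single file, one per line, gives one input for --sjdbFileChrStartEnd in the second pass.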
Step 5: Check Alignment Statistics
Parse the alignment log to assess mapping rate and read quality.
# View the alignment summary
cat results/sample1/Log.final.out
# Parse key metrics with python
python3 - << 'EOF'
from pathlib import Path

log = Path("results/sample1/Log.final.out").read_text()
metrics = {}
for line in log.splitlines():
    # Metric lines have the form "<name> | <value>"
    if "|" in line:
        key, _, val = line.partition("|")
        metrics[key.strip()] = val.strip()

print(f"Unique mapping: {metrics.get('Uniquely mapped reads %', 'N/A')}")
print(f"Multi-mapping: {metrics.get('% of reads mapped to multiple loci', 'N/A')}")
print(f"Too many mismatches: {metrics.get('% of reads unmapped: too many mismatches', 'N/A')}")
print(f"Total input reads: {metrics.get('Number of input reads', 'N/A')}")
EOF
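With more than a handful of samples it is easier to scan every Log.final.out at once and list only the samples that need attention. The sketch below assumes the results/<sample>/Log.final.out layout used in this guide. The 70% cutoff is a hypothetical threshold for illustration, not a STAR or ENCODE rule, and the function names are hypothetical as well.

```python
from pathlib import Path


def parse_log_final(text):
    """Turn the "<name> | <value>" lines of Log.final.out into a dict."""
    metrics = {}
    for line in text.splitlines():
        if "|" in line:
            key, _, val = line.partition("|")
            metrics[key.strip()] = val.strip()
    return metrics


def unique_mapping_pct(metrics):
    """Return the unique mapping rate as a float, or None if it is missing."""
    value = metrics.get("Uniquely mapped reads %")
    return float(value.rstrip("%")) if value else None


def low_mapping_samples(results_dir, min_unique_pct=70.0):
    """Map sample name -> unique mapping % for samples below the (assumed) cutoff."""
    flagged = {}
    for log_path in sorted(Path(results_dir).glob("*/Log.final.out")):
        pct = unique_mapping_pct(parse_log_final(log_path.read_text()))
        if pct is None or pct < min_unique_pct:
            flagged[log_path.parent.name] = pct
    return flagged
```

Calling low_mapping_samples("results") returns an empty dict when every sample passes the cutoff.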
Step 6: Generate Gene Count Tables
Enable simultaneous gene counting during alignment using --quantMode GeneCounts.
# Align and count simultaneously
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix results/sample1/
# ReadsPerGene.out.tab has 4 columns:
# gene_id unstranded stranded_fwd stranded_rev
head results/sample1/ReadsPerGene.out.tab
# Load into pandas (select column based on library strandedness)
python3 - << 'EOF'
import pandas as pd
df = pd.read_csv(
    "results/sample1/ReadsPerGene.out.tab",
    sep="\t", header=None, skiprows=4,
    names=["gene_id", "unstranded", "fwd", "rev"],
)
# For unstranded library: use column 2 (unstranded)
counts = df.set_index("gene_id")["unstranded"]
print(f"Genes with counts > 0: {(counts > 0).sum()}")
print(counts[counts > 0].sort_values(ascending=False).head())
EOF
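If the library strandedness is unknown, the column totals in ReadsPerGene.out.tab usually make it obvious: in a stranded library one of the two stranded columns carries most of the counts. The sketch below applies that heuristic with the standard library only. The 0.8 dominance cutoff and the function names are assumptions for illustration.

```python
import csv


def column_totals(reads_per_gene_path, skiprows=4):
    """Sum each count column of ReadsPerGene.out.tab, skipping the 4 summary rows."""
    totals = {"unstranded": 0, "fwd": 0, "rev": 0}
    with open(reads_per_gene_path, newline="") as handle:
        for i, row in enumerate(csv.reader(handle, delimiter="\t")):
            if i < skiprows or len(row) < 4:
                continue
            totals["unstranded"] += int(row[1])
            totals["fwd"] += int(row[2])
            totals["rev"] += int(row[3])
    return totals


def guess_count_column(totals, dominance=0.8):
    """Pick the count column: "fwd" or "rev" if one strand dominates, else "unstranded"."""
    stranded = totals["fwd"] + totals["rev"]
    if stranded == 0:
        return "unstranded"
    fwd_share = totals["fwd"] / stranded
    if fwd_share >= dominance:
        return "fwd"
    if fwd_share <= 1 - dominance:
        return "rev"
    return "unstranded"
```

The returned name matches the column names used in the pandas example above, so it can be used directly as df.set_index("gene_id")[column].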
Key Parameters
| Parameter | Default | Range/Options | Effect |
|---|---|---|---|
| --runThreadN | 1 | 1–64 | CPU threads for alignment |
| --sjdbOverhang | 100 | Read