samtools — SAM/BAM/CRAM Alignment Toolkit
Overview
samtools is the standard command-line toolkit for processing sequence alignment files in SAM, BAM, and CRAM formats. It handles the complete alignment file lifecycle: format conversion, coordinate sorting, index creation, quality control statistics, read filtering, duplicate marking, and multi-file merging. samtools is a near-universal component of NGS pipelines between alignment (STAR, BWA) and downstream analysis (variant calling, peak calling, coverage).
When to Use
- Sorting BAM files by coordinate after alignment (required before indexing)
- Indexing sorted BAM files for random access and region queries
- Converting between SAM, BAM, and CRAM formats to save storage
- Generating alignment QC metrics: mapping rates, insert sizes, per-chromosome stats
- Filtering reads by mapping quality, FLAG bits, or genomic regions
- Marking or removing PCR duplicates before variant calling
- Merging multiple BAM files from different lanes or samples
- Calculating per-base depth or coverage breadth for target regions
- Use
pysaminstead for Python-native BAM manipulation in custom scripts - Use
deeptools bamCoverageinstead when you need normalized bigWig coverage tracks - Use
mosdepthinstead for whole-genome per-base depth (faster, parallelized)
Prerequisites
- Installation: samtools 1.17+ recommended
- Input requirements: SAM/BAM/CRAM files; CRAM requires FASTA reference
- Companion tools:
samtools faidxfor FASTA indexing;samtools sortbeforesamtools index
Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v samtoolsfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run samtoolsrather than baresamtools.
# Bioconda (recommended)
conda install -c bioconda samtools
# Homebrew (macOS)
brew install samtools
# Verify
samtools --version | head -1
Quick Start
# Typical post-alignment workflow: sort → index → QC
samtools sort -@ 8 -o sorted.bam input.bam
samtools index sorted.bam
samtools flagstat sorted.bam
Core API
Module 1: BAM/SAM I/O and Format Conversion
Convert between SAM/BAM/CRAM formats and extract subsets.
# SAM → BAM (saves ~75% disk space)
samtools view -b -h input.sam -o output.bam
# BAM → CRAM (saves additional 40-50%)
samtools view -C -T reference.fa input.bam -o output.cram
# Filter: mapping quality ≥20, exclude unmapped (-F 4)
samtools view -q 20 -F 4 input.bam -o filtered.bam
# Extract specific region (requires index)
samtools view -h sorted.bam "chr1:1000000-2000000" -o region.bam
# Count reads matching filter
samtools view -c -F 4 input.bam
# Output: 45231923 (number of mapped reads)
# Extract reads as FASTQ (for realignment or de novo assembly)
samtools fastq -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -0 unpaired.fastq.gz input.bam
# Extract reads as FASTA
samtools fasta input.bam > reads.fasta
# Filter by read group
samtools view -r SAMPLE_001 multi_rg.bam -o sample001.bam
Module 2: Sorting and Indexing
Organize BAM files for efficient random access.
# Sort by coordinate (required before indexing)
samtools sort -@ 8 -m 2G input.bam -o sorted.bam
# Sort by read name (required for fixmate/markdup)
samtools sort -n -@ 8 input.bam -o namesorted.bam
# Index sorted BAM (creates sorted.bam.bai)
samtools index sorted.bam
# For chromosomes > 512 Mbp: use CSI index instead
samtools index -c sorted.bam
# Group reads by name (fast, for fixmate — no full sort needed)
samtools collate -o collated.bam input.bam
Module 3: Quality Control and Statistics
Generate alignment QC metrics and coverage reports.
# Quick summary: total, mapped, paired, properly paired
samtools flagstat sorted.bam
# Example output:
# 50000000 + 0 in total (QC-passed reads + QC-failed reads)
# 48523111 + 0 mapped (97.05% : N/A)
# 50000000 + 0 paired in sequencing
# 48490234 + 0 properly paired (96.98% : N/A)
# Per-chromosome mapped/unmapped read counts
samtools idxstats sorted.bam
# chr1 248956422 12345678 0
# chr2 242193529 11234567 0
# Comprehensive stats (insert sizes, GC content, base quality)
samtools stats -r reference.fa sorted.bam > full_stats.txt
grep "^SN" full_stats.txt | cut -f2,3 # Summary Numbers only
# Coverage report (min/max/mean per region/chromosome)
samtools coverage sorted.bam
# Per-base read depth for specific regions
samtools depth -b target_regions.bed sorted.bam > depth.txt
# Output: chr pos depth (e.g., chr1 1000 45)
# Statistics split by read group
samtools stats -S RG sorted.bam > per_rg_stats.txt
Module 4: Read Filtering and FLAG Operations
Filter reads using SAM FLAG bits for specific subsets.
# FLAG reference — common masks:
# 1 = paired 4 = unmapped
# 2 = proper pair 8 = mate unmapped
# 16 = reverse strand 64 = R1 (first in pair)
# 128 = R2 256= secondary alignment
# 1024 = PCR duplicate 2048= supplementary
# Extract properly paired, mapped reads (FLAG 2 set, 4 unset)
samtools view -f 2 -F 4 sorted.bam -o proper_pairs.bam
# Extract R1 reads only
samtools view -f 64 sorted.bam -o R1.bam
# Remove secondary and supplementary alignments
samtools view -F 2304 sorted.bam -o primary.bam
# Extract reads from BED file regions
samtools view -L regions.bed -b sorted.bam -o regions.bam
Module 5: Duplicate Handling
Mark or remove PCR duplicates before variant calling.
# Full duplicate marking workflow (collate → fixmate → sort → markdup)
samtools collate -@ 8 -o collated.bam input.bam
samtools fixmate -m -@ 8 collated.bam fixmated.bam
samtools sort -@ 8 -o sorted.bam fixmated.bam
samtools markdup -@ 8 sorted.bam marked.bam
samtools index marked.bam
# Check duplication rate
samtools flagstat marked.bam | grep "duplicates"
# Output: 2345678 + 0 duplicates (4.83%)
# NovaSeq optical duplicate detection (2500 pixel distance)
samtools markdup -d 2500 sorted.bam marked_novaseq.bam
# Remove duplicates instead of marking
samtools markdup -r sorted.bam deduped.bam
# Get duplication stats without writing output
samtools markdup -s sorted.bam /dev/null
Module 6: Multi-file Operations and Region Analysis
Merge BAM files and perform region-level analysis.
# Merge multiple BAM files (all must be sorted)
samtools merge -@ 8 merged.bam lane1.bam lane2.bam lane3.bam
# Merge files listed in a text file (one per line)
samtools merge -b bam_list.txt -@ 8 merged.bam
# Merge with read group tags from filenames
samtools merge -r merged.bam sample1.bam sample2.bam
# Extract specific chromosome region from merged output
samtools view -h merged.bam chr1 -b -o chr1.bam
Key Concepts
SAM FLAG Bits
FLAGS encode read properties as a sum of bit values. Common filtering patterns:
| Common Filter | -f (require) | -F (exclude) | Selects |
|---|---|---|---|
| Mapped reads | — | 4 | All aligned reads |
| Proper pairs | 2 | — | Properly paired, both mapped |
| Unique primary | — | 2308 | No secondary/supplementary/duplicate |
| R1 only | 64 | — | First-in-pair reads |
| Unmapped | 4 | — | Failed to align |
CRAM vs BAM vs SAM
| Format | Size | Speed | Requires |
|---|---|---|---|
| SAM | ~10× BAM | Slow I/O | Nothing |
| BAM | 1× | Fast | .bai index for random access |
| CRAM | ~0.6× BAM | Slightly slower | Reference FASTA + index |
Use CRAM for long-term storage; BAM for active analysis.
Common Workflows
Workflow 1: Post-Alignment QC and Preparation
Goal: Convert aligner output to analysis-ready BAM with QC metrics.
#!/bin/bash
SAMPLE="sample_001"
REF="reference.fa"
THREADS=8
# 1. Sort and index (aligner often outputs unsorted SAM/BAM)
samtools sort -@ $THREADS -o ${SAMPLE}.sorted.bam