Roary Pan-Genome Pipeline
Overview
Roary is a high-throughput pan-genome pipeline for prokaryotes that takes per-sample GFF3 annotations (typically from Prokka or Bakta) and produces a clustered gene presence/absence matrix across the entire input set. It first reduces redundancy with CD-HIT iterative clustering, then performs an all-vs-all BLASTP within each pre-cluster, and finally applies MCL graph clustering to define orthologous gene families. The output partitions the gene space into core (≥ 99 %), soft-core (95–99 %), shell (15–95 %), and cloud (< 15 %) genes and optionally builds a concatenated core-gene alignment suitable for phylogenetic inference.
When to Use
- Computing a pan-genome from a set of bacterial isolate annotations (10–10,000 genomes)
- Producing a
gene_presence_absence.csvmatrix for downstream GWAS, accessory-gene mining, or core-gene phylogenetics - Building a concatenated core-gene multi-FASTA alignment for ML/Bayesian phylogenetic trees
- Generating a pan-genome reference FASTA to use as a non-redundant gene catalog
- Comparative genomics across closely related strains where >95 % nucleotide identity is expected
- Use Panaroo instead when assemblies are highly fragmented or annotations are noisy (Panaroo aggressively cleans annotation errors)
- Use PIRATE instead when paralog-aware clustering with multiple identity thresholds is needed
- Use PPanGGOLiN instead when graph-based, statistically grounded gene-family partitioning is preferred over fixed-frequency cutoffs
Prerequisites
- Software: Roary ≥ 3.13, Perl 5, BLAST+, CD-HIT, MCL, BEDTools, PRANK or MAFFT (for
-ecore alignment), FastTree (optional) - Python packages (for output parsing):
pandas,matplotlib,seaborn,biopython,dendropy - Input: per-sample GFF3 files with embedded FASTA at the end (Prokka/Bakta default output)
- Hardware: ≥ 8 GB RAM for ~50 genomes; 32 GB+ recommended for ~500 genomes
- Naming: each GFF3 filename becomes the sample column header in the output matrix; use
sample_id.gffstyle
Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v roaryfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run roaryrather than bareroary.
# Install Roary via conda/mamba (recommended)
mamba install -c conda-forge -c bioconda roary
# Verify installation
roary --version
# 3.13.0
# Verify dependent tools
which cd-hit blastp mcl bedtools mafft
# /opt/conda/bin/cd-hit
# /opt/conda/bin/blastp
# /opt/conda/bin/mcl
# /opt/conda/bin/bedtools
# /opt/conda/bin/mafft
# Install Python parsing dependencies
pip install pandas matplotlib seaborn biopython dendropy
Quick Start
# Run Roary on all GFF3 files in current directory; emit core gene alignment
roary -e --mafft -p 8 -o pangenome -f roary_out/ *.gff
# Inspect summary statistics
cat roary_out/summary_statistics.txt
# Core genes (99% <= strains <= 100%) 2823
# Soft core genes (95% <= strains < 99%) 78
# Shell genes (15% <= strains < 95%) 1542
# Cloud genes ( 0% <= strains < 15%) 3104
# Total genes 7547
Workflow
Step 1: Install Roary and Verify Dependencies
Install Roary in a dedicated environment to avoid Perl module conflicts.
# Create a dedicated conda environment
mamba create -n roary_env -c conda-forge -c bioconda roary mafft fasttree python=3.11 -y
mamba activate roary_env
# Verify Roary
roary --version
# 3.13.0
# Confirm pipeline dependencies are reachable
for tool in cd-hit blastp mcl bedtools mafft FastTree; do
if command -v $tool >/dev/null; then
echo "OK: $tool"
else
echo "MISSING: $tool"
fi
done
# Install Python parsing tools
pip install pandas matplotlib seaborn biopython dendropy
Step 2: Prepare Per-Sample GFF3 Files from Prokka or Bakta
Roary requires GFF3 files with an embedded FASTA section (the ##FASTA block). Both Prokka .gff and Bakta .gff3 outputs satisfy this format.
# Verify GFF3 files include the embedded FASTA section
mkdir -p roary_input/
cp annotations/*/sample*.gff roary_input/
for GFF in roary_input/*.gff; do
if grep -q "^##FASTA" "$GFF"; then
echo "OK: $GFF"
else
echo "MISSING ##FASTA in: $GFF"
fi
done
# Roary uses GFF filename (without .gff) as the sample column name.
# Rename or symlink to ensure unique, descriptive names.
cd roary_input/
ls *.gff | head
# strain_A.gff strain_B.gff strain_C.gff
# Sanity-check sample size and unique names
from pathlib import Path
gff_files = sorted(Path("roary_input/").glob("*.gff"))
print(f"GFF files: {len(gff_files)}")
names = [g.stem for g in gff_files]
duplicates = [n for n in names if names.count(n) > 1]
assert not duplicates, f"Duplicate sample names: {set(duplicates)}"
# Show first 5
for g in gff_files[:5]:
size_mb = g.stat().st_size / 1e6
print(f" {g.name}: {size_mb:.1f} MB")
Step 3: Run Roary with Core Gene Alignment
The -e --mafft combination produces the concatenated core-gene alignment used downstream for phylogenetics.
# Run Roary on the prepared GFF directory
# -e: extract core gene alignment per gene, then concatenate
# --mafft: use MAFFT for alignment (faster than default PRANK)
# -p 8: 8 parallel processes
# -i 95: min BLASTP percentage identity (default 95)
# -cd 99: % isolates a gene must be in to be "core" (default 99)
roary -e --mafft \
-p 8 \
-i 95 \
-cd 99 \
-o pangenome \
-f roary_out/ \
roary_input/*.gff
# Expected runtime:
# ~50 genomes: 10–20 min
# ~500 genomes: 4–8 hours
ls roary_out/
# accessory_binary_genes.fa gene_presence_absence.csv
# accessory_binary_genes.fa.newick gene_presence_absence.Rtab
# accessory.header.embl number_of_conserved_genes.Rtab
# accessory.tab number_of_genes_in_pan_genome.Rtab
# blast_identity_frequency.Rtab number_of_new_genes.Rtab
# clustered_proteins number_of_unique_genes.Rtab
# core_accessory.header.embl pan_genome_reference.fa
# core_accessory.tab summary_statistics.txt
# core_gene_alignment.aln
Step 4: Parse the Gene Presence/Absence Matrix
The CSV output is the canonical pan-genome matrix: rows are gene families, columns include metadata and one column per sample (filled with locus tags or empty).
import pandas as pd
import numpy as np
df = pd.read_csv("roary_out/gene_presence_absence.csv", low_memory=False)
print(f"Gene families: {len(df)}")
print(f"Columns: {len(df.columns)}")
# Standard metadata columns end with 'Inference', then sample columns follow
metadata_cols = list(df.columns[:14])
sample_cols = list(df.columns[14:])
print(f"Samples: {len(sample_cols)}")
# Build a binary presence/absence matrix (1 = locus tag present, 0 = empty)
binary = df[sample_cols].notna().astype(int)
binary.index = df["Gene"]
print(f"Binary matrix shape: {binary.shape}")
# Pan-genome partition by frequency
n_samples = len(sample_cols)
freq = binary.sum(axis=1) / n_samples
core = freq[freq >= 0.99]
soft_core = freq[(freq >= 0.95) & (freq < 0.99)]
shell = freq[(freq >= 0.15) & (freq < 0.95)]
cloud = freq[freq < 0.15]
print(f"\nPan-genome partition (n={n_samples} genomes):")
print(f" Core (>=99 %): {len(core):>5}")
print(f" Soft-core (95–99 %): {len(soft_core):>5}")
print(f" Shell (15–95 %): {len(shell):>5}")
print(f" Cloud (<15 %): {len(cloud):>5}")
print(f" Total : {len(freq):>5}")
Step 5: Visualize the Pan-Genome Frequency Distribution
A frequency histogram exposes whether the input set has a U-shaped (open) or core-skewed (closed) pan-genome.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d