Prokka Genome Annotation
Overview
Prokka is a command-line pipeline for rapid annotation of bacterial, archaeal, and viral genomes. It uses a tiered search strategy: protein-coding genes (CDS) are predicted with Prodigal and searched first against a genus-specific database (only when --usegenus is set), then against curated kingdom-level protein databases (UniProtKB/Swiss-Prot), then against HMM profile databases (HAMAP by default; Pfam or TIGRFAMs if installed). Non-coding RNA genes (rRNA, tRNA, tmRNA) are identified with Barrnap, Aragorn, and Infernal. Prokka processes a single FASTA assembly in minutes and writes the annotation in GFF3, GenBank, FASTA, and tabular formats.
When to Use
- Annotating a newly assembled bacterial or archaeal genome from Illumina, PacBio, or Nanopore assemblies
- Getting functional protein annotations (CDS with product names, gene names, EC numbers, COG identifiers) from a draft or complete genome
- Preparing annotation files for downstream comparative genomics (Roary pan-genome, OrthoFinder)
- Annotating viral or phage genomes when kingdom-specific databases are important
- Performing metagenome-assembled genome (MAG) annotation with the --metagenome flag
- Parsing annotated outputs in Python with BioPython for downstream sequence or feature analysis
- Use PGAP (NCBI Prokaryotic Genome Annotation Pipeline) instead when the goal is NCBI GenBank submission with standards compliance
- Use Bakta instead for faster annotation with built-in NCBI-compatible outputs and a more regularly updated database
Prerequisites
- Software: Prokka ≥ 1.14, Perl 5, Prodigal, Barrnap, HMMER3, BLAST+, Aragorn, Infernal, tbl2asn
- Python packages (for output parsing): biopython, pandas, matplotlib
- Input: assembled genome in FASTA format (complete or draft with multiple contigs)
- Environment: conda strongly recommended to handle the Perl and C dependency stack
Check before installing: The tool may already be available in the current environment (e.g., inside a pixi/conda env). Run command -v prokka first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run prokka rather than bare prokka.
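The same check can be scripted. The sketch below is a minimal illustration, not part of Prokka: resolve_prokka_command is a hypothetical helper, and it assumes a pixi project can be recognised by a pixi.toml file in the project directory.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def resolve_prokka_command(
    project_dir: str = ".",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the argv prefix for running Prokka, or None if it is not installed.

    Assumption: a pixi project is detected by a pixi.toml in project_dir.
    """
    in_pixi_project = (Path(project_dir) / "pixi.toml").is_file()
    if in_pixi_project and which("pixi"):
        # Inside a pixi project, go through pixi so its environment is used.
        return ["pixi", "run", "prokka"]
    if which("prokka"):
        # Prokka is already on PATH (for example in an activated conda env).
        return ["prokka"]
    return None


cmd = resolve_prokka_command()
print("Prokka not found; install it first" if cmd is None else " ".join(cmd))
```

The which argument exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.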
# Install Prokka via conda/mamba (recommended)
conda install -c conda-forge -c bioconda prokka
# Or with mamba (faster)
mamba install -c conda-forge -c bioconda prokka
# Verify installation and database setup
prokka --version
# prokka 1.14.6
# Check that required tools are on PATH
prokka --depends
# prokka needs: awk, sed, grep, makeblastdb, blastp, hmmscan, ...
# Install Python parsing dependencies
pip install biopython pandas matplotlib
Quick Start
# Annotate a bacterial genome assembly — results in results/ directory
prokka genome.fasta \
--outdir results/ \
--prefix sample1 \
--kingdom Bacteria \
--cpus 4
# Check output summary
cat results/sample1.txt
# Organism: Genus species strain
# Contigs: 1
# Bases: 4639675
# CDS: 4140
# rRNA: 22
# tRNA: 86
echo "Annotation complete. Key output files:"
ls results/sample1.{gff,gbk,faa,ffn,tsv}
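For scripted quality checks, the summary file can be read into a dictionary. This is a minimal sketch that assumes the file consists of simple "key: value" lines like the ones shown above; parse_prokka_summary is a hypothetical helper name, and the example text is copied from the Quick Start output rather than from a real run.

```python
from typing import Dict, Union


def parse_prokka_summary(text: str) -> Dict[str, Union[int, str]]:
    """Parse 'key: value' lines; purely numeric values are converted to int."""
    summary: Dict[str, Union[int, str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or lines without a colon
        value = value.strip()
        summary[key.strip()] = int(value) if value.isdigit() else value
    return summary


# Example text mirroring the summary shown in the Quick Start
example = """Organism: Genus species strain
Contigs: 1
Bases: 4639675
CDS: 4140
rRNA: 22
tRNA: 86
"""
stats = parse_prokka_summary(example)
print(stats["CDS"], stats["Bases"])  # prints: 4140 4639675
```

On a real run, pass the contents of results/sample1.txt, for example open("results/sample1.txt").read().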
Workflow
Step 1: Install and Verify Prokka
Install Prokka and confirm all dependent tools are accessible in the current environment.
# Create a dedicated conda environment
conda create -n prokka_env -c conda-forge -c bioconda prokka python=3.10 -y
conda activate prokka_env
# Verify Prokka version and all tool dependencies
prokka --version
# prokka 1.14.6
prokka --depends
# Checking that required tools are installed...
# OK: makeblastdb is installed (2.13.0+)
# OK: blastp is installed (2.13.0+)
# OK: hmmscan is installed (3.3.2)
# OK: prodigal is installed (2.6.3)
# OK: barrnap is installed (0.9)
# List the databases bundled with Prokka (kingdom, genus, HMM, and CM)
prokka --listdb
# Kingdoms: Archaea Bacteria Mitochondria Viruses
# Install Python parsing tools
pip install biopython pandas matplotlib
Step 2: Prepare the Input Genome
Clean and rename contigs to comply with Prokka's header requirements before annotation.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
# Load and inspect assembly
input_fasta = "genome.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))
print(f"Input assembly: {len(records)} contigs")
total_bases = sum(len(r) for r in records)
print(f"Total bases: {total_bases:,}")
print(f"Largest contig: {max(len(r) for r in records):,} bp")
print("N50 approx: see assembly stats tool")
# Rename contigs to short IDs compatible with Prokka (max 37 chars)
# Prokka requires: no spaces, no special characters in the contig ID
cleaned = []
for i, rec in enumerate(records, 1):
    new_id = f"contig_{i:04d}"
    # Build a fresh record so the original header text is not carried over
    new_rec = SeqRecord(rec.seq, id=new_id, description=f"len={len(rec.seq)}")
    cleaned.append(new_rec)
SeqIO.write(cleaned, "genome_clean.fasta", "fasta")
print(f"\nWrote genome_clean.fasta with {len(cleaned)} renamed contigs")
# genome_clean.fasta: contig_0001 through contig_NNNN
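The inspection code above leaves N50 to an external assembly-stats tool. If a quick figure is enough, it can be computed from the contig lengths directly. The function below is a small sketch using the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size); the name n50 is chosen for illustration.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Total is 30, so half is 15; 10 + 8 = 18 first reaches it at the 8 bp contig.
print(n50([10, 8, 6, 4, 2]))  # prints: 8
```

With the records list loaded earlier in this step, the call would be n50(len(r) for r in records).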
# Alternatively, clean headers with a simple bash one-liner
awk '/^>/{print ">contig_" ++i; next}{print}' genome.fasta > genome_clean.fasta
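# Filter out short contigs (< 200 bp) to reduce annotation noise
# Accumulates each record's sequence first, so wrapped (multi-line) FASTA is handled
awk '/^>/ {if (length(seq) >= 200) print header "\n" seq; header=$0; seq=""; next}
     {seq = seq $0}
     END {if (length(seq) >= 200) print header "\n" seq}' \
genome_clean.fasta > genome_filtered.fasta
echo "Filtered assembly ready: $(grep -c '^>' genome_filtered.fasta) contigs"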
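Before starting a run, it can help to confirm that the cleaned headers meet the constraints described in this step. The check below is a dependency-free sketch: check_contig_ids is a hypothetical helper, the 37-character limit is the one quoted above, and the set of "safe" characters (letters, digits, underscore, dot, hyphen) is an assumption rather than Prokka's own rule.

```python
import re
from typing import Iterable, List

MAX_ID_LEN = 37  # limit quoted in this guide
SAFE_ID = re.compile(r"[A-Za-z0-9_.-]+")  # assumed conservative character set


def check_contig_ids(fasta_lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in FASTA header IDs."""
    problems: List[str] = []
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        tokens = line[1:].split()
        contig_id = tokens[0] if tokens else ""  # ID is the first header token
        if not contig_id:
            problems.append("empty contig ID")
            continue
        if len(contig_id) > MAX_ID_LEN:
            problems.append(f"{contig_id}: longer than {MAX_ID_LEN} characters")
        if not SAFE_ID.fullmatch(contig_id):
            problems.append(f"{contig_id}: contains characters outside the safe set")
        if contig_id in seen:
            problems.append(f"{contig_id}: duplicate ID")
        seen.add(contig_id)
    return problems


print(check_contig_ids([">contig_0001 len=10", "ACGT", ">bad|id", "ACGT"]))
```

To check a file, pass the open handle: with open("genome_clean.fasta") as fh: print(check_contig_ids(fh)). An empty list means no problems were found.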
Step 3: Run Basic Prokka Annotation
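Run Prokka with standard options for a bacterial genome, supplying genus, species, and strain to label the organism. These options alone do not change which databases are searched; add --usegenus to search the genus-specific protein database first.
# Basic annotation with genus/species/strain labels (add --usegenus to search the genus-specific database first)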
prokka genome_clean.fasta \
--outdir annotation/ \
--prefix E_coli_K12 \
--kingdom Bacteria \
--genus Escherichia \
--species coli \
--strain K12 \
--cpus 8 \
--mincontiglen 200
# Expected runtime: 2–10 minutes for a typical 4–6 Mb bacterial genome
echo "Prokka annotation output files:"
ls annotation/
# E_coli_K12.err E_coli_K12.faa E_coli_K12.ffn
# E_coli_K12.fna E_coli_K12.gbk E_coli_K12.gff
# E_coli_K12.log E_coli_K12.sqn E_coli_K12.tbl
# E_coli_K12.tsv E_coli_K12.txt
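When annotating many assemblies, the same command is often assembled in Python and run with subprocess. The sketch below only builds the argument list, using the flags shown in this step; build_prokka_command is a hypothetical helper and its defaults simply mirror the example above.

```python
import shlex
from typing import List, Optional


def build_prokka_command(
    fasta: str,
    outdir: str,
    prefix: str,
    kingdom: str = "Bacteria",
    genus: Optional[str] = None,
    species: Optional[str] = None,
    strain: Optional[str] = None,
    cpus: int = 8,
    mincontiglen: int = 200,
) -> List[str]:
    """Build the Prokka argv list; optional labels are added only when given."""
    cmd = [
        "prokka", fasta,
        "--outdir", outdir,
        "--prefix", prefix,
        "--kingdom", kingdom,
        "--cpus", str(cpus),
        "--mincontiglen", str(mincontiglen),
    ]
    for flag, value in (("--genus", genus), ("--species", species), ("--strain", strain)):
        if value:
            cmd += [flag, value]
    return cmd


cmd = build_prokka_command(
    "genome_clean.fasta", "annotation/", "E_coli_K12",
    genus="Escherichia", species="coli", strain="K12",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually run the annotation, pass the list to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems with unusual file names.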
Step 4: Parse Annotation Summary (TSV)
Load the TSV output for a quick overview of annotated features and their functional assignments.
import pandas as pd
# Load the annotation TSV (tab-delimited feature table)
tsv_file = "annotation/E_coli_K12.tsv"
df = pd.read_csv(tsv_file, sep="\t")
print(f"Total features: {len(df)}")
print(f"Columns: {list(df.columns)}")
# Columns: [locus_tag, ftype, length_bp, gene, EC_number, COG, product]
# Feature type summary
print("\nFeature type counts:")
print(df["ftype"].value_counts().to_string())
# CDS 4140
# tRNA 86
# rRNA 22
# tmRNA 1
# Functional gene annotations (non-hypothetical CDS)
cds_df = df[df["ftype"] == "CDS"].copy()
hypothetical = cds_df["product"].str.contains("hypothetical", case=False, na=True)
print(f"\nCDS with known function: {(~hypothetical).sum()}")
print(f"Hypothetical proteins: {hypothetical.sum()}")
# Genes with EC numbers (enzymes)
ec_annotated = cds_df[cds_df["EC_number"].notna() & (cds_df["EC_number"] != "")]
print(f"CDS with EC numbers: {len(ec_annotated)}")
print(ec_annotated[["locus_tag", "gene", "EC_number", "product"]].head(5).to_string(index=False))
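As a follow-up to the EC-number filter, enzymes can be grouped by top-level EC class (the first digit of the EC number). The sketch below uses only the standard library and the column names listed above. The sample rows are made up for illustration (the DEMO_ locus tags and lengths are not from a real Prokka run), and ec_class_counts is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Top-level Enzyme Commission classes
EC_CLASSES = {
    "1": "Oxidoreductases", "2": "Transferases", "3": "Hydrolases",
    "4": "Lyases", "5": "Isomerases", "6": "Ligases", "7": "Translocases",
}


def ec_class_counts(tsv_text: str) -> Counter:
    """Count CDS features per top-level EC class in Prokka-style TSV text."""
    counts: Counter = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row.get("ftype") != "CDS":
            continue
        ec = (row.get("EC_number") or "").strip()
        if not ec:
            continue  # CDS without an EC assignment
        counts[EC_CLASSES.get(ec.split(".")[0], "Unknown")] += 1
    return counts


# Illustrative rows only; locus tags and lengths are invented for this example.
header = ["locus_tag", "ftype", "length_bp", "gene", "EC_number", "COG", "product"]
rows = [
    ["DEMO_00001", "CDS", "645", "adk", "2.7.4.3", "", "Adenylate kinase"],
    ["DEMO_00002", "CDS", "996", "gapA", "1.2.1.12", "", "Glyceraldehyde-3-phosphate dehydrogenase A"],
    ["DEMO_00003", "CDS", "1650", "pgi", "5.3.1.9", "", "Glucose-6-phosphate isomerase"],
    ["DEMO_00004", "CDS", "300", "", "", "", "hypothetical protein"],
    ["DEMO_00005", "tRNA", "76", "", "", "", "tRNA-Ala"],
]
sample_tsv = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
counts = ec_class_counts(sample_tsv)
print(dict(counts))
```

On real output, pass the contents of annotation/E_coli_K12.tsv in place of sample_tsv.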
Step 5: Parse GenBank Output with BioPython
Read the GenBank file to access per-gene sequences, qualifiers, and feature coordinates.
from Bio import SeqIO
import pandas as pd
# Parse GenBank file
gbk_file = "annotation/E_coli_K12.gbk"
records = list(SeqIO.parse(gbk_file, "genbank"))
print(f"Contigs in GenBank: {len(records)}")
# Iterate over CDS features and extract details
rows = []
for rec in records:
    for feat in rec.features:
        if feat.type != "CDS":
            continue
        qualifiers = feat.qualifiers
        rows.append({
            "contig": rec.id,
            "locus_tag": qualifiers.get("locus_tag", ["?"])[0],
            "gene": qualifiers.get("gene", [""]