Prokka Genome Annotation

Overview

Prokka is a command-line pipeline for rapid annotation of prokaryotic genomes (bacteria and archaea), with additional support for viral genomes. Protein-coding genes (CDS) are predicted with Prodigal and then annotated with a tiered search that stops at the first hit: an optional genus-specific BLAST database (enabled with --usegenus), then Prokka's core protein databases (ISfinder, the NCBI antimicrobial resistance gene set, and UniProtKB/Swiss-Prot), then HMM profiles (HAMAP by default; Pfam and TIGRFAMs if installed). Non-coding RNA genes (rRNA, tRNA, tmRNA) are identified with Barrnap, Aragorn, and Infernal. Prokka processes a single FASTA assembly in minutes and writes the annotation in GFF3, GenBank, FASTA, and tabular formats.
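
The tiered strategy can be pictured as a first-hit-wins lookup over an ordered list of databases. The sketch below is a toy illustration only, not Prokka's implementation: the tier names, protein IDs, and product strings are all hypothetical, and real searches use BLAST+ and HMMER rather than dictionary lookups.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy "databases", ordered from most specific to most general.
# Keys stand in for query proteins; values are product names.
TIERS: List[Tuple[str, Dict[str, str]]] = [
    ("specific", {"protA": "DNA gyrase subunit A"}),
    ("core", {"protA": "gyrase", "protB": "ATP synthase subunit beta"}),
    ("hmm", {"protC": "ABC transporter domain protein"}),
]


def annotate(query: str, tiers=TIERS) -> Tuple[str, Optional[str]]:
    """Return (product, tier_name) from the first tier that has a hit."""
    for name, database in tiers:
        if query in database:
            return database[query], name
    # No tier matched: fall back to the generic label
    return "hypothetical protein", None


for query in ["protA", "protB", "protC", "protD"]:
    print(query, annotate(query))
```

Because the loop returns on the first match, a hit in an earlier, more specific tier hides any weaker hit in a later one, which is why the order of the tiers matters.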

When to Use

Annotating a newly assembled bacterial or archaeal genome (Illumina, PacBio, or Nanopore assemblies)
Getting functional protein annotations (CDS with product names, gene names, EC numbers, COG identifiers) from a draft or complete genome
Preparing annotation files for downstream comparative genomics (Roary pan-genome, OrthoFinder)
Annotating viral or phage genomes when kingdom-specific databases are important
Performing metagenome-assembled genome (MAG) annotation with the --metagenome flag
Parsing annotated outputs in Python with BioPython for downstream sequence or feature analysis
Use PGAP (NCBI Prokaryotic Genome Annotation Pipeline) instead when the goal is NCBI GenBank submission with standards compliance
Use Bakta instead for faster annotation with built-in NCBI-compatible outputs and a more regularly updated database

Prerequisites

Software: Prokka ≥ 1.14, Perl 5, Prodigal, Barrnap, HMMER3, BLAST+, Aragorn, Infernal, tbl2asn
Python packages (for output parsing): biopython, pandas, matplotlib
Input: assembled genome in FASTA format (complete or draft with multiple contigs)
Environment: conda strongly recommended to handle the Perl and C dependency stack

Check before installing: The tool may already be available in the current environment (e.g., inside a pixi / conda env). Run command -v prokka first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run prokka rather than bare prokka.
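
The same check can be scripted. The following is a minimal sketch, assuming that a pixi project is recognizable by a pixi.toml file in the working directory (pixi also supports other manifest locations, which this sketch ignores); the function name resolve_prokka_command is hypothetical.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def resolve_prokka_command(project_dir: str = ".") -> Optional[List[str]]:
    """Return the argv prefix for invoking Prokka, or None if it is not found."""
    # Assumption: a pixi.toml in the project directory marks a pixi project
    in_pixi_project = (Path(project_dir) / "pixi.toml").is_file()
    if in_pixi_project and shutil.which("pixi"):
        return ["pixi", "run", "prokka"]
    if shutil.which("prokka"):
        return ["prokka"]
    return None


cmd = resolve_prokka_command()
if cmd is None:
    print("prokka not found; run the install commands")
else:
    print("Invoke Prokka with:", " ".join(cmd))
```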

# Install Prokka via conda/mamba (recommended)
conda install -c conda-forge -c bioconda prokka

# Or with mamba (faster)
mamba install -c conda-forge -c bioconda prokka

# Verify installation
prokka --version
# prokka 1.14.6

# List the external tools Prokka depends on
prokka --depends
# prokka needs: awk, sed, grep, makeblastdb, blastp, hmmscan, ...

# Install Python parsing dependencies
pip install biopython pandas matplotlib

Quick Start

# Annotate a bacterial genome assembly — results in results/ directory
prokka genome.fasta \
    --outdir results/ \
    --prefix sample1 \
    --kingdom Bacteria \
    --cpus 4

# Check output summary
cat results/sample1.txt
# Organism: Genus species strain
# Contigs: 1
# Bases: 4639675
# CDS: 4140
# rRNA: 22
# tRNA: 86

echo "Annotation complete. Key output files:"
ls results/sample1.{gff,gbk,faa,ffn,tsv}
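
For scripted quality checks, the summary file can be read into a dictionary. This is a minimal sketch that assumes the file consists of "key: value" lines as in the listing above; the function name parse_prokka_summary is hypothetical, and the inline text stands in for results/sample1.txt.

```python
def parse_prokka_summary(text: str) -> dict:
    """Parse 'key: value' lines into a dict, converting numeric values to int."""
    summary = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        summary[key.strip()] = int(value) if value.isdigit() else value
    return summary


# Inline stand-in for the contents of results/sample1.txt
example = """Organism: Genus species strain
Contigs: 1
Bases: 4639675
CDS: 4140
rRNA: 22
tRNA: 86
"""

summary = parse_prokka_summary(example)
print(f"CDS per kb: {summary['CDS'] / summary['Bases'] * 1000:.2f}")
```

To use it on a real run, pass the file contents instead, for example open("results/sample1.txt").read().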

Workflow

Step 1: Install and Verify Prokka

Install Prokka and confirm all dependent tools are accessible in the current environment.

# Create a dedicated conda environment
conda create -n prokka_env -c conda-forge -c bioconda prokka python=3.10 -y
conda activate prokka_env

# Verify Prokka version and all tool dependencies
prokka --version
# prokka 1.14.6

prokka --depends
# Checking that required tools are installed...
# OK: makeblastdb is installed (2.13.0+)
# OK: blastp is installed (2.13.0+)
# OK: hmmscan is installed (3.3.2)
# OK: prodigal is installed (2.6.3)
# OK: barrnap is installed (0.9)

# List the databases bundled with Prokka (kingdoms, genera, HMMs, CMs)
prokka --listdb
# Kingdoms: Archaea  Bacteria  Mitochondria  Viruses

# Install Python parsing tools
pip install biopython pandas matplotlib
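
An independent PATH check can be done from Python. The sketch below assumes the usual executable names for the prerequisite tools (for example hmmscan for HMMER3 and cmscan for Infernal); adjust the list if your installation names them differently.

```python
import shutil

# Assumed executable names for the tools listed under Prerequisites
TOOLS = [
    "prokka", "prodigal", "barrnap", "hmmscan", "blastp",
    "makeblastdb", "aragorn", "cmscan", "tbl2asn",
]


def check_tools(tools=TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: which(tool) for tool in tools}


status = check_tools()
missing = [tool for tool, path in status.items() if path is None]
print("Missing tools:", ", ".join(missing) if missing else "none")
```

The which parameter exists only so the lookup can be swapped out in tests; in normal use the default shutil.which is sufficient.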

Step 2: Prepare the Input Genome

Clean and rename contigs to comply with Prokka's header requirements before annotation.

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Load and inspect assembly
input_fasta = "genome.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))
print(f"Input assembly: {len(records)} contigs")
total_bases = sum(len(r) for r in records)
print(f"Total bases: {total_bases:,}")
print(f"Largest contig: {max(len(r) for r in records):,} bp")
# N50 is not computed here; use an assembly statistics tool

# Rename contigs to short IDs compatible with Prokka (max 37 chars)
# Prokka requires: no spaces, no special characters in header
cleaned = []
for i, rec in enumerate(records, 1):
    new_id = f"contig_{i:04d}"
    # Build a fresh record with an empty description so the header is just the ID
    new_rec = SeqRecord(rec.seq, id=new_id, description="")
    cleaned.append(new_rec)

SeqIO.write(cleaned, "genome_clean.fasta", "fasta")
print(f"\nWrote genome_clean.fasta with {len(cleaned)} renamed contigs")
# genome_clean.fasta: contig_0001 through contig_NNNN

# Alternatively, clean headers with a simple bash one-liner
awk '/^>/{print ">contig_" ++i; next}{print}' genome.fasta > genome_clean.fasta

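# Filter out short contigs (< 200 bp) to reduce annotation noise.
# Sequence lines are joined per record first, so wrapped FASTA is measured
# by total contig length (Prokka's --mincontiglen applies the same cutoff at run time).
awk -v min=200 '
    /^>/ { if (seq != "" && length(seq) >= min) print header "\n" seq; header = $0; seq = ""; next }
    { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) print header "\n" seq }
' genome_clean.fasta > genome_filtered.fasta

echo "Filtered assembly ready: $(grep -c '^>' genome_filtered.fasta) contigs"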
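
Before handing the assembly to Prokka, it can help to compute N50 and to flag headers that would violate the constraints above. This is a sketch using only the standard library; the 37-character limit comes from the text, while the allowed character set (letters, digits, underscore, dot, hyphen) is a conservative assumption rather than Prokka's documented rule.

```python
import re


def n50(lengths):
    """Length of the contig at which the cumulative sum (largest first) reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def header_problems(contig_id, max_len=37):
    """Return a list of human-readable problems with a contig ID (empty if none)."""
    problems = []
    if len(contig_id) > max_len:
        problems.append(f"longer than {max_len} characters")
    # Assumed safe character set; anything else is reported
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", contig_id):
        problems.append("contains characters outside letters, digits, '_', '.', '-'")
    return problems


print("N50:", n50([100, 200, 300, 400]))
print(header_problems("contig_0001"))
print(header_problems("NODE 1 length=5000"))
```

Contig lengths can be taken from the records list built earlier, for example n50([len(r) for r in records]).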

Step 3: Run Basic Prokka Annotation

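Run Prokka with standard options for a bacterial genome. The --genus, --species, and --strain values label the output; the genus-specific protein database is searched only when --usegenus is also given.

# Basic annotation with genus/species labels (add --usegenus to search the genus-specific protein database first)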
prokka genome_clean.fasta \
    --outdir annotation/ \
    --prefix E_coli_K12 \
    --kingdom Bacteria \
    --genus Escherichia \
    --species coli \
    --strain K12 \
    --cpus 8 \
    --mincontiglen 200

# Expected runtime: 2–10 minutes for a typical 4–6 Mb bacterial genome

echo "Prokka annotation output files:"
ls annotation/
# E_coli_K12.err   E_coli_K12.faa   E_coli_K12.ffn
# E_coli_K12.fna   E_coli_K12.fsa   E_coli_K12.gbk
# E_coli_K12.gff   E_coli_K12.log   E_coli_K12.sqn
# E_coli_K12.tbl   E_coli_K12.tsv   E_coli_K12.txt
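
When annotating many genomes, the command can be assembled in Python and passed to subprocess. The sketch below uses only the flags shown above; the helper names build_prokka_command and run_prokka are hypothetical, and the run itself is left uncalled so the example has no side effects.

```python
import shlex
import subprocess


def build_prokka_command(fasta, outdir, prefix, kingdom="Bacteria",
                         genus=None, species=None, strain=None,
                         cpus=8, mincontiglen=200):
    """Assemble the Prokka argument list from the options used in this step."""
    cmd = [
        "prokka", fasta,
        "--outdir", outdir,
        "--prefix", prefix,
        "--kingdom", kingdom,
        "--cpus", str(cpus),
        "--mincontiglen", str(mincontiglen),
    ]
    # Optional taxonomy labels are added only when provided
    for flag, value in (("--genus", genus), ("--species", species), ("--strain", strain)):
        if value:
            cmd += [flag, value]
    return cmd


def run_prokka(cmd):
    """Run Prokka and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)


cmd = build_prokka_command(
    "genome_clean.fasta", "annotation/", "E_coli_K12",
    genus="Escherichia", species="coli", strain="K12",
)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; call run_prokka(cmd) to execute.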

Step 4: Parse Annotation Summary (TSV)

Load the TSV output for a quick overview of annotated features and their functional assignments.

import pandas as pd

# Load the annotation TSV (tab-delimited feature table)
tsv_file = "annotation/E_coli_K12.tsv"
df = pd.read_csv(tsv_file, sep="\t")
print(f"Total features: {len(df)}")
print(f"Columns: {list(df.columns)}")
# Columns: [locus_tag, ftype, length_bp, gene, EC_number, COG, product]

# Feature type summary
print("\nFeature type counts:")
print(df["ftype"].value_counts().to_string())
# CDS     4140
# tRNA      86
# rRNA      22
# tmRNA      1

# Functional gene annotations (non-hypothetical CDS)
cds_df = df[df["ftype"] == "CDS"].copy()
hypothetical = cds_df["product"].str.contains("hypothetical", case=False, na=True)
print(f"\nCDS with known function: {(~hypothetical).sum()}")
print(f"Hypothetical proteins: {hypothetical.sum()}")

# Genes with EC numbers (enzymes)
ec_annotated = cds_df[cds_df["EC_number"].notna() & (cds_df["EC_number"] != "")]
print(f"CDS with EC numbers: {len(ec_annotated)}")
print(ec_annotated[["locus_tag", "gene", "EC_number", "product"]].head(5).to_string(index=False))
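
Where pandas is not available, the same summary can be produced with the standard library. This is a sketch assuming the column names listed above; the sample rows are made up for illustration and do not come from a real annotation.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the TSV column layout described above
SAMPLE_TSV = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "X_00001\tCDS\t1404\tdnaA\t\tCOG0593\tChromosomal replication initiator protein DnaA\n"
    "X_00002\tCDS\t300\t\t\t\thypothetical protein\n"
    "X_00003\ttRNA\t76\t\t\t\ttRNA-Ala(tgc)\n"
    "X_00004\tCDS\t951\tpfkA\t2.7.1.11\tCOG0205\tATP-dependent 6-phosphofructokinase\n"
)


def summarize(handle):
    """Count feature types, hypothetical CDS, and CDS carrying an EC number."""
    # QUOTE_NONE keeps quote characters in product names from being interpreted
    rows = list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
    cds = [row for row in rows if row["ftype"] == "CDS"]
    return {
        "counts": dict(Counter(row["ftype"] for row in rows)),
        "hypothetical": sum("hypothetical" in row["product"].lower() for row in cds),
        "with_ec": sum(bool(row["EC_number"]) for row in cds),
    }


result = summarize(io.StringIO(SAMPLE_TSV))
print(result)
```

For a real run, replace the StringIO object with an open file handle, for example open("annotation/E_coli_K12.tsv", newline="").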

Step 5: Parse GenBank Output with BioPython

Read the GenBank file to access per-gene sequences, qualifiers, and feature coordinates.

from Bio import SeqIO
import pandas as pd

# Parse GenBank file
gbk_file = "annotation/E_coli_K12.gbk"
records = list(SeqIO.parse(gbk_file, "genbank"))
print(f"Contigs in GenBank: {len(records)}")

# Iterate over CDS features and extract details
rows = []
for rec in records:
    for feat in rec.features:
        if feat.type != "CDS":
            continue
        qualifiers = feat.qualifiers
        rows.append({
            "contig":      rec.id,
            "locus_tag":   qualifiers.get("locus_tag", ["?"])[0],
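            "gene":        qualifiers.get("gene", [""])[0],
            "product":     qualifiers.get("product", [""])[0],
            "start":       int(feat.location.start) + 1,  # BioPython starts are 0-based
            "end":         int(feat.location.end),
            "strand":      feat.location.strand,
        })

cds_table = pd.DataFrame(rows)
print(f"CDS features parsed: {len(cds_table)}")
print(cds_table.head().to_string(index=False))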
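
Feature coordinates and qualifiers are also present in the GFF3 output, which can be read without BioPython. The sketch below assumes standard nine-column GFF3 lines followed by a ##FASTA section holding the sequences; the sample lines and locus tags are hypothetical.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 attribute values are URL-encoded
    return {
        "contig": cols[0], "type": cols[2],
        "start": int(cols[3]), "end": int(cols[4]),  # 1-based, inclusive
        "strand": cols[6], "attributes": attributes,
    }


def iter_features(lines, feature_type="CDS"):
    """Yield parsed features of one type, stopping at the embedded FASTA section."""
    for line in lines:
        if line.startswith("##FASTA"):
            break
        feature = parse_gff3_line(line)
        if feature and feature["type"] == feature_type:
            yield feature


# Hypothetical sample in the shape of a small GFF3 file
SAMPLE = "\n".join([
    "##gff-version 3",
    "##sequence-region contig_0001 1 5000",
    "contig_0001\tProdigal:002006\tCDS\t190\t255\t.\t+\t0\tID=X_00001;locus_tag=X_00001;product=hypothetical protein",
    "contig_0001\tAragorn:001002\ttRNA\t300\t375\t.\t-\t.\tID=X_00002;locus_tag=X_00002;product=tRNA-Ala(tgc)",
    "##FASTA",
    ">contig_0001",
    "ACGT",
])

features = list(iter_features(SAMPLE.splitlines()))
print(features)
```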
