Bakta Genome Annotation
Overview
Bakta is a command-line pipeline for rapid, standardized annotation of bacterial and archaeal genomes and plasmids. It combines Prodigal for CDS prediction, tRNAscan-SE/Aragorn/Barrnap/Infernal for non-coding RNA, PILER-CR for CRISPR detection, and a tiered DIAMOND/HMM search against a curated UniRef100 + IPS/UPS database to assign gene names, EC numbers, GO terms, and COG categories. Bakta writes NCBI-compatible outputs: GFF3, GenBank, EMBL, INSDC-formatted FASTA, a JSON summary, and a circular Circos plot. A typical 5 Mb genome takes 5–15 minutes on 8 CPUs.
When to Use
- Annotating bacterial or archaeal genome assemblies (Illumina, PacBio, Nanopore) with NCBI-compatible locus tags and product names
- Annotating plasmids and other circular replicons separately with the --plasmid and --complete flags
- Producing JSON-structured annotation outputs that can be parsed without GenBank or GFF3 detours
- Generating a publication-ready circular genome plot via the bundled bakta_plot command
- Annotating MAGs (metagenome-assembled genomes) with --meta to disable Prodigal training
- Use Prokka instead when you need viral/mitochondrial kingdoms or when you must reproduce a legacy Prokka pipeline exactly
- Use PGAP instead when submitting to NCBI GenBank with full standards compliance
- Use Bakta when you want faster runs, regularly updated UniRef-derived databases, AMRFinderPlus integration, and a JSON summary out of the box
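The flag choices in the list above can be captured in a small command builder. This is a minimal sketch, not part of Bakta: `build_bakta_command` is a hypothetical helper, the file names are placeholders, and it assumes `--plasmid` takes a plasmid name as its value (check `bakta --help` for your version).

```python
import shlex
from typing import List, Optional


def build_bakta_command(
    fasta: str,
    db: str,
    output: str,
    prefix: str,
    threads: int = 8,
    complete: bool = False,
    meta: bool = False,
    plasmid: Optional[str] = None,
) -> List[str]:
    """Assemble a Bakta argument list (hypothetical helper, not part of Bakta)."""
    if meta and complete:
        # Assumption of this sketch: a MAG is a draft bin, never a complete replicon.
        raise ValueError("--meta and --complete are not combined in this sketch")
    cmd = ["bakta", fasta, "--db", db, "--output", output,
           "--prefix", prefix, "--threads", str(threads)]
    if complete:
        cmd.append("--complete")
    if plasmid is not None:
        # Assumption: --plasmid expects a plasmid name as its value.
        cmd += ["--plasmid", plasmid]
    if meta:
        cmd.append("--meta")
    return cmd


# Placeholder inputs for a metagenome bin.
mag_cmd = build_bakta_command(
    "bin_07.fasta", "db/bakta_db_light", "mag_out/", "bin_07", meta=True
)
print(shlex.join(mag_cmd))
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.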
Prerequisites
- Software: Bakta ≥ 1.9, Python 3.8+, Prodigal, tRNAscan-SE, Aragorn, Barrnap, Infernal, DIAMOND, HMMER3, PILER-CR, BLAST+, AMRFinderPlus
- Database: Bakta DB (full ~70 GB, or light ~3 GB) downloaded once with bakta_db download
- Python packages (for output parsing): biopython, pandas, matplotlib
- Input: assembled genome in FASTA format (one or more contigs)
- Hardware: ≥ 16 GB RAM for full DB, ≥ 4 GB RAM for light DB; ≥ 8 CPUs recommended
Check before installing: The tool may already be available in the current environment (e.g., inside a pixi/conda env). Run command -v bakta first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run bakta rather than bare bakta.
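The same check can be scripted from Python. The sketch below is illustrative only: `bakta_invocation` is a hypothetical helper, and it assumes a pixi project can be recognised by a `pixi.toml` file in the project directory.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def bakta_invocation(
    project_dir: str = ".",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the argv prefix for running Bakta, or None if it must be installed."""
    # Assumption: a pixi project is recognised by a pixi.toml manifest.
    in_pixi_project = (Path(project_dir) / "pixi.toml").is_file()
    if in_pixi_project and which("pixi"):
        return ["pixi", "run", "bakta"]
    if which("bakta"):
        return ["bakta"]
    return None


# Prints None when neither pixi nor bakta is found on PATH.
print(bakta_invocation())
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` plays the role of `command -v`.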
# Install Bakta via conda/mamba (recommended)
mamba install -c conda-forge -c bioconda bakta
# Verify installation
bakta --version
# bakta 1.9.4
# Download the light database (~3 GB, faster, fewer functional hits)
bakta_db download --output db/ --type light
# Or full database (~70 GB, comprehensive UniRef100 coverage)
# bakta_db download --output db/ --type full
# Install Python parsing dependencies
pip install biopython pandas matplotlib
Quick Start
# Annotate a bacterial genome — results in results/ directory
bakta genome.fasta \
--db db/bakta_db_light \
--output results/ \
--prefix sample1 \
--threads 8
# Inspect the JSON summary for feature counts
python -c "
import json
with open('results/sample1.json') as f:
    d = json.load(f)
print('Genus:', d['genome'].get('genus'))
# Size is read from the 'stats' block when present, else from 'genome'
print('Length:', d.get('stats', d['genome']).get('size'), 'bp')
print('CDS:', sum(1 for f in d['features'] if f['type'] == 'cds'))
print('tRNA:', sum(1 for f in d['features'] if f['type'] == 'tRNA'))
"
Workflow
Step 1: Install Bakta and Download the Database
Install Bakta and prepare the reference database. The database download is one-time and reused across runs.
# Create a dedicated conda environment (avoids dependency conflicts)
mamba create -n bakta_env -c conda-forge -c bioconda bakta python=3.11 -y
mamba activate bakta_env
# Verify Bakta and its dependencies
bakta --version
# bakta 1.9.4
bakta --help | head -20
# Download the light database (sufficient for routine annotation)
mkdir -p db/
bakta_db download --output db/ --type light
# Downloads ~3 GB; expands to ~5 GB on disk
# Verify the database was extracted correctly
ls db/bakta_db_light/
# antifam.h3f bakta.db expert oric.fna pfam.h3f rfam-go.tsv ...
# (Optional) Update AMRFinderPlus DB used by Bakta for AMR gene calling
amrfinder -u
# Install Python parsing tools
pip install biopython pandas matplotlib
Step 2: Prepare the Input Assembly
Bakta requires clean FASTA headers without spaces or special characters. Pre-clean and optionally filter short contigs.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

input_fasta = "genome.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))
print(f"Input assembly: {len(records)} contigs")
total_bases = sum(len(r) for r in records)
print(f"Total bases: {total_bases:,}")
print(f"Largest contig: {max(len(r) for r in records):,} bp")

# Bakta preferred: short, alphanumeric, unique IDs
cleaned = []
for i, rec in enumerate(records, 1):
    new_id = f"contig_{i:04d}"
    # An empty description keeps the header to the bare ID
    new_rec = SeqRecord(rec.seq, id=new_id, description="")
    cleaned.append(new_rec)
SeqIO.write(cleaned, "genome_clean.fasta", "fasta")
print(f"Wrote genome_clean.fasta with {len(cleaned)} contigs")
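Renaming contigs to contig_0001, contig_0002, and so on discards the assembler's original IDs. One way to keep them traceable is to write a two-column mapping next to the cleaned FASTA. This is a sketch with hypothetical helper names (`build_id_map`, `write_id_map`) and made-up example IDs; it uses only the standard library so it can run without Biopython.

```python
import csv
import io
from typing import Iterable, List, Tuple


def build_id_map(original_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each original contig ID with its contig_NNNN replacement."""
    return [(f"contig_{i:04d}", old) for i, old in enumerate(original_ids, 1)]


def write_id_map(pairs: List[Tuple[str, str]], handle) -> None:
    """Write the mapping as a tab-separated table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["new_id", "original_id"])
    writer.writerows(pairs)


# Example IDs are invented for illustration.
pairs = build_id_map(["NODE_1_length_5021_cov_33.1", "NODE_2_length_1200_cov_8.4"])
buffer = io.StringIO()
write_id_map(pairs, buffer)
print(buffer.getvalue(), end="")
```

In a real run the IDs would come from `rec.id` for each parsed record, and the handle would be an open file rather than a `StringIO` buffer.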
# Filter out short contigs (<200 bp) which contribute little to annotation
awk 'BEGIN{RS=">"; ORS=""} NR>1 {n=split($0, a, "\n"); seq=""; for(i=2;i<=n;i++) seq=seq a[i]; if (length(seq) >= 200) print ">" $0}' \
genome_clean.fasta > genome_filtered.fasta
echo "Filtered assembly: $(grep -c '>' genome_filtered.fasta) contigs"
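For environments without awk, the same length filter can be sketched in plain Python. The helpers `read_fasta` and `filter_by_length` are hypothetical names for illustration, and the 200 bp threshold mirrors the awk command above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int = 200
) -> List[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Two toy contigs: 240 bp (kept) and 40 bp (dropped).
example = [">contig_0001\n", "ACGT" * 60 + "\n", ">contig_0002\n", "ACGT" * 10 + "\n"]
kept = filter_by_length(read_fasta(example))
print([h for h, _ in kept])
```

Passing an open file handle to `read_fasta` works the same way as the list of lines used here.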
Step 3: Run Standard Bakta Annotation
Run Bakta with genus, species, and strain hints. The command below sets the locus tag prefix explicitly with --locus-tag; if you omit it, Bakta generates a prefix automatically.
# Standard annotation for a draft bacterial genome
bakta genome_clean.fasta \
--db db/bakta_db_light \
--output annotation/ \
--prefix E_coli_K12 \
--genus Escherichia \
--species coli \
--strain K12 \
--locus-tag ECOLI \
--threads 8 \
--keep-contig-headers
# Expected runtime: 5–15 min for ~5 Mb genome on 8 CPUs (light DB)
echo "Bakta annotation outputs:"
ls annotation/
# E_coli_K12.embl E_coli_K12.faa E_coli_K12.ffn
# E_coli_K12.fna E_coli_K12.gbff E_coli_K12.gff3
# E_coli_K12.hypotheticals.faa E_coli_K12.hypotheticals.tsv
# E_coli_K12.json E_coli_K12.log E_coli_K12.png
# E_coli_K12.svg E_coli_K12.tsv E_coli_K12.txt
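Before parsing, it can help to confirm that the run wrote every file in the listing above. The following is a rough sketch: `missing_outputs` is a hypothetical helper, and the suffix list is copied from that listing, so it may differ for other Bakta versions or options.

```python
import os
import tempfile
from typing import List

# Suffixes copied from the directory listing above.
EXPECTED_SUFFIXES = [
    "embl", "faa", "ffn", "fna", "gbff", "gff3",
    "hypotheticals.faa", "hypotheticals.tsv",
    "json", "log", "png", "svg", "tsv", "txt",
]


def missing_outputs(output_dir: str, prefix: str) -> List[str]:
    """Return the expected output file names that are absent from output_dir."""
    present = set(os.listdir(output_dir))
    expected = [f"{prefix}.{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [name for name in expected if name not in present]


# Demonstration on a temporary directory that holds only two of the files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("E_coli_K12.json", "E_coli_K12.gff3"):
        open(os.path.join(demo_dir, name), "w").close()
    missing = missing_outputs(demo_dir, "E_coli_K12")

print(f"{len(missing)} of {len(EXPECTED_SUFFIXES)} files missing")
```

On a real run, call `missing_outputs("annotation/", "E_coli_K12")` and treat a non-empty result as a reason to check the .log file.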
Step 4: Parse the JSON Summary
Bakta's JSON output is the canonical, machine-readable annotation. Parse it directly for downstream pipelines.
import json
import pandas as pd
from collections import Counter
with open("annotation/E_coli_K12.json") as f:
    bakta = json.load(f)
# Genome-level metadata
genome = bakta["genome"]
# Size and GC are read from the "stats" block when present, else from "genome"
stats = bakta.get("stats", genome)
print(f"Organism: {genome.get('genus')} {genome.get('species')} {genome.get('strain')}")
print(f"Size: {stats['size']:,} bp across {len(bakta['sequences'])} sequences")
print(f"GC content: {stats['gc']:.2%}")
# Feature type counts
features = bakta["features"]
type_counts = Counter(f["type"] for f in features)
print("\nFeature counts:")
for ftype, n in sorted(type_counts.items(), key=lambda x: -x[1]):
    print(f" {ftype:>10}: {n}")
# Build a tidy CDS DataFrame
cds_rows = []
for f in features:
    if f["type"] != "cds":
        continue
    cds_rows.append({
        "locus_tag": f.get("locus", ""),
        "contig": f.get("contig", ""),
        "start": f.get("start"),
        "stop": f.get("stop"),
        "strand": f.get("strand"),
        "gene": f.get("gene", ""),
        "product": f.get("product", ""),
        "length_aa": len(f.get("aa", "")),
    })
cds_df = pd.DataFrame(cds_rows)
print(f"\nTotal CDS: {len(cds_df)}")
print(cds_df.head(5).to_string(index=False))
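A common follow-up on the parsed features is the share of CDS left without a functional assignment. The sketch below works on the same list of feature dictionaries. `hypothetical_fraction` is a hypothetical helper, the demo features are invented, and it assumes unannotated CDS carry the product string "hypothetical protein".

```python
from typing import Dict, List


def hypothetical_fraction(features: List[Dict]) -> float:
    """Return the fraction of CDS features labelled as hypothetical proteins."""
    cds = [f for f in features if f.get("type") == "cds"]
    if not cds:
        return 0.0
    # Assumption: unannotated CDS have the product "hypothetical protein".
    hypo = sum(
        1 for f in cds
        if (f.get("product") or "").lower() == "hypothetical protein"
    )
    return hypo / len(cds)


# Invented features for illustration; real ones come from bakta["features"].
demo_features = [
    {"type": "cds", "product": "DNA gyrase subunit A"},
    {"type": "cds", "product": "hypothetical protein"},
    {"type": "tRNA", "product": "tRNA-Ala"},
    {"type": "cds", "product": "hypothetical protein"},
]
print(f"{hypothetical_fraction(demo_features):.1%}")
```

A high fraction with the light database can be a hint that the full database would recover more functional hits.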
Step 5: Parse the TSV Feature Table
The TSV output is convenient for spreadsheet workflows and quick filtering.
import pandas as pd
# Bakta TSV begins with comment lines starting with '#'
df = pd.read_csv("annotation/E_coli_K12.tsv", sep="\t", comment="#",
names=["sequence_id", "type", "start", "stop", "strand",
"locus_ta