Biopython: Sequence Analysis Toolkit
Overview
Biopython provides a comprehensive suite of modules for sequence-centric bioinformatics: reading and writing every major biological file format (FASTA, FASTQ, GenBank, GFF), querying NCBI databases programmatically, running BLAST searches and parsing results, aligning sequences pairwise or in multiple-sequence alignments, and building and visualizing phylogenetic trees. This skill focuses on analysis workflows — from NCBI data retrieval through alignment to phylogenetic inference.
For PCR primer design, restriction enzyme digestion, cloning simulation, protein structure analysis (Bio.PDB), and molecular weight/Tm calculations, see biopython-molecular-biology.
When to Use
- Download a gene family from NCBI Nucleotide/Protein, align sequences, and construct a phylogenetic tree
- Parse GenBank or GFF3 annotation files and extract CDS sequences for a set of features
- Run a BLAST search against NCBI
ntornr, filter significant hits, and fetch their full sequences - Compute pairwise sequence identities or score alignments with BLOSUM62/PAM250 matrices
- Index a large multi-FASTA or FASTQ file with
SeqIO.index()for random-access retrieval without loading all sequences into RAM - Convert between sequence formats (FASTA ↔ GenBank ↔ FASTQ ↔ PHYLIP) in a single call
- Traverse, root, prune, and annotate a Newick or Nexus phylogenetic tree programmatically
- Use pysam instead when working with SAM/BAM/CRAM alignment files and mapped reads
- Use scikit-bio instead when you need ecological diversity metrics (UniFrac, beta diversity) or ordination methods such as PCoA
- Use gget instead for quick gene lookups and one-liner NCBI queries without writing an Entrez pipeline
Prerequisites
- Python packages:
biopython,numpy,matplotlib - Optional tools: MUSCLE or ClustalW installed locally (for
Bio.Align.Applicationswrappers) - NCBI access: Set
Entrez.emailbefore any E-utilities call; obtain a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s (default is 3 req/s) - Local BLAST: BLAST+ installed separately (
conda install -c bioconda blast) for offline searches
Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v pythonfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run pythonrather than barepython.
pip install biopython numpy matplotlib
conda install -c bioconda blast # optional, for local BLAST
Quick Start
from Bio import SeqIO, Entrez
from Bio.SeqUtils import gc_fraction
# Fetch a GenBank record and display basic stats
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"ID: {record.id}")
print(f"Length: {len(record.seq)} bp")
print(f"GC content: {gc_fraction(record.seq)*100:.1f}%")
print(f"Features: {len(record.features)}")
print(f"First feature: {record.features[0].type} at {record.features[0].location}")
# ID: NM_007294.4
# Length: 7207 bp
# GC content: 47.3%
# Features: 9
Core API
Module 1: SeqIO — File Parsing and Format Conversion
Bio.SeqIO reads and writes every major sequence format (FASTA, FASTQ, GenBank, EMBL, PHYLIP, Nexus). SeqIO.parse() returns an iterator of SeqRecord objects; SeqIO.index() builds an on-disk or in-memory dictionary for large files.
from Bio import SeqIO
# Parse FASTA and FASTQ
fasta_records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"FASTA: {len(fasta_records)} sequences")
# FASTQ: access quality scores
for rec in SeqIO.parse("reads.fastq", "fastq"):
quals = rec.letter_annotations["phred_quality"]
avg_q = sum(quals) / len(quals)
print(f" {rec.id}: {len(rec.seq)} bp, mean Q={avg_q:.1f}")
break # show one example
# Parse GenBank with feature access
for rec in SeqIO.parse("chromosome.gb", "genbank"):
cdss = [f for f in rec.features if f.type == "CDS"]
print(f"{rec.id}: {len(cdss)} CDS features")
for cds in cdss[:3]:
gene = cds.qualifiers.get("gene", ["unknown"])[0]
print(f" {gene}: {cds.location}")
from Bio import SeqIO
# SeqIO.index() for random access without loading all records
# Useful for large reference FASTA files (genomes, nr database subsets)
idx = SeqIO.index("large_genome.fasta", "fasta")
print(f"Index contains {len(idx)} sequences")
# Retrieve specific sequences by ID in O(1)
target = idx["chr1"]
print(f"chr1: {len(target.seq):,} bp")
region = target.seq[1_000_000:1_001_000]
print(f"Region [1M-1M+1kb]: {region[:60]}...")
# Format conversion: GenBank → FASTA in one call
n = SeqIO.convert("annotation.gb", "genbank", "sequences.fasta", "fasta")
print(f"Converted {n} records to FASTA")
Module 2: Seq and SeqRecord — Sequence Objects and Feature Annotations
Bio.Seq represents a biological sequence with in-place operations. Bio.SeqRecord wraps a Seq with an ID, description, and a dictionary of feature annotations. Features use FeatureLocation with strand (+1, -1).
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqUtils import gc_fraction, MeltingTemp
dna = Seq("ATGAAACCCGGGTTTTAA")
print(f"GC content: {gc_fraction(dna)*100:.1f}%")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcript: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")
# Tm calculation for a short primer
primer = Seq("ATGAAACCCGGG")
tm = MeltingTemp.Tm_Wallace(primer)
print(f"Tm (Wallace): {tm:.1f}°C")
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
# Build a SeqRecord with annotated features
gene_seq = Seq("ATGAAACCCGGGTTTTAAATCGATCG" * 10)
record = SeqRecord(gene_seq, id="MY_GENE_001", description="synthetic example gene")
# Annotate a CDS feature
cds = SeqFeature(
FeatureLocation(0, 18, strand=+1),
type="CDS",
qualifiers={"gene": ["myGene"], "product": ["hypothetical protein"]}
)
record.features.append(cds)
# Extract and translate the feature
cds_seq = cds.location.extract(record.seq)
protein = cds_seq.translate(to_stop=True)
print(f"CDS: {cds_seq}")
print(f"Protein: {protein}")
# Slice record preserving feature annotations
sub = record[0:60]
print(f"Subrecord: {len(sub.seq)} bp, {len(sub.features)} features")
Module 3: Entrez — Programmatic NCBI Database Access
Bio.Entrez wraps the NCBI E-utilities API: esearch finds records matching a query, efetch retrieves full records, elink finds related records across databases, and esummary returns document summaries. Always set Entrez.email and respect the 3 req/s rate limit (10 req/s with API key).
from Bio import Entrez, SeqIO
import time
Entrez.email = "your.email@example.com"
# Entrez.api_key = "YOUR_API_KEY" # for 10 req/s
# esearch: find IDs matching a query
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND Homo sapiens[Organism] AND mRNA[Filter]", retmax=10)
search_results = Entrez.read(handle)
handle.close()
print(f"Total hits: {search_results['Count']}")
ids = search_results["IdList"]
print(f"Retrieved IDs: {ids}")
# efetch: download full GenBank records
for acc_id in ids[:3]:
handle = Entrez.efetch(db="nucleotide", id=acc_id, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f" {record.id}: {len(record.seq)} bp — {record.description[:60]}")
time.sleep(0.4) # stay within rate limit
from Bio import Entrez
import time
Entrez.email = "your.email@example.com"
# elink: cross-database links (e.g., PubMed article →