Biopython: Sequence Analysis Toolkit

Overview

Biopython provides a comprehensive suite of modules for sequence-centric bioinformatics: reading and writing every major biological file format (FASTA, FASTQ, GenBank, GFF), querying NCBI databases programmatically, running BLAST searches and parsing results, aligning sequences pairwise or in multiple-sequence alignments, and building and visualizing phylogenetic trees. This skill focuses on analysis workflows — from NCBI data retrieval through alignment to phylogenetic inference.

For PCR primer design, restriction enzyme digestion, cloning simulation, protein structure analysis (Bio.PDB), and molecular weight/Tm calculations, see biopython-molecular-biology.

When to Use

Download a gene family from NCBI Nucleotide/Protein, align sequences, and construct a phylogenetic tree
Parse GenBank or GFF3 annotation files and extract CDS sequences for a set of features
Run a BLAST search against NCBI nt or nr, filter significant hits, and fetch their full sequences
Compute pairwise sequence identities or score alignments with BLOSUM62/PAM250 matrices
Index a large multi-FASTA or FASTQ file with SeqIO.index() for random-access retrieval without loading all sequences into RAM
Convert between sequence formats (FASTA ↔ GenBank ↔ FASTQ ↔ PHYLIP) in a single call
Traverse, root, prune, and annotate a Newick or Nexus phylogenetic tree programmatically
Use pysam instead when working with SAM/BAM/CRAM alignment files and mapped reads
Use scikit-bio instead when you need ecological diversity metrics (UniFrac, beta diversity) or ordination methods such as PCoA
Use gget instead for quick gene lookups and one-liner NCBI queries without writing an Entrez pipeline

Prerequisites

Python packages: biopython, numpy, matplotlib
Optional tools: MUSCLE or ClustalW installed locally (for Bio.Align.Applications wrappers)
NCBI access: Set Entrez.email before any E-utilities call; obtain a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s (default is 3 req/s)
Local BLAST: BLAST+ installed separately (conda install -c bioconda blast) for offline searches

Check before installing: The tool may already be available in the current environment (e.g., inside a pixi / conda env). Run command -v python first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run python rather than bare python.

pip install biopython numpy matplotlib
conda install -c bioconda blast  # optional, for local BLAST

Quick Start

from Bio import SeqIO, Entrez
from Bio.SeqUtils import gc_fraction

# Fetch a GenBank record and display basic stats
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Length: {len(record.seq)} bp")
print(f"GC content: {gc_fraction(record.seq)*100:.1f}%")
print(f"Features: {len(record.features)}")
print(f"First feature: {record.features[0].type} at {record.features[0].location}")
# ID: NM_007294.4
# Length: 7207 bp
# GC content: 47.3%
# Features: 9

Core API

Module 1: SeqIO — File Parsing and Format Conversion

Bio.SeqIO reads and writes every major sequence format (FASTA, FASTQ, GenBank, EMBL, PHYLIP, Nexus). SeqIO.parse() returns an iterator of SeqRecord objects; SeqIO.index() builds an on-disk or in-memory dictionary for large files.

from Bio import SeqIO

# Parse FASTA and FASTQ
fasta_records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"FASTA: {len(fasta_records)} sequences")

# FASTQ: access quality scores
for rec in SeqIO.parse("reads.fastq", "fastq"):
    quals = rec.letter_annotations["phred_quality"]
    avg_q = sum(quals) / len(quals)
    print(f"  {rec.id}: {len(rec.seq)} bp, mean Q={avg_q:.1f}")
    break  # show one example

# Parse GenBank with feature access
for rec in SeqIO.parse("chromosome.gb", "genbank"):
    cdss = [f for f in rec.features if f.type == "CDS"]
    print(f"{rec.id}: {len(cdss)} CDS features")
    for cds in cdss[:3]:
        gene = cds.qualifiers.get("gene", ["unknown"])[0]
        print(f"  {gene}: {cds.location}")

from Bio import SeqIO

# SeqIO.index() for random access without loading all records
# Useful for large reference FASTA files (genomes, nr database subsets)
idx = SeqIO.index("large_genome.fasta", "fasta")
print(f"Index contains {len(idx)} sequences")

# Retrieve specific sequences by ID in O(1)
target = idx["chr1"]
print(f"chr1: {len(target.seq):,} bp")
region = target.seq[1_000_000:1_001_000]
print(f"Region [1M-1M+1kb]: {region[:60]}...")

# Format conversion: GenBank → FASTA in one call
n = SeqIO.convert("annotation.gb", "genbank", "sequences.fasta", "fasta")
print(f"Converted {n} records to FASTA")

Module 2: Seq and SeqRecord — Sequence Objects and Feature Annotations

Bio.Seq represents a biological sequence with in-place operations. Bio.SeqRecord wraps a Seq with an ID, description, and a dictionary of feature annotations. Features use FeatureLocation with strand (+1, -1).

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqUtils import gc_fraction, MeltingTemp

dna = Seq("ATGAAACCCGGGTTTTAA")
print(f"GC content: {gc_fraction(dna)*100:.1f}%")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcript: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")

# Tm calculation for a short primer
primer = Seq("ATGAAACCCGGG")
tm = MeltingTemp.Tm_Wallace(primer)
print(f"Tm (Wallace): {tm:.1f}°C")

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Build a SeqRecord with annotated features
gene_seq = Seq("ATGAAACCCGGGTTTTAAATCGATCG" * 10)
record = SeqRecord(gene_seq, id="MY_GENE_001", description="synthetic example gene")

# Annotate a CDS feature
cds = SeqFeature(
    FeatureLocation(0, 18, strand=+1),
    type="CDS",
    qualifiers={"gene": ["myGene"], "product": ["hypothetical protein"]}
)
record.features.append(cds)

# Extract and translate the feature
cds_seq = cds.location.extract(record.seq)
protein = cds_seq.translate(to_stop=True)
print(f"CDS: {cds_seq}")
print(f"Protein: {protein}")

# Slice record preserving feature annotations
sub = record[0:60]
print(f"Subrecord: {len(sub.seq)} bp, {len(sub.features)} features")

Module 3: Entrez — Programmatic NCBI Database Access

Bio.Entrez wraps the NCBI E-utilities API: esearch finds records matching a query, efetch retrieves full records, elink finds related records across databases, and esummary returns document summaries. Always set Entrez.email and respect the 3 req/s rate limit (10 req/s with API key).

from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"
# Entrez.api_key = "YOUR_API_KEY"  # for 10 req/s

# esearch: find IDs matching a query
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND Homo sapiens[Organism] AND mRNA[Filter]", retmax=10)
search_results = Entrez.read(handle)
handle.close()

print(f"Total hits: {search_results['Count']}")
ids = search_results["IdList"]
print(f"Retrieved IDs: {ids}")

# efetch: download full GenBank records
for acc_id in ids[:3]:
    handle = Entrez.efetch(db="nucleotide", id=acc_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print(f"  {record.id}: {len(record.seq)} bp — {record.description[:60]}")
    time.sleep(0.4)  # stay within rate limit

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# elink: cross-database links (e.g., PubMed article →

biopython-sequence-analysis

How to add

Drop this on your repo README

Get new DevOps e Infra skills every Monday