scikit-bio
Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
Core Capabilities
1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
Key operations:
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
Common patterns:
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
Important notes:
- Use
DNA,RNA,Proteinclasses for grammared sequences with validation - Use
Sequenceclass for generic sequences without alphabet restrictions - Quality scores automatically loaded from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
2. Sequence Alignment
Perform pairwise and multiple sequence alignments using the pair_align engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner.
Key capabilities:
- Global, local, and semi-global alignment (free ends configurable) in one function
- Convenience wrappers
pair_align_nucl(BLASTN-like) andpair_align_prot(BLASTP-like) - Configurable scoring: match/mismatch tuple or named substitution matrix; linear or affine gap penalties
PairAlignPathresults carry CIGAR strings and convert to aligned sequences- Multiple sequence alignment storage and manipulation with
TabularMSA
Common patterns:
from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA
# Nucleotide alignment with BLASTN-like defaults
seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA')
aln = pair_align_nucl(seq1, seq2)
aln.score # alignment score (float)
path = aln.paths[0] # PairAlignPath (repr shows CIGAR)
aligned_seqs = path.to_aligned((seq1, seq2)) # list of gapped strings
# Build a TabularMSA from the alignment path + original sequences
msa = TabularMSA.from_path_seqs(path, (seq1, seq2))
# Customize the algorithm via pair_align (default mode='global')
aln = pair_align(seq1, seq2, mode='local') # Smith-Waterman
aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps
aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3) # substitution matrix, linear gap
# Protein alignment (BLASTP-like, BLOSUM62)
aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE'))
# Read a multiple alignment from file and summarize
msa = TabularMSA.read('alignment.fasta', constructor=DNA)
consensus = msa.consensus()
Important notes:
pair_alignreplaces the removed SSW wrapper (local_pairwise_align_ssw,StripedSmithWaterman) and the deprecated pure-Python aligners (global_pairwise_align,local_pairwise_align_nucleotide, etc.)- The result is a
PairAlignResultthat also unpacks asscore, paths, matrices(usekeep_matrices=Trueto retain the DP matrix) sub_scoreaccepts a(match, mismatch)tuple or a matrix name (e.g.,'NUC.4.4','BLOSUM62');gap_costaccepts a single number (linear) or(open, extend)tuple (affine)- Parse external CIGAR strings with
PairAlignPath.from_cigar('1I8M2D5M2I'); score an existing alignment withalign_score(...)and build a distance matrix from an MSA withalign_dists(...)
3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
Key capabilities:
- Tree construction from distance matrices (UPGMA/WPGMA, Neighbor Joining, GME, BME)
- Tree rearrangement with nearest neighbor interchange (
nni) - Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic via
cophenet, Robinson-Foulds viacompare_rfd) - ASCII visualization
- Newick format I/O
Common patterns:
from skbio import TreeNode
from skbio.tree import nj, upgma, gme, bme, rf_dists
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lca(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_dm = tree.cophenet() # patristic distance matrix among tips
# Compare two trees (Robinson-Foulds)
rf_distance = tree.compare_rfd(other_tree)
# Pairwise RF distances among many trees -> DistanceMatrix
rf_dm = rf_dists([tree, other_tree, third_tree])
Important notes:
- Use
nj()for neighbor joining (classic phylogenetic method) - Use
upgma()for UPGMA/WPGMA (assumes molecular clock) - GME and BME are highly scalable for large trees; refine topology with
nni() cophenet()(formerlytip_tip_distances) returns the patristic distance matrix;compare_rfd()is the Robinson-Foulds method (compare_wrfd/compare_cophenetfor weighted/cophenetic variants)lca()is the lowest common ancestor;lowest_common_ancestorremains as an alias- Trees can be rooted or unrooted; some metrics require specific rooting
4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
Key capabilities:
- Alpha diversity: richness (
sobs,observed_features,chao1,ace), Shannon, Simpson, Hill numbers (hill), Faith's PD (faith_pd), generalized PD (phydiv), Pielou's evenness - Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
Common patterns:
from skbio.diversity import alpha_diversity, beta_diversity
# Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
tree=tree, taxa=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
ids=sample_ids, tree=tree, taxa=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
Important notes:
- Counts must be integers representing abundances, not relative frequencies
- The phylogenetic-metric argument is
taxa=(renamed fromotu_idsin 0.6.0; the old name is a