scikit-bio

Overview

scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.

When to Use This Skill

This skill should be used when the user:

Works with biological sequences (DNA, RNA, protein)
Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
Performs sequence alignments or searches for motifs
Constructs or analyzes phylogenetic trees
Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
Performs ordination analysis (PCoA, CCA, RDA)
Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
Analyzes microbiome or community ecology data
Works with protein embeddings from language models
Needs to manipulate biological data tables

Core Capabilities

1. Sequence Manipulation

Work with biological sequences using specialized classes for DNA, RNA, and protein data.

Key operations:

Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
Sequence slicing, concatenation, and searching
Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
Find motifs and patterns using regex
Calculate distances (Hamming, k-mer based)
Handle sequence quality scores and metadata

Common patterns:

import skbio

# Read sequences from file
seq = skbio.DNA.read('input.fasta')

# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()

# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')

# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()

Important notes:

Use DNA, RNA, Protein classes for grammared sequences with validation
Use Sequence class for generic sequences without alphabet restrictions
Quality scores automatically loaded from FASTQ files into positional metadata
Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)

2. Sequence Alignment

Perform pairwise and multiple sequence alignments using the pair_align engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner.

Key capabilities:

Global, local, and semi-global alignment (free ends configurable) in one function
Convenience wrappers pair_align_nucl (BLASTN-like) and pair_align_prot (BLASTP-like)
Configurable scoring: match/mismatch tuple or named substitution matrix; linear or affine gap penalties
PairAlignPath results carry CIGAR strings and convert to aligned sequences
Multiple sequence alignment storage and manipulation with TabularMSA

Common patterns:

from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA

# Nucleotide alignment with BLASTN-like defaults
seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA')
aln = pair_align_nucl(seq1, seq2)
aln.score                                  # alignment score (float)
path = aln.paths[0]                        # PairAlignPath (repr shows CIGAR)
aligned_seqs = path.to_aligned((seq1, seq2))  # list of gapped strings

# Build a TabularMSA from the alignment path + original sequences
msa = TabularMSA.from_path_seqs(path, (seq1, seq2))

# Customize the algorithm via pair_align (default mode='global')
aln = pair_align(seq1, seq2, mode='local')                       # Smith-Waterman
aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps
aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3)    # substitution matrix, linear gap

# Protein alignment (BLASTP-like, BLOSUM62)
aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE'))

# Read a multiple alignment from file and summarize
msa = TabularMSA.read('alignment.fasta', constructor=DNA)
consensus = msa.consensus()

Important notes:

pair_align replaces the removed SSW wrapper (local_pairwise_align_ssw, StripedSmithWaterman) and the deprecated pure-Python aligners (global_pairwise_align, local_pairwise_align_nucleotide, etc.)
The result is a PairAlignResult that also unpacks as score, paths, matrices (use keep_matrices=True to retain the DP matrix)
sub_score accepts a (match, mismatch) tuple or a matrix name (e.g., 'NUC.4.4', 'BLOSUM62'); gap_cost accepts a single number (linear) or (open, extend) tuple (affine)
Parse external CIGAR strings with PairAlignPath.from_cigar('1I8M2D5M2I'); score an existing alignment with align_score(...) and build a distance matrix from an MSA with align_dists(...)

3. Phylogenetic Trees

Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.

Key capabilities:

Tree construction from distance matrices (UPGMA/WPGMA, Neighbor Joining, GME, BME)
Tree rearrangement with nearest neighbor interchange (nni)
Tree manipulation (pruning, rerooting, traversal)
Distance calculations (patristic via cophenet, Robinson-Foulds via compare_rfd)
ASCII visualization
Newick format I/O

Common patterns:

from skbio import TreeNode
from skbio.tree import nj, upgma, gme, bme, rf_dists

# Read tree from file
tree = TreeNode.read('tree.nwk')

# Construct tree from distance matrix
tree = nj(distance_matrix)

# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lca(['taxon1', 'taxon2'])

# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_dm = tree.cophenet()           # patristic distance matrix among tips

# Compare two trees (Robinson-Foulds)
rf_distance = tree.compare_rfd(other_tree)
# Pairwise RF distances among many trees -> DistanceMatrix
rf_dm = rf_dists([tree, other_tree, third_tree])

Important notes:

Use nj() for neighbor joining (classic phylogenetic method)
Use upgma() for UPGMA/WPGMA (assumes molecular clock)
GME and BME are highly scalable for large trees; refine topology with nni()
cophenet() (formerly tip_tip_distances) returns the patristic distance matrix; compare_rfd() is the Robinson-Foulds method (compare_wrfd/compare_cophenet for weighted/cophenetic variants)
lca() is the lowest common ancestor; lowest_common_ancestor remains as an alias
Trees can be rooted or unrooted; some metrics require specific rooting

4. Diversity Analysis

Calculate alpha and beta diversity metrics for microbial ecology and community analysis.

Key capabilities:

Alpha diversity: richness (sobs, observed_features, chao1, ace), Shannon, Simpson, Hill numbers (hill), Faith's PD (faith_pd), generalized PD (phydiv), Pielou's evenness
Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
Phylogenetic diversity metrics (require tree input)
Rarefaction and subsampling
Integration with ordination and statistical tests

Common patterns:

from skbio.diversity import alpha_diversity, beta_diversity

# Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
                           tree=tree, taxa=feature_ids)

# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
                            ids=sample_ids, tree=tree, taxa=feature_ids)

# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())

Important notes:

Counts must be integers representing abundances, not relative frequencies
The phylogenetic-metric argument is taxa= (renamed from otu_ids in 0.6.0; the old name is a

scikit-bio

How to add

Drop this on your repo README

Related skills

xlsx

how-it-works

mem-search

weekly-digests

Get new Dados e Análise skills every Monday

scikit-bio

Overview

When to Use This Skill

Core Capabilities

1. Sequence Manipulation

2. Sequence Alignment

3. Phylogenetic Trees

4. Diversity Analysis

Comments · No comments