gget — Unified Genomic Database Access
Overview
gget is a command-line tool and Python package that provides unified access to 20+ genomic databases and analysis methods. Use it to query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. Every module works both as a CLI command and as a Python function; most return DataFrames in Python and JSON or CSV on the command line.
When to Use
- Looking up gene information (names, IDs, descriptions) across species from Ensembl
- Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
- Running BLAST or BLAT searches against standard reference databases
- Predicting protein 3D structures with AlphaFold2 from amino acid sequences
- Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
- Querying single-cell RNA-seq datasets from CELLxGENE Census
- Finding disease and drug associations for a gene target via OpenTargets
- Downloading Ensembl reference genomes and annotations for a species
- Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
- Getting tissue expression and correlated genes from ARCHS4
- For batch processing or advanced BLAST parameters, use biopython instead
- For programmatic multi-database workflows with rate limiting, use bioservices instead
Prerequisites
- Python packages: gget
- Optional setup: Some modules require gget setup <module> before first use (alphafold, cellxgene, elm, gpt)
- Environment: Clean virtual environment recommended to avoid dependency conflicts
- API notes: gget queries remote databases, so rate-limit large batch queries with time.sleep(). Databases update biweekly; keep gget updated. Max ~1000 Ensembl IDs per gget.info() call
pip install gget
# Optional: setup modules that need additional dependencies
gget setup alphafold # ~4GB model parameters, requires OpenMM
gget setup cellxgene # cellxgene-census package
gget setup elm # local ELM database
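The API notes above suggest pausing between remote calls and keeping each gget.info() request under roughly 1000 IDs. A minimal sketch of one way to do that follows. The helper names chunked and query_in_batches are hypothetical (they are not part of gget), and the batch size and delay are assumptions to tune for your workload.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def query_in_batches(
    ids: Sequence[str],
    query_fn: Callable[[List[str]], object],
    batch_size: int = 1000,
    delay_s: float = 1.0,
) -> List[object]:
    """Call `query_fn` once per batch, sleeping between consecutive calls."""
    results = []
    for i, batch in enumerate(chunked(ids, batch_size)):
        if i > 0:
            time.sleep(delay_s)  # pause between remote requests, not before the first
        results.append(query_fn(batch))
    return results


# Hypothetical usage (needs gget, pandas, and network access):
#   import gget, pandas as pd
#   frames = query_in_batches(ensembl_ids, gget.info, batch_size=500, delay_s=2.0)
#   info = pd.concat(frames)
```

Passing the query function in as an argument keeps the helper independent of any one gget module, so the same loop could wrap other remote calls that accept a list of IDs.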
Quick Start
import gget
# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")
# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")
Core API
Module 1: Reference & Gene Search (ref, search, info, seq)
Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.
import gget
# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())
# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")
import gget
# Retrieve sequences
nucleotide_seqs = gget.seq(["ENSG00000012048"])
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
# gget.seq returns a flat list that alternates FASTA headers and sequences
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")
# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
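Sequence results are often easier to work with as a mapping from header to sequence. The sketch below assumes gget.seq returns a flat list of FASTA-style strings, each header starting with ">" and followed by its sequence. The helper fasta_list_to_dict is hypothetical, and the example records are made up rather than real gget output.

```python
from typing import Dict, List, Optional


def fasta_list_to_dict(records: List[str]) -> Dict[str, str]:
    """Pair each '>' header with the sequence line(s) that follow it."""
    sequences: Dict[str, str] = {}
    header: Optional[str] = None
    for line in records:
        if line.startswith(">"):
            header = line[1:].strip()
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line.strip()
        else:
            raise ValueError("sequence line found before any FASTA header")
    return sequences


# Made-up records in the assumed layout (not real gget output)
example = [">tx1 example", "MKWMFK", ">tx2 example", "MKWQ"]
lengths = {name: len(seq) for name, seq in fasta_list_to_dict(example).items()}
print(lengths)  # {'tx1 example': 6, 'tx2 example': 4}
```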
Module 2: Sequence Alignment (blast, blat, muscle, diamond)
BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.
import gget
import time
# BLAST against SwissProt (remote API — add delay for batch queries)
blast_results = gget.blast(
"MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
database="swissprot", limit=10
)
# Column names follow NCBI's results table; inspect blast_results.columns if they differ
print(f"Top hit: {blast_results.iloc[0]['Description']}, E-value: {blast_results.iloc[0]['E value']}")
time.sleep(2) # Rate-limit between BLAST queries
# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
# The chromosome column already carries the "chr" prefix
print(f"Genomic location: {blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")
import gget
# Multiple sequence alignment with Muscle5 (writes aligned FASTA to the `out` path)
gget.muscle("sequences.fasta", out="aligned.afa")
# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
"GGETISAWESQME",
reference="reference.fasta",
sensitivity="very-sensitive",
threads=4
)
print(f"Alignments found: {len(diamond_results)}")
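The alignment examples above read from FASTA files such as sequences.fasta and reference.fasta, which need to exist first. One possible way to create them from in-memory sequences is sketched below. The helper write_fasta is hypothetical (not part of gget), and the sequences in the demo are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Union


def write_fasta(sequences: Mapping[str, str], path: Union[str, Path], width: int = 60) -> Path:
    """Write name -> sequence pairs as FASTA, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    out_path = Path(path)
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


# Demo in a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = write_fasta(
        {"seq1": "ACGTAC", "seq2": "GG"}, Path(tmp) / "sequences.fasta", width=4
    )
    written = fasta_path.read_text()
print(written)
```

In a real workflow you would write to a persistent path and pass that path to gget.muscle or as the reference argument of gget.diamond.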
Module 3: Protein Structure (pdb, alphafold, elm)
Download PDB structures, predict structures with AlphaFold2, find linear motifs.
import gget
# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)
# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
"MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
plot=True, show_sidechains=True
)
print("Structure prediction complete, PDB file saved")
import gget
# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")
Module 4: Expression & Correlation (archs4, cellxgene, bgee)
Gene expression, tissue expression, correlated genes, single-cell data.
import gget
# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")
# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")
import gget
# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
gene=["ACE2", "TMPRSS2"],
tissue="lung",
cell_type="epithelial cell",
census_version="2023-07-25" # pin version for reproducibility
)
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")
# Orthologs and expression from Bgee
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
print(f"Orthologs in {len(orthologs)} species")
Module 5: Disease & Drug Associations (opentargets, enrichr)
Disease associations, drug targets, enrichment analysis.
import gget
# Disease associations from OpenTargets
diseases = gget.opentargets("ENSG00000169194", resource="diseases", limit=10)
print(f"Associated diseases: {len(diseases)}")
# Drug associations
drugs = gget.opentargets("ENSG00000169194", resource="drugs", limit=10)
print(f"Associated drugs: {len(drugs)}")
# OpenTargets resources: diseases, drugs, tractability, pharmacogenetics,
# expression, depmap, interactions
import gget
# Enrichment analysis via Enrichr
# Database shortcuts: 'pathway' (KEGG), 'transcription' (ChEA),
# 'ontology' (GO_BP), 'diseases_drugs' (GWAS), 'celltypes' (PanglaoDB)
enrichment = gget.enrichr(
["ACE2", "AGT", "AGTR1", "TMPRSS2", "DPP4"],
database="ontology"
)
print(f"Enriched terms: {len(enrichment)}")
# gget.enrichr uses its own column names (path_name, adj_p_val), not Enrichr's web labels
print(enrichment[["path_name", "adj_p_val"]].head())
Module 6: Cancer Genomics (cbio, cosmic)
Cancer mutations, copy number alterations, and somatic mutation databases.
import gget
# Search cBioPortal studies
studies = gget.cbio_search(["breast", "lung"])
print(f"Studies found: {len(studies)}")
# Plot cancer genomics heatmap
gget.cbio_plot(
["msk_impact_2017"],
["AKT1", "ALK", "BRAF"],
stratification="tissue",
variation_type="mutation_occurrences"
)
import gget
# COSMIC: requires account + local database download
# First-time: gget.cosmic(searchterm="", download_cosmic=True,
# email="user@example.com", password="xxx", cosmic_project="cancer")
cosmic_results = gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
print