gget — Unified Genomic Database Access

Overview

gget is a command-line and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).

When to Use

Looking up gene information (names, IDs, descriptions) across species from Ensembl
Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
Running BLAST or BLAT searches against standard reference databases
Predicting protein 3D structures with AlphaFold2 from amino acid sequences
Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
Querying single-cell RNA-seq datasets from CELLxGENE Census
Finding disease and drug associations for a gene target via OpenTargets
Downloading Ensembl reference genomes and annotations for a species
Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
Getting tissue expression and correlated genes from ARCHS4
Not recommended for batch processing or advanced BLAST parameters: use biopython instead
Not recommended for programmatic multi-database workflows that need built-in rate limiting: use bioservices instead

Prerequisites

Python packages: gget
Optional setup: Some modules require gget setup <module> before first use (alphafold, cellxgene, elm, gpt)
Environment: Clean virtual environment recommended to avoid dependency conflicts
API notes: gget queries remote databases, so rate-limit large batch queries with time.sleep(). Upstream databases change their structure from time to time; keep gget updated so its modules keep matching them. Pass at most ~1000 Ensembl IDs per gget.info() call

pip install gget

# Optional: setup modules that need additional dependencies
gget setup alphafold   # ~4GB model parameters, requires OpenMM
gget setup cellxgene   # cellxgene-census package
gget setup elm         # local ELM database
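
The API notes above (at most ~1000 IDs per call, with a pause between remote requests) can be sketched as a small batching helper. This is a minimal illustration, not part of gget: `chunked` and `fetch_in_batches` are hypothetical names, and the `fetch` argument stands in for whichever gget function is being called.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(
    ids: Sequence[str],
    fetch: Callable[[List[str]], object],
    size: int = 1000,
    delay: float = 1.0,
) -> list:
    """Call `fetch` once per batch, sleeping `delay` seconds between calls."""
    results = []
    for i, batch in enumerate(chunked(ids, size)):
        if i > 0:
            time.sleep(delay)  # pause between remote requests, not before the first
        results.append(fetch(batch))
    return results
```

With gget installed, something like `fetch_in_batches(ensembl_ids, gget.info)` should return one result per batch, which can then be combined (for DataFrames, with `pandas.concat`). The one-second default delay is an arbitrary starting point rather than a documented limit.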

Quick Start

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")

# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")

Core API

Module 1: Reference & Gene Search (ref, search, info, seq)

Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")

import gget

# Retrieve sequences
nucleotide_seqs = gget.seq(["ENSG00000012048"])
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
# gget.seq returns a FASTA-style list that alternates header and sequence entries
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")

# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
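
Sequence results are easier to work with once headers and sequences are paired. The sketch below assumes the sequences arrive as a FASTA-style list of strings, with header entries starting with `>` followed by one or more sequence entries. `fasta_list_to_dict` is a hypothetical helper, not a gget function.

```python
from typing import Dict, List, Optional


def fasta_list_to_dict(lines: List[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence.

    Assumes header entries start with '>' and that every following
    non-header entry belongs to the most recent header.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank entries
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence entry found before any header")
        else:
            records[header] += line
    return records
```

Applied to the list returned for a translated, multi-isoform query, `len(fasta_list_to_dict(protein_seqs))` would give the number of isoforms, provided the assumed list layout holds.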

Module 2: Sequence Alignment (blast, blat, muscle, diamond)

BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.

import gget
import time

# BLAST against SwissProt (remote API; add a delay between batch queries)
blast_results = gget.blast(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    database="swissprot", limit=10
)
top_hit = blast_results.iloc[0]
# Column names follow NCBI's results table (e.g. "E value"); check blast_results.columns if this fails
print(f"Top hit: {top_hit['Description']}, E-value: {top_hit['E value']}")
time.sleep(2)  # Rate-limit between BLAST queries

# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
print(f"Genomic location: chr{blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")

import gget

# Multiple sequence alignment with Muscle5 (out= writes the alignment to a file)
gget.muscle("sequences.fasta", out="aligned.afa")

# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
    "GGETISAWESQME",
    reference="reference.fasta",
    sensitivity="very-sensitive",
    threads=4
)
print(f"Alignments found: {len(diamond_results)}")
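
The alignment examples above read their input from FASTA files such as sequences.fasta and reference.fasta. If the sequences are held in memory, a small writer like the one below can produce that file first. This is a minimal sketch using only the standard library; `write_fasta` and its 60-character line width are illustrative choices, not something gget provides or requires.

```python
from pathlib import Path
from typing import Mapping


def write_fasta(records: Mapping[str, str], path: str, width: int = 60) -> Path:
    """Write name -> sequence pairs to `path` in FASTA format.

    Each sequence is wrapped at `width` characters per line.
    """
    out = Path(path)
    with out.open("w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return out
```

For example, `write_fasta({"seq1": "MKWMFKEDHSLE", "seq2": "MKWMFKEDHSAE"}, "sequences.fasta")` creates a two-record file that could then be passed to the alignment call.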

Module 3: Protein Structure (pdb, alphafold, elm)

Download PDB structures, predict structures with AlphaFold2, find linear motifs.

import gget

# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)

# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True, show_sidechains=True
)
print("Structure prediction complete, PDB file saved")

import gget

# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")
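
Structure prediction is slow, so it can help to check an amino acid sequence before submitting it. The sketch below is one possible pre-check, not part of gget. `clean_protein_sequence` is a hypothetical helper, and rejecting everything outside the 20 standard residues (including ambiguity codes such as X or B) is an assumption made here, not a stated requirement of the modules above.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(seq: str) -> str:
    """Return `seq` upper-cased with whitespace removed.

    Raises ValueError if the sequence is empty or contains characters
    outside the 20 standard one-letter amino acid codes.
    """
    cleaned = "".join(seq.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    invalid = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residues: {', '.join(invalid)}")
    return cleaned
```

A call such as `clean_protein_sequence("mkwm fked")` returns `"MKWMFKED"`, which can then be handed to the prediction or motif-search call.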

Module 4: Expression & Correlation (archs4, cellxgene, bgee)

Gene expression, tissue expression, correlated genes, single-cell data.

import gget

# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")

# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")

import gget

# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
    gene=["ACE2", "TMPRSS2"],
    tissue="lung",
    cell_type="epithelial cell",
    census_version="2023-07-25"  # pin version for reproducibility
)
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

# Orthologs and expression from Bgee
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
print(f"Orthologs in {len(orthologs)} species")

Module 5: Disease & Drug Associations (opentargets, enrichr)

Disease associations, drug targets, enrichment analysis.

import gget

# Disease associations from OpenTargets
diseases = gget.opentargets("ENSG00000169194", resource="diseases", limit=10)
print(f"Associated diseases: {len(diseases)}")

# Drug associations
drugs = gget.opentargets("ENSG00000169194", resource="drugs", limit=10)
print(f"Associated drugs: {len(drugs)}")

# OpenTargets resources: diseases, drugs, tractability, pharmacogenetics,
#   expression, depmap, interactions

import gget

# Enrichment analysis via Enrichr
# Database shortcuts: 'pathway' (KEGG), 'transcription' (ChEA),
#   'ontology' (GO_BP), 'diseases_drugs' (GWAS), 'celltypes' (PanglaoDB)
enrichment = gget.enrichr(
    ["ACE2", "AGT", "AGTR1", "TMPRSS2", "DPP4"],
    database="ontology"
)
print(f"Enriched terms: {len(enrichment)}")
# gget.enrichr names its columns path_name and adj_p_val (not Enrichr's "Term" / "Adjusted P-value")
print(enrichment[["path_name", "adj_p_val"]].head())

Module 6: Cancer Genomics (cbio, cosmic)

Cancer mutations, copy number alterations, and somatic mutation databases.

import gget

# Search cBioPortal studies
studies = gget.cbio_search(["breast", "lung"])
print(f"Studies found: {len(studies)}")

# Plot cancer genomics heatmap
gget.cbio_plot(
    ["msk_impact_2017"],
    ["AKT1", "ALK", "BRAF"],
    stratification="tissue",
    variation_type="mutation_occurrences"
)

import gget

# COSMIC: requires account + local database download
# First-time: gget.cosmic(searchterm="", download_cosmic=True,
#   email="user@example.com", password="xxx", cosmic_project="cancer")
cosmic_results = gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
print(cosmic_results)
