GEO Database

Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities

1. Understanding GEO Data Organization

GEO organizes data hierarchically using different accession types:

Series (GSE): A complete experiment with a set of related samples

Example: GSE123456
Contains experimental design, samples, and overall study information
Largest organizational unit in GEO
Current count: 264,928+ series

Sample (GSM): A single experimental sample or biological replicate

Example: GSM987654
Contains individual sample data, protocols, and metadata
Linked to platforms and series
Current count: 8,068,632+ samples

Platform (GPL): The microarray or sequencing platform used

Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
Describes the technology and probe/feature annotations
Shared across multiple experiments
Current count: 27,739+ platforms

DataSet (GDS): Curated collections with consistent formatting

Example: GDS5678
Experimentally-comparable samples organized by study design
Processed for differential analysis
Subset of GEO data (4,348 curated datasets)
Ideal for quick comparative analyses

Profiles: Gene-specific expression data linked to sequence features

Queryable by gene name or annotation
Cross-references to Entrez Gene
Enables gene-centric searches across all studies
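
Because each accession type carries a fixed three-letter prefix, the record type can usually be inferred from the accession string alone. The sketch below illustrates this; the helper name `classify_accession` is hypothetical and not part of any GEO library.

```python
import re

# Prefix-to-type mapping taken from the accession types described above.
GEO_ACCESSION_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

_ACCESSION_RE = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_ACCESSION_TYPES[match.group(1)]


print(classify_accession("GSE123456"))  # Series
print(classify_accession("GPL570"))     # Platform
```

A check like this can route an accession to the appropriate retrieval path before any network request is made.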

2. Searching GEO Data

GEO DataSets Search:

Search for studies by keywords, organism, or experimental conditions:

from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")

GEO Profiles Search:

Find gene-specific expression patterns:

# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")

Advanced Search Patterns:

# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find human high-throughput sequencing studies published in 2024
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
# (GEO DataSets has no [Disease] field; search the condition as a MeSH term)
search_terms = [
    "Smith[Author]",
    "diabetes[MeSH Terms]"
]
results = advanced_geo_search(search_terms)
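
Multi-word terms are easier to keep intact when the query is built from field/term pairs rather than hand-written strings. The following is a minimal sketch, assuming the standard GEO DataSets field tags (`Organism`, `DataSet Type`, `Entry Type`); the helper name `build_geo_query` is hypothetical.

```python
def build_geo_query(fields, operator="AND"):
    """Join {field: term} pairs into an Entrez query string.

    Multi-word terms are wrapped in double quotes so that they are
    searched as a phrase rather than as separate words.
    """
    clauses = []
    for field, term in fields.items():
        term = term.strip()
        if " " in term and not term.startswith('"'):
            term = f'"{term}"'
        clauses.append(f"{term}[{field}]")
    return f" {operator} ".join(clauses)


query = build_geo_query({
    "Organism": "Homo sapiens",
    "DataSet Type": "expression profiling by high throughput sequencing",
    "Entry Type": "gse",  # restrict results to Series records
})
print(query)
```

The resulting string can be passed directly as the `term` argument of an `Entrez.esearch` call against the `gds` database.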

3. Retrieving GEO Data with GEOparse (Recommended)

GEOparse is a widely used Python library for downloading and parsing GEO records (SOFT files):

Installation:

uv pip install GEOparse

Basic Usage:

import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata (each metadata value is a list of strings)
print(gse.metadata['title'][0])
print(gse.metadata['summary'][0])
print(gse.metadata['overall_design'][0])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
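
GEOparse stores each metadata value as a list of strings, and not every record defines every key, so indexing with `[0]` directly can raise `KeyError` or `IndexError`. A small accessor avoids both cases. This is a sketch; `first_value` is a hypothetical helper and the dictionary below is made-up example data in the shape of `gsm.metadata`.

```python
def first_value(metadata, key, default=""):
    """Return the first entry stored under key, or default if missing or empty."""
    values = metadata.get(key) or []
    return values[0] if values else default


# Made-up metadata in the shape of gsm.metadata (values are lists of strings).
example_metadata = {
    "title": ["Liver, control, replicate 1"],
    "characteristics_ch1": [],
}

print(first_value(example_metadata, "title"))
print(first_value(example_metadata, "source_name_ch1", default="n/a"))
```

With a real sample, the call would look like `first_value(gsm.metadata, 'source_name_ch1')`.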

Working with Expression Data:

import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot the per-sample tables into a single matrix
expression_df = gse.pivot_samples('VALUE')
print(expression_df.shape)  # probes (ID_REF) x samples

# Method 2: Build the matrix from individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    # Skip samples that have no data table
    if not gsm.table.empty:
        # Index by probe ID so values align across samples
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")

Accessing Supplementary Files:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List supplementary file URLs recorded in each sample's metadata
for gsm_name, gsm in gse.gsms.items():
    file_urls = gsm.metadata.get('supplementary_file', [])
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")
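
Series-level supplementary files can also be located without parsing the record, because the GEO FTP site groups series into directories named after the accession with its last three digits replaced by `nnn`. The sketch below builds that URL; `geo_series_ftp_url` is a hypothetical helper, and it does not check that the directory actually exists.

```python
def geo_series_ftp_url(gse_accession):
    """Build the supplementary-file directory URL for a GEO Series accession."""
    acc = gse_accession.strip().upper()
    if not (acc.startswith("GSE") and acc[3:].isdigit()):
        raise ValueError(f"Not a GEO Series accession: {gse_accession!r}")
    digits = acc[3:]
    # GSE123456 -> GSE123nnn; accessions with three or fewer digits -> GSEnnn
    stub = "GSE" + digits[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{acc}/suppl/"


print(geo_series_ftp_url("GSE123456"))
```

Not every series has a `suppl/` directory, so a request to the returned URL should be treated as something that may fail.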

Filtering and Subsetting Data:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]

treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]

print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]

4. Using NCBI E-utilities for GEO Access

E-utilities provide lower-level programmatic access to GEO metadata:

Basic E-utilities Workflow:

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch GEO records as plain text (efetch returns text, not XML, for gds)"""
    ids = ",".join(id_list)
    handle = Entrez.efetch(db=db, id=ids, rettype="summary", retmode="text")
    records = handle.read()
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_resul
