Search and Cross-Reference NCBI GEO with ENCODE
When to Use
- User wants to find complementary datasets in NCBI GEO to supplement ENCODE data
- User asks about "GEO", "Gene Expression Omnibus", "supplementary data", or "find related datasets"
- User needs to cross-reference ENCODE experiments with GEO series for additional replicates or conditions
- User wants to link ENCODE accessions to GEO/SRA identifiers for data sharing or citation
- Example queries: "find GEO datasets for pancreatic islet RNA-seq", "link this ENCODE experiment to GEO", "search GEO for complementary ATAC-seq data"
Query the Gene Expression Omnibus programmatically to find complementary datasets, cross-reference ENCODE experiments, and download metadata.
Scientific Rationale
The question: "What additional expression or epigenomic datasets exist in GEO that complement my ENCODE analysis?"
GEO hosts >200,000 series across all organisms and assay types. Many ENCODE experiments are deposited in GEO as secondary archives (ENCODE Portal is primary). GEO also contains vast amounts of non-ENCODE data — disease cohorts, perturbation experiments, time courses — that complement ENCODE's reference epigenomes.
GEO ↔ ENCODE Relationship
- ENCODE processed data is deposited at GEO as standard GSE submissions
- Raw sequencing data goes to SRA (linked from both GEO and ENCODE)
- The ENCODE Portal is the canonical source; GEO is a secondary archive
- GEO accessions are stored in ENCODE's dbxrefs field as GEO:GSExxxxx
- NCBI maintains a dedicated ENCODE listing: https://www.ncbi.nlm.nih.gov/geo/encode/
GEO Entity Hierarchy
Series (GSE) — An experiment/study
├── Sample (GSM) — Individual measurements
│ ├── references → Platform (GPL)
│ ├── has → Supplementary files (raw data)
│ └── has → Data table (normalized values)
│
└── curated into → DataSet (GDS) [not all GSE get curated]
└── generates → Profiles (gene-level summaries)
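The accession prefix indicates which level of this hierarchy a record belongs to. A minimal sketch of a prefix classifier, assuming accessions follow the usual prefix-plus-digits shape; the name `geo_entity_type` is illustrative and not part of any GEO client library:

```python
import re

# Map accession prefixes to the entity types in the hierarchy above.
GEO_PREFIXES = {
    "GSE": "series",
    "GSM": "sample",
    "GPL": "platform",
    "GDS": "dataset",
}

_ACCESSION_RE = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def geo_entity_type(accession):
    """Return the entity type for a GEO accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_PREFIXES[match.group(1)]
```

Returning None for anything else (for example an ENCSR accession) lets a caller route non-GEO identifiers elsewhere instead of failing.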
Step 1: Find GEO Accessions for ENCODE Experiments
From ENCODE → GEO
ENCODE experiments may have GEO cross-references in their metadata. After tracking an experiment:
encode_track_experiment(accession="ENCSR...")
Check the experiment's dbxrefs field for GEO:GSExxxxx entries. If found, link it:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="geo_accession",
reference_id="GSE12345"
)
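The dbxrefs check can be sketched as a small filter. This assumes dbxrefs arrives as a list of "NAMESPACE:ID" strings; accepting GSM entries alongside GSE is an assumption beyond the GEO:GSExxxxx form named above, and the function name is hypothetical:

```python
import re

# Matches entries such as "GEO:GSE12345". GSM support is an assumption.
_GEO_DBXREF_RE = re.compile(r"^GEO:(GS[EM]\d+)$", re.IGNORECASE)


def geo_accessions_from_dbxrefs(dbxrefs):
    """Return GEO accessions found in a dbxrefs list, in order, without duplicates."""
    found = []
    for entry in dbxrefs or []:
        match = _GEO_DBXREF_RE.match(entry.strip())
        if match:
            accession = match.group(1).upper()
            if accession not in found:
                found.append(accession)
    return found
```

An empty result means the experiment has no GEO cross-reference to link, so the encode_link_reference call can be skipped.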
From GEO → ENCODE
Search GEO for ENCODE-deposited data:
# Via NCBI E-utilities
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=ENCODE[KEYWORD]+AND+gse[ETYP]&retmax=100&usehistory=y&tool=encode_mcp&email=YOUR_EMAIL"
Step 2: Search GEO for Complementary Datasets
E-utilities Search Syntax
Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
Required parameters: db=gds, term=QUERY, tool=encode_mcp, email=YOUR_EMAIL
Rate limit: 3 req/sec without API key, 10 req/sec with key. Get a key at https://www.ncbi.nlm.nih.gov/account/
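One way to stay under these limits in a script is to enforce a minimum gap between calls. The sketch below is illustrative only: the `Throttle` class is not part of any NCBI library, and the injectable `clock` and `sleep` arguments exist only to make it easy to exercise without real delays.

```python
import time


class Throttle:
    """Space out calls so they stay under a requests-per-second limit."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # Limits from the text above: 3 req/sec without a key, 10 with one.
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each E-utilities request is enough for a single-threaded script; concurrent callers would need a lock around `wait`.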
Search Field Qualifiers
| Qualifier | Purpose | Example |
|---|---|---|
| [ETYP] | Entry type | gse[ETYP], gds[ETYP] |
| [ORGN] | Organism | "Homo sapiens"[ORGN] |
| [PDAT] | Publication date | 2024[PDAT] |
| [ACCN] | Accession | GPL96[ACCN] |
| [suppFile] | Supplementary file type | bed[suppFile], bw[suppFile] |
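These qualifiers combine with AND into a single term. As a rough sketch, the helpers below assemble a term and a percent-encoded ESearch URL using only the standard library. The function names and defaults are illustrative; `api_key` is the standard E-utilities parameter for passing a key.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_term(keywords, organism=None, entry_type="gse", year=None, supp_file=None):
    """Join free-text keywords and field qualifiers with AND."""
    parts = list(keywords)
    if organism:
        parts.append(f'"{organism}"[ORGN]')
    if entry_type:
        parts.append(f"{entry_type}[ETYP]")
    if year:
        parts.append(f"{year}[PDAT]")
    if supp_file:
        parts.append(f"{supp_file}[suppFile]")
    return " AND ".join(parts)


def build_esearch_url(term, email, tool="encode_mcp", retmax=50, api_key=None):
    """Return an ESearch URL against db=gds with the term percent-encoded."""
    params = {"db": "gds", "term": term, "retmax": retmax, "tool": tool, "email": email}
    if api_key:
        params["api_key"] = api_key
    return ESEARCH_URL + "?" + urlencode(params)
```

Letting `urlencode` handle quoting avoids hand-escaping quotes and brackets in the term, which is easy to get wrong in a shell command.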
Example Searches
# Human pancreas ATAC-seq datasets with BED files
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=pancreas+AND+ATAC-seq+AND+%22Homo+sapiens%22[ORGN]+AND+gse[ETYP]+AND+bed[suppFile]&retmax=50&tool=encode_mcp&email=YOUR_EMAIL"
# ChIP-seq datasets from a specific year
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=ChIP-seq+AND+H3K27ac+AND+gse[ETYP]+AND+2024[PDAT]&retmax=50&tool=encode_mcp&email=YOUR_EMAIL"
# Datasets associated with a PubMed ID
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gds&id=PMID&tool=encode_mcp&email=YOUR_EMAIL"
Step 3: Retrieve GEO Metadata
Get Summary for GEO Records
# Step 1: Search (returns UIDs, NOT accessions)
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GSE12345[ACCN]&tool=encode_mcp&email=YOUR_EMAIL"
# Step 2: Get summary (use UID from step 1)
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&id=UID&version=2.0&tool=encode_mcp&email=YOUR_EMAIL"
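The two calls above can be chained in a script by parsing their responses. The sketch below assumes `retmode=json` is appended to both URLs. The summary field names (`accession`, `title`, `entrytype`) are assumptions about the gds summary schema and should be checked against a live response.

```python
import json


def uids_from_esearch(payload):
    """Extract the list of numeric UIDs from an ESearch JSON response."""
    return json.loads(payload)["esearchresult"]["idlist"]


def accessions_from_esummary(payload):
    """Map each UID in an ESummary JSON response to its accession and title."""
    result = json.loads(payload)["result"]
    records = {}
    for uid in result.get("uids", []):
        doc = result[uid]
        records[uid] = {
            "accession": doc.get("accession"),  # assumed field name
            "title": doc.get("title"),  # assumed field name
            "entry_type": doc.get("entrytype"),  # assumed field name
        }
    return records
```

Using `.get` for the summary fields means a missing or renamed field yields None instead of raising, which makes schema drift easier to spot.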
Direct Record Access (acc.cgi)
# Get full SOFT-format record
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=self&view=full&form=text"
# Get XML (MINiML) format
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=self&view=full&form=xml"
# Get all sample metadata for a series
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=gsm&view=brief&form=text"
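SOFT text from acc.cgi is line-oriented: entity markers start with `^`, and attributes appear as `!Key = value`, with repeated keys for multi-valued fields. A minimal parsing sketch under that assumption follows. It ignores data tables entirely, and the function name is illustrative.

```python
from collections import defaultdict


def parse_soft_attributes(text):
    """Collect '!Key = value' attribute lines from SOFT text into a dict of lists."""
    attributes = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("!") or " = " not in line:
            continue  # skip anything that is not a '!Key = value' line
        key, value = line[1:].split(" = ", 1)
        attributes[key.strip()].append(value.strip())
    return dict(attributes)
```

Keeping every value in a list preserves repeated attributes such as `Series_sample_id`, which would otherwise overwrite one another.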
Step 4: Download GEO Data Files
FTP Directory Convention
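GEO groups accessions into "nnn" directories: replace the last three digits of the accession number with "nnn". Accessions with three or fewer digits sit directly under GSEnnn or GSMnnn.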
| Accession | FTP Path |
|---|---|
| GSE12345 | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/ |
| GSM575 | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM575/ |
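The directory rule is easy to get wrong by hand for short accessions, so a small helper may be useful. This sketch covers only the series and sample layouts shown in the table. It returns the HTTPS form of the path (the same host serves both), and the function name is illustrative.

```python
import re

GEO_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo"
_KIND_DIRS = {"GSE": "series", "GSM": "samples"}


def geo_ftp_dir(accession):
    """Return the directory URL for a GSE or GSM accession."""
    match = re.fullmatch(r"(GSE|GSM)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GSE/GSM accession: {accession!r}")
    prefix, digits = match.groups()
    # Drop the last three digits; accessions with three or fewer digits keep none.
    stub = prefix + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/{_KIND_DIRS[prefix]}/{stub}/{prefix}{digits}/"
```

For example, `geo_ftp_dir("GSE12345")` yields the GSE12nnn directory and `geo_ftp_dir("GSM575")` yields GSMnnn, matching the two rows above.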
Key Download Paths
| Content | Path Under Series Directory |
|---|---|
| Series matrix (expression table) | matrix/GSE12345_series_matrix.txt.gz |
| SOFT metadata | soft/GSE12345_family.soft.gz |
| MINiML (XML) | miniml/GSE12345_family.xml.tgz |
| All supplementary files | suppl/GSE12345_RAW.tar |
| Individual supplementary | suppl/FILENAME.gz |
Download Commands
# Download series matrix (fastest for expression data)
wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/matrix/GSE12345_series_matrix.txt.gz"
# Download all supplementary files
wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/suppl/GSE12345_RAW.tar"
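Once downloaded, a series matrix can be read with the standard library. The sketch below assumes the usual layout: tab-separated `!Key` metadata lines, followed by a value table between `!series_matrix_table_begin` and `!series_matrix_table_end`. For sequencing-based series the table is often empty because the values live in supplementary files. The function name is illustrative.

```python
import csv
import gzip


def read_series_matrix(path):
    """Return (metadata, header, rows) from a gzipped GEO series matrix file."""
    metadata, header, rows = {}, None, []
    in_table = False
    with gzip.open(path, "rt", encoding="utf-8", errors="replace", newline="") as handle:
        # csv.reader strips the double quotes GEO puts around most fields.
        for fields in csv.reader(handle, delimiter="\t"):
            if not fields or fields == [""]:
                continue  # blank separator line
            first = fields[0]
            if first == "!series_matrix_table_begin":
                in_table = True
            elif first == "!series_matrix_table_end":
                in_table = False
            elif in_table:
                if header is None:
                    header = fields  # first table row: ID_REF plus GSM columns
                else:
                    rows.append(fields)
            elif first.startswith("!"):
                metadata.setdefault(first[1:], []).extend(fields[1:])
    return metadata, header, rows
```

Values are returned as strings. Converting to numbers is left to the caller because missing values and ID columns need dataset-specific handling.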
Format Selection Guide
| Use Case | Format | Speed |
|---|---|---|
| Expression matrix analysis | Series matrix | Fastest (10-100x vs SOFT) |
| Full metadata extraction | SOFT | Complete but slow |
| XML processing | MINiML | Good for programmatic parsing |
| Peak/BED files | Supplementary | Direct download |
| Raw sequencing reads | SRA (not GEO) | Use SRA Toolkit |
Step 5: Cross-Reference Workflow
ENCODE + GEO Integration Pattern
1. Find ENCODE experiments of interest:
encode_search_experiments(assay_title="total RNA-seq", organ="pancreas")
2. For each experiment, check for GEO accession:
encode_get_experiment(accession="ENCSR...")
→ Look in dbxrefs for "GEO:GSExxxxx"
3. If GEO accession found, link it:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="geo_accession",
reference_id="GSE12345"
)
4. Search GEO for complementary non-ENCODE datasets:
E-utils search for same tissue + different assay or condition
5. Download GEO metadata for comparison:
acc.cgi or E-utils esummary
6. Log the cross-reference:
encode_log_derived_file(
file_path="/path/to/comparison.tsv",
source_accessions=["ENCSR...", "GSE12345"],
description="ENCODE-GEO cross-tissue comparison"
)
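For step 4 of the pattern above, it can help to separate GEO hits that ENCODE already links from those that are genuinely new. The sketch below assumes the GSE accessions from the search and the dbxrefs of the tracked experiments are already in hand. The function name is hypothetical. A series missing from dbxrefs is not guaranteed to be non-ENCODE, so treat the second list as candidates only.

```python
def split_geo_hits(geo_hits, encode_dbxrefs):
    """Split GEO accessions into those already linked from ENCODE and the rest."""
    linked = {
        ref.split(":", 1)[1].upper()
        for ref in encode_dbxrefs
        if ref.upper().startswith("GEO:")
    }
    already_linked = [acc for acc in geo_hits if acc.upper() in linked]
    complementary = [acc for acc in geo_hits if acc.upper() not in linked]
    return already_linked, complementary
```

The first list feeds encode_link_reference; the second is the set of candidates worth pulling metadata for in step 5.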
Finding SRA Accessions from GEO
For sequencing data, raw reads are in SRA, not GEO:
# Link GEO to SRA
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gds&db=sra&id=GDS_UID&tool=encode_mcp&email=YOUR_EMAIL"
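ELink returns XML by default, with linked UIDs nested under LinkSetDb elements. A minimal sketch for pulling out the SRA UIDs follows; the function name is illustrative. The values returned are SRA database UIDs rather than SRR run accessions, so a further ESummary or EFetch call against db=sra is needed to resolve them.

```python
import xml.etree.ElementTree as ET


def sra_uids_from_elink(xml_text):
    """Return SRA UIDs listed under LinkSetDb blocks whose DbTo is 'sra'."""
    root = ET.fromstring(xml_text)
    uids = []
    for linkset_db in root.iter("LinkSetDb"):
        if linkset_db.findtext("DbTo") != "sra":
            continue  # ignore links into other databases
        uids.extend(link.findtext("Id") for link in linkset_db.findall("Link"))
    return uids
```

An empty list means the GEO record has no SRA links, which is expected for array-based series.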
Command-line alternative using pysradb (a Python package):
# Convert GSE to SRP
pysradb gse-to-srp GSE12345
# Get the SRR run accessions for a sample
pysradb gsm-to-srr GSM12345
Pitfalls and Caveats
- E-utils return UIDs, not accessions: GEO search returns numeric UIDs. You must call ESummary to get the actual GSE/GDS accession numbers.
- Not all ENCODE experiments have GEO accessions: The dbxrefs field may be empty. ENCOD