ENA Database — European Nucleotide Archive Programmatic Access
Overview
The European Nucleotide Archive (ENA) is EMBL-EBI's comprehensive nucleotide sequence database, encompassing raw sequencing reads, genome assemblies, annotated sequences, and associated metadata. As a member of the INSDC, it exchanges records with GenBank and DDBJ. The APIs described here are REST endpoints and require no authentication.
When to Use
- Searching for sequencing studies, samples, or experiments by organism, project, or keyword
- Downloading raw FASTQ/BAM files for reanalysis of public sequencing datasets
- Retrieving genome assemblies with quality statistics (N50, contig count, genome size)
- Fetching nucleotide sequences in FASTA or EMBL flat-file format by accession
- Exploring taxonomic lineage and finding organisms by partial name
- Cross-referencing ENA records with external databases (ArrayExpress, UniProt, PDB)
- Building bulk download lists for large-scale sequencing projects
- For multi-database Python queries (ENA + UniProt + KEGG), prefer `bioservices` instead
- For NCBI-specific queries (PubMed literature, GenBank records), use `pubmed-database` or Biopython Entrez
Prerequisites
pip install requests
API constraints:
- Rate limit: 50 requests per second across all ENA APIs
- No authentication required
- Large result sets: use pagination (`limit` + `offset`) or streaming (`limit=0` for TSV download)
- Portal API base: https://www.ebi.ac.uk/ena/portal/api
- Browser API base: https://www.ebi.ac.uk/ena/browser/api
- Taxonomy API base: https://www.ebi.ac.uk/ena/taxonomy/rest
- Cross-ref API base: https://www.ebi.ac.uk/ena/xref/rest
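The `limit` + `offset` pagination mentioned above can be sketched as a small generator. This is a minimal sketch, not part of any ENA client: `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and it assumes that a page shorter than the requested size marks the end of the result set. The demo runs offline against a made-up list instead of the real API.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def paginate(fetch_page: Callable[[int, int], List[Record]],
             page_size: int = 1000) -> Iterator[Record]:
    """Yield records page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break  # assumption: a short page means there is nothing left
        offset += page_size


# Offline stand-in for a Portal API call (made-up records, demo only).
_FAKE_RUNS = [{"run_accession": f"RUN{i:03d}"} for i in range(7)]


def fake_fetch(limit: int, offset: int) -> List[Record]:
    return _FAKE_RUNS[offset:offset + limit]


records = list(paginate(fake_fetch, page_size=3))
print(len(records), records[0]["run_accession"], records[-1]["run_accession"])
```

Against the real Portal API, `fetch_page` would wrap a `search` request that passes `limit` and `offset` as query parameters and returns the parsed JSON list. How the service responds once `offset` runs past the last record is not covered here, so a real wrapper should treat an empty response body as an empty page.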
Quick Start
import requests
import time
BASE_PORTAL = "https://www.ebi.ac.uk/ena/portal/api"
BASE_BROWSER = "https://www.ebi.ac.uk/ena/browser/api"
BASE_TAXONOMY = "https://www.ebi.ac.uk/ena/taxonomy/rest"
BASE_XREF = "https://www.ebi.ac.uk/ena/xref/rest"
def ena_query(endpoint, params=None, base=BASE_PORTAL):
    """Reusable ENA API caller with rate-limit compliance."""
    # timeout keeps a stalled connection from hanging the script indefinitely
    resp = requests.get(f"{base}/{endpoint}", params=params, timeout=60)
    resp.raise_for_status()
    time.sleep(0.02)  # stay under the 50 requests/second limit
    return resp
# Search for human studies (tax_tree(9606) matches Homo sapiens and its subordinate taxa)
resp = ena_query("search", params={
"result": "study",
"query": 'tax_tree(9606)', # `library_strategy` is a `read_run`/`read_experiment` field, not a `study` field
"fields": "study_accession,study_title",
"format": "json",
"limit": 3,
})
studies = resp.json()
for s in studies:
    print(f"{s['study_accession']}: {s['study_title'][:60]}")
# PRJEB12345: Transcriptome analysis of human liver tissue...
Core API
Module 1: Portal API Search
The Portal API provides advanced metadata search across all ENA data types with boolean query syntax, field selection, and pagination.
# Search read runs for a specific study
resp = ena_query("search", params={
"result": "read_run",
"query": 'study_accession="PRJEB1787"',
"fields": "run_accession,sample_accession,instrument_model,read_count,base_count",
"format": "json",
"limit": 5,
})
runs = resp.json()
for r in runs:
    # read_count may be missing or an empty string, so fall back to 0
    print(f"{r['run_accession']} - {r.get('instrument_model', 'N/A')}, "
          f"{int(r.get('read_count') or 0):,} reads")
# ERR123456 - Illumina HiSeq 2000, 45,231,890 reads
# Count total results without fetching data
count_resp = ena_query("count", params={
"result": "read_run",
"query": 'study_accession="PRJEB1787"',
})
print(f"Total runs: {count_resp.text.strip()}")
# Total runs: 142
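Since Portal API queries combine conditions with boolean operators, a small helper can keep query strings readable. The sketch below is illustrative only: `build_query` is a hypothetical helper, not an ENA function, and it covers just `tax_tree(...)` plus `field="value"` conditions joined with `AND`. Values containing a double quote are rejected rather than escaped, because the escaping rules are not covered in this document.

```python
from typing import Mapping, Optional


def build_query(tax_id: Optional[int] = None,
                equals: Optional[Mapping[str, str]] = None) -> str:
    """Join a taxonomy filter and field="value" conditions with AND."""
    clauses = []
    if tax_id is not None:
        clauses.append(f"tax_tree({tax_id})")
    for field, value in (equals or {}).items():
        if '"' in value:
            raise ValueError(f"double quote not supported in value: {value!r}")
        clauses.append(f'{field}="{value}"')
    return " AND ".join(clauses)


# library_strategy is a read_run field, so pair this query with result=read_run.
query = build_query(tax_id=9606, equals={"library_strategy": "RNA-Seq"})
print(query)
# tax_tree(9606) AND library_strategy="RNA-Seq"
```

The resulting string can be passed as the `query` parameter of a `search` or `count` request.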
Module 2: Browser API Retrieval
Fetch individual records by accession in multiple formats: XML, FASTA, EMBL flat-file, or plain text.
# Retrieve XML metadata for a study
resp = ena_query("xml/PRJEB1787", base=BASE_BROWSER)
print(resp.text[:300])
# <?xml version="1.0" encoding="UTF-8"?><PROJECT_SET>...
# Retrieve FASTA sequence for a coding sequence
resp = ena_query("fasta/M10051.1", base=BASE_BROWSER)
print(resp.text[:200])
# >ENA|M10051|M10051.1 Human insulin mRNA, complete cds.
# AGCCCTCCAGGACAGGCTGCAT...
# Retrieve EMBL flat-file format
resp = ena_query("embl/M10051.1", base=BASE_BROWSER)
print(resp.text[:300])
# ID M10051; SV 1; linear; mRNA; STD; HUM; 786 BP.
# ...
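FASTA text returned by the Browser API usually needs to be split into headers and sequences before further use. The following is a minimal sketch using only the standard library; `parse_fasta` is a hypothetical helper, and the sample record is made up for the demo rather than taken from ENA.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Made-up two-line record standing in for resp.text from the fasta endpoint.
sample = ">demo_record made-up header\nACGT\nTTGA\n"
parsed = parse_fasta(sample)
for header, seq in parsed.items():
    print(f"{header}: {len(seq)} bp")
```

For anything beyond quick inspection, a dedicated parser such as Biopython's `SeqIO` is likely the safer choice.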
Module 3: File Reports and Downloads
Get download URLs for FASTQ, submitted, and analysis files. File reports return FTP and Aspera paths.
# Get FASTQ file URLs for specific runs
resp = ena_query("filereport", params={
"accession": "ERR000589",
"result": "read_run",
"fields": "run_accession,fastq_ftp,fastq_bytes,fastq_md5",
"format": "json",
})
files = resp.json()
for f in files:
    ftp_urls = f.get("fastq_ftp", "").split(";")
    sizes = f.get("fastq_bytes", "").split(";")
    for url, size in zip(ftp_urls, sizes):
        if url:
            print(f"ftp://{url} ({int(size)/1e6:.1f} MB)")
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_1.fastq.gz (234.5 MB)
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_2.fastq.gz (241.2 MB)
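Because file reports list URLs and MD5 checksums as semicolon-separated strings, a bulk download list can be built by pairing them up. This is a sketch under stated assumptions: `build_manifest` and `md5_matches` are hypothetical helpers, the `fastq_ftp` and `fastq_md5` values are assumed to be in the same order, and the demo record is made up so the example runs offline.

```python
import hashlib
from typing import Dict, List, Tuple


def build_manifest(reports: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each FASTQ URL with its MD5 from file-report records."""
    manifest = []
    for rec in reports:
        urls = [u for u in rec.get("fastq_ftp", "").split(";") if u]
        md5s = [m for m in rec.get("fastq_md5", "").split(";") if m]
        for url, md5 in zip(urls, md5s):
            manifest.append((f"ftp://{url}", md5))
    return manifest


def md5_matches(data: bytes, expected_md5: str) -> bool:
    """Return True when the MD5 of data equals the reported checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


# Made-up payloads and host, standing in for real FASTQ files.
payloads = {"a_1.fastq.gz": b"hello", "a_2.fastq.gz": b"world"}
demo_reports = [{
    "fastq_ftp": ";".join(f"host.example/{name}" for name in payloads),
    "fastq_md5": ";".join(hashlib.md5(d).hexdigest() for d in payloads.values()),
}]

manifest = build_manifest(demo_reports)
for url, md5 in manifest:
    print(url, md5)
```

Real FASTQ files are large, so hashing them in fixed-size chunks with `hashlib.md5().update(...)` would be preferable to loading a whole file into memory.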
Module 4: Taxonomy Queries
Look up organisms by taxonomy ID, scientific name, or partial name match.
# Lookup by taxonomy ID
resp = ena_query("tax-id/9606", base=BASE_TAXONOMY)
tax = resp.json()
print(f"{tax['scientificName']} (taxId: {tax['taxId']}, rank: {tax['rank']})")
# Homo sapiens (taxId: 9606, rank: species)
print(f"Lineage: {tax['lineage'][:80]}...")
# Search by scientific name — endpoint returns a list (one entry per matching taxon)
resp = ena_query("scientific-name/Arabidopsis thaliana", base=BASE_TAXONOMY)
matches = resp.json()
result = matches[0] if isinstance(matches, list) else matches
print(f"Tax ID: {result['taxId']}, Common: {result.get('commonName', 'N/A')}")
# Tax ID: 3702, Common: thale cress
# Suggest organisms by partial name
resp = ena_query("suggest-for-search/salmo", base=BASE_TAXONOMY)
suggestions = resp.json()
for s in suggestions[:3]:
    print(f" {s['scientificName']} (taxId: {s['taxId']})")
# Salmo salar (taxId: 8030)
# Salmo trutta (taxId: 8032)
# Salmonella enterica (taxId: 28901)
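The `lineage` value used above is a single string, so checking clade membership is easier once it is split into a list. The sketch below assumes the ranks are separated by semicolons, possibly with a trailing separator; `split_lineage` is a hypothetical helper and the lineage string is a shortened, made-up example.

```python
from typing import List


def split_lineage(lineage: str) -> List[str]:
    """Split a semicolon-separated lineage string into clean taxon names."""
    return [part.strip() for part in lineage.split(";") if part.strip()]


# Shortened example string; a real lineage has many more levels.
demo_lineage = "Eukaryota; Metazoa; Chordata; "
taxa = split_lineage(demo_lineage)
print(taxa)
print("Metazoa" in taxa)
```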
Module 5: Cross-Reference Service
Find links between ENA records and external databases (ArrayExpress, UniProt, PDB, etc.).
# Find cross-references for an ENA accession
resp = ena_query("json/search", base=BASE_XREF, params={
"accession": "M10051",
})
xrefs = resp.json()
for x in xrefs[:5]:
    print(f" {x['Source']} → {x['Source Primary Accession']} "
          f"({x.get('Source Description', '')[:50]})")
# UniProt → P01308 (Insulin precursor)
# PDB → 1A7F (Crystal structure of human insulin)
# Search cross-references by external database
resp = ena_query("json/search", base=BASE_XREF, params={
"source": "UniProt",
"accession": "P01308",
})
xrefs = resp.json()
for x in xrefs[:3]:
    print(f" ENA: {x['Target Primary Accession']} - {x.get('Target Description', '')[:60]}")
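When a record has many cross-references, grouping them by source database makes the result easier to scan. The following is a minimal sketch that relies only on the `Source` and `Source Primary Accession` keys used above; `group_by_source` is a hypothetical helper and the demo records are made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_source(xrefs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect external accessions under the database they come from."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for x in xrefs:
        grouped[x["Source"]].append(x["Source Primary Accession"])
    return dict(grouped)


# Made-up records in the same shape as the cross-reference JSON.
demo_xrefs = [
    {"Source": "DB_A", "Source Primary Accession": "ACC1"},
    {"Source": "DB_B", "Source Primary Accession": "ACC2"},
    {"Source": "DB_A", "Source Primary Accession": "ACC3"},
]

by_source = group_by_source(demo_xrefs)
for source, accessions in by_source.items():
    print(f"{source}: {', '.join(accessions)}")
```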
Module 6: CRAM Reference Registry
Retrieve reference sequences used in CRAM files by MD5 or SHA1 checksum. Essential for CRAM decompression.
# Look up reference by MD5 checksum
md5 = "aef131c3b4b05d8e2b3f907faba5af9b" # example
try:
    resp = ena_query(
        f"cram/md5/{md5}",
        base="https://www.ebi.ac.uk/ena/cram"
    )
    print(f"Reference found: {len(resp.content)} bytes")
except requests.HTTPError as e:
    if e.response.status_code == 404:
        print("Reference not found; check MD5 checksum")
    else:
        raise
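The checksum used for the lookup is normally taken from the `M5` tag of the CRAM or SAM header, but it can also be computed from a local reference sequence. The sketch below follows the usual convention of hashing the uppercased sequence with whitespace removed; `reference_md5` is a hypothetical helper, and dropping only whitespace is a simplification of the SAM specification, which strips every character outside the printable ASCII range.

```python
import hashlib


def reference_md5(sequence: str) -> str:
    """MD5 of a reference sequence: whitespace removed, uppercased."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Line breaks and case differences do not change the digest.
wrapped = "acgt\nacgt\n"
unwrapped = "ACGTACGT"
print(reference_md5(wrapped) == reference_md5(unwrapped))
# True
```

A digest computed this way can be substituted for the example `md5` value in the lookup above.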
Key Concepts
ENA Data Hierarchy
| Level | Accession Prefix | Description | Contains |
|---|---|---|---|
| Study | PRJEB/ERP | Research project | Samples, Experiments |
| Sample | ERS/SAMEA | Biological sample | Metadata, taxonomy |
| Experiment | ERX | Library/sequencing setup | Runs |
| Run | ERR | Sequencing run | Raw read files (FASTQ) |
| Analysis | ERZ | Derived analysis | Assemblies, alignments |
| Assembly |