ENA Database — European Nucleotide Archive Programmatic Access

Overview

The European Nucleotide Archive (ENA) is EMBL-EBI's comprehensive nucleotide sequence database, encompassing raw sequencing reads, genome assemblies, annotated sequences, and associated metadata. It mirrors and extends INSDC data (GenBank, DDBJ). The programmatic interfaces covered here are REST APIs that require no authentication.

When to Use

Searching for sequencing studies, samples, or experiments by organism, project, or keyword
Downloading raw FASTQ/BAM files for reanalysis of public sequencing datasets
Retrieving genome assemblies with quality statistics (N50, contig count, genome size)
Fetching nucleotide sequences in FASTA or EMBL flat-file format by accession
Exploring taxonomic lineage and finding organisms by partial name
Cross-referencing ENA records with external databases (ArrayExpress, UniProt, PDB)
Building bulk download lists for large-scale sequencing projects
For multi-database Python queries (ENA + UniProt + KEGG), prefer bioservices instead
For NCBI-specific queries (PubMed literature, GenBank records), use pubmed-database or Biopython Entrez

Prerequisites

pip install requests

API constraints:

Rate limit: 50 requests per second across all ENA APIs
No authentication required
Large result sets: use pagination (limit + offset) or streaming (limit=0 for TSV download)
Portal API base: https://www.ebi.ac.uk/ena/portal/api
Browser API base: https://www.ebi.ac.uk/ena/browser/api
Taxonomy API base: https://www.ebi.ac.uk/ena/taxonomy/rest
Cross-ref API base: https://www.ebi.ac.uk/ena/xref/rest
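
The limit + offset pagination mentioned above can be sketched as a small generator. This is a minimal sketch, not part of any ENA client: `iter_pages` and `fetch_page` are hypothetical names, and the fetch function is injected so the loop can be exercised offline with fake data.

```python
from typing import Callable, Dict, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 1000,
) -> Iterator[Dict]:
    """Yield records page by page until a short or empty page is returned.

    fetch_page(limit, offset) is a caller-supplied (hypothetical) function
    that returns one page of records as a list of dicts.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:  # a short page means the last page was reached
            break
        offset += page_size


# Offline stand-in for the API: 25 fake records served in slices.
_FAKE = [{"run_accession": f"RUN{i:03d}"} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> List[Dict]:
    return _FAKE[offset:offset + limit]


records = list(iter_pages(fake_fetch, page_size=10))
print(len(records))
```

Against the live API, `fetch_page` would presumably wrap a Portal API search call that passes `limit` and `offset` as query parameters and returns the decoded JSON list.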

Quick Start

import requests
import time

BASE_PORTAL = "https://www.ebi.ac.uk/ena/portal/api"
BASE_BROWSER = "https://www.ebi.ac.uk/ena/browser/api"
BASE_TAXONOMY = "https://www.ebi.ac.uk/ena/taxonomy/rest"
BASE_XREF = "https://www.ebi.ac.uk/ena/xref/rest"

def ena_query(endpoint, params=None, base=BASE_PORTAL, timeout=30):
    """Reusable ENA API caller with rate-limit compliance."""
    # A timeout keeps a stalled connection from hanging the script.
    resp = requests.get(f"{base}/{endpoint}", params=params, timeout=timeout)
    resp.raise_for_status()
    time.sleep(0.02)  # stay under the 50 req/sec limit
    return resp

# Search for human studies (taxon 9606 and its descendants).
# To restrict to RNA-seq, search result="read_run" or "read_experiment" instead:
# `library_strategy` is a field of those result types, not of `study`.
resp = ena_query("search", params={
    "result": "study",
    "query": "tax_tree(9606)",
    "fields": "study_accession,study_title",
    "format": "json",
    "limit": 3,
})
studies = resp.json()
for s in studies:
    print(f"{s['study_accession']}: {s['study_title'][:60]}")
# PRJEB12345: Transcriptome analysis of human liver tissue...

Core API

Module 1: Portal API Search

The Portal API provides advanced metadata search across all ENA data types with boolean query syntax, field selection, and pagination.

# Search read runs for a specific study
resp = ena_query("search", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
    "fields": "run_accession,sample_accession,instrument_model,read_count,base_count",
    "format": "json",
    "limit": 5,
})
runs = resp.json()
for r in runs:
    print(f"{r['run_accession']} — {r.get('instrument_model', 'N/A')}, "
          f"{int(r.get('read_count') or 0):,} reads")  # `or 0` also covers empty-string values
# ERR123456 — Illumina HiSeq 2000, 45,231,890 reads

# Count total results without fetching data
count_resp = ena_query("count", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
})
print(f"Total runs: {count_resp.text.strip()}")
# Total runs: 142
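
The boolean query syntax can be assembled programmatically instead of by hand-written strings. The sketch below is illustrative only: `build_query` and `eq` are hypothetical helpers, and the clause shapes (`tax_tree(...)`, `field="value"`, joined by `AND`/`OR`) follow the queries shown in this document.

```python
def eq(field: str, value: str) -> str:
    """Format a field="value" clause; values containing quotes are rejected."""
    if '"' in value:
        raise ValueError("value must not contain double quotes")
    return f'{field}="{value}"'


def build_query(*clauses: str, operator: str = "AND") -> str:
    """Join non-empty query clauses with a single boolean operator."""
    if operator not in {"AND", "OR"}:
        raise ValueError(f"unsupported operator: {operator}")
    parts = [c.strip() for c in clauses if c and c.strip()]
    if not parts:
        raise ValueError("at least one clause is required")
    return f" {operator} ".join(parts)


# library_strategy is a read_run / read_experiment field, so a query like
# this one would be used with result="read_run".
query = build_query("tax_tree(9606)", eq("library_strategy", "RNA-Seq"))
print(query)
```

The resulting string can be passed as the `query` parameter of a search or count request.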

Module 2: Browser API Retrieval

Fetch individual records by accession in multiple formats: XML, FASTA, EMBL flat-file, or plain text.

# Retrieve XML metadata for a study
resp = ena_query("xml/PRJEB1787", base=BASE_BROWSER)
print(resp.text[:300])
# <?xml version="1.0" encoding="UTF-8"?><PROJECT_SET>...

# Retrieve FASTA sequence for a coding sequence
resp = ena_query("fasta/M10051.1", base=BASE_BROWSER)
print(resp.text[:200])
# >ENA|M10051|M10051.1 Human insulin mRNA, complete cds.
# AGCCCTCCAGGACAGGCTGCAT...

# Retrieve EMBL flat-file format
resp = ena_query("embl/M10051.1", base=BASE_BROWSER)
print(resp.text[:300])
# ID   M10051; SV 1; linear; mRNA; STD; HUM; 786 BP.
# ...
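
FASTA responses come back as plain text, so a small parser is often the next step. The following is a minimal sketch with a hypothetical `parse_fasta` helper; the sample records are made up for illustration and are not real ENA entries.

```python
from typing import Dict, List, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; headers exclude the '>'."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # close the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)  # close the final record
    return records


# Made-up two-record sample in the same header style as the output above.
sample = (
    ">ENA|X00001|X00001.1 example record\nACGT\nTTGA\n"
    ">ENA|X00002|X00002.1 second record\nGG\n"
)
parsed = parse_fasta(sample)
print(len(parsed))
```

For heavier use, an established parser such as Biopython's `SeqIO` is likely a better choice than this hand-rolled version.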

Module 3: File Reports and Downloads

Get download URLs for FASTQ, submitted, and analysis files. File reports return FTP and Aspera paths.

# Get FASTQ file URLs for specific runs
resp = ena_query("filereport", params={
    "accession": "ERR000589",
    "result": "read_run",
    "fields": "run_accession,fastq_ftp,fastq_bytes,fastq_md5",
    "format": "json",
})
files = resp.json()
for f in files:
    ftp_urls = f.get("fastq_ftp", "").split(";")
    sizes = f.get("fastq_bytes", "").split(";")
    for url, size in zip(ftp_urls, sizes):
        if url:
            print(f"ftp://{url}  ({int(size)/1e6:.1f} MB)")
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_1.fastq.gz  (234.5 MB)
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_2.fastq.gz  (241.2 MB)
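
Because the report also returns `fastq_md5`, downloaded files can be checked for corruption. The sketch below pairs each URL with its checksum and verifies a local file; `split_file_fields` and `md5_matches` are hypothetical helper names, and the checksums and sizes in the sample row are placeholders, not real values.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def split_file_fields(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one file-report row into one dict per FASTQ file."""
    urls = [u for u in row.get("fastq_ftp", "").split(";") if u]
    md5s = row.get("fastq_md5", "").split(";")
    sizes = row.get("fastq_bytes", "").split(";")
    return [
        {"url": f"ftp://{u}", "md5": m, "bytes": s}
        for u, m, s in zip(urls, md5s, sizes)
    ]


def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its MD5 to the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Illustrative row shaped like the report above (placeholder md5/bytes values).
row = {
    "run_accession": "ERR000589",
    "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_1.fastq.gz;"
                 "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_2.fastq.gz",
    "fastq_md5": "0" * 32 + ";" + "1" * 32,
    "fastq_bytes": "100;200",
}
files = split_file_fields(row)

# Offline check of the checksum helper on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    tmp_path = tmp.name
ok = md5_matches(tmp_path, hashlib.md5(b"example payload").hexdigest())
os.remove(tmp_path)
print(len(files), ok)
```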

Module 4: Taxonomy Queries

Look up organisms by taxonomy ID, scientific name, or partial name match.

# Lookup by taxonomy ID
resp = ena_query("tax-id/9606", base=BASE_TAXONOMY)
tax = resp.json()
print(f"{tax['scientificName']} (taxId: {tax['taxId']}, rank: {tax['rank']})")
# Homo sapiens (taxId: 9606, rank: species)
print(f"Lineage: {tax['lineage'][:80]}...")

# Search by scientific name — endpoint returns a list (one entry per matching taxon)
resp = ena_query("scientific-name/Arabidopsis thaliana", base=BASE_TAXONOMY)
matches = resp.json()
result = matches[0] if isinstance(matches, list) else matches
print(f"Tax ID: {result['taxId']}, Common: {result.get('commonName', 'N/A')}")
# Tax ID: 3702, Common: thale cress

# Suggest organisms by partial name
resp = ena_query("suggest-for-search/salmo", base=BASE_TAXONOMY)
suggestions = resp.json()
for s in suggestions[:3]:
    print(f"  {s['scientificName']} (taxId: {s['taxId']})")
# Salmo salar (taxId: 8030)
# Salmo trutta (taxId: 8032)
# Salmonella enterica (taxId: 28901)
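
The `lineage` value printed above is a single string. Assuming it is a semicolon-separated list of taxon names (an assumption based on typical responses, not verified here), it can be split into a list for membership tests. `parse_lineage` and `is_within` are hypothetical helpers, and the sample string is abbreviated for illustration.

```python
from typing import List


def parse_lineage(lineage: str) -> List[str]:
    """Split a semicolon-separated lineage string into clean taxon names."""
    return [name.strip() for name in lineage.split(";") if name.strip()]


def is_within(lineage: str, clade: str) -> bool:
    """Return True if the clade name appears in the lineage (case-insensitive)."""
    return clade.lower() in (name.lower() for name in parse_lineage(lineage))


# Abbreviated, illustrative lineage string with a trailing separator.
sample_lineage = "Eukaryota; Metazoa; Chordata; Mammalia; Primates; "
names = parse_lineage(sample_lineage)
print(names)
```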

Module 5: Cross-Reference Service

Find links between ENA records and external databases (ArrayExpress, UniProt, PDB, etc.).

# Find cross-references for an ENA accession
resp = ena_query("json/search", base=BASE_XREF, params={
    "accession": "M10051",
})
xrefs = resp.json()
for x in xrefs[:5]:
    print(f"  {x['Source']} → {x['Source Primary Accession']} "
          f"({x.get('Source Description', '')[:50]})")
# UniProt → P01308 (Insulin precursor)
# PDB → 1A7F (Crystal structure of human insulin)

# Search cross-references by external database
resp = ena_query("json/search", base=BASE_XREF, params={
    "source": "UniProt",
    "accession": "P01308",
})
xrefs = resp.json()
for x in xrefs[:3]:
    print(f"  ENA: {x['Target Primary Accession']} — {x.get('Target Description', '')[:60]}")

Module 6: CRAM Reference Registry

Retrieve reference sequences used in CRAM files by MD5 or SHA1 checksum. Essential for CRAM decompression.

# Look up reference by MD5 checksum
md5 = "aef131c3b4b05d8e2b3f907faba5af9b"  # example
try:
    resp = ena_query(
        f"md5/{md5}",  # base already ends in /cram, giving /ena/cram/md5/<md5>
        base="https://www.ebi.ac.uk/ena/cram"
    )
    print(f"Reference found: {len(resp.content)} bytes")
except requests.HTTPError as e:
    if e.response.status_code == 404:
        print("Reference not found — check MD5 checksum")
    else:
        raise
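
A retrieved reference can be checked against the checksum that was requested. The sketch below follows the SAM/CRAM convention for the M5 checksum as I understand it: characters outside the printable range 33 to 126 are dropped and letters are uppercased before hashing. It assumes the registry returns the bare sequence bytes; `reference_md5` is a hypothetical helper name.

```python
import hashlib


def reference_md5(sequence: bytes) -> str:
    """MD5 of a reference sequence after M5-style normalisation."""
    # Keep only printable, non-space ASCII (33..126), then uppercase.
    cleaned = bytes(b for b in sequence if 33 <= b <= 126).upper()
    return hashlib.md5(cleaned).hexdigest()


def reference_is_valid(sequence: bytes, expected_md5: str) -> bool:
    """Compare the normalised checksum with the one that was looked up."""
    return reference_md5(sequence) == expected_md5.lower()


# Offline demonstration: line breaks and case do not change the checksum.
wrapped = b"acgt\nACGT\n"
flat = b"ACGTACGT"
print(reference_md5(wrapped) == reference_md5(flat))
```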

Key Concepts

ENA Data Hierarchy

Level	Accession Prefix	Description	Contains
Study	PRJEB/ERP	Research project	Samples, Experiments
Sample	ERS/SAMEA	Biological sample	Metadata, taxonomy
Experiment	ERX	Library/sequencing setup	Runs
Run	ERR	Sequencing run	Raw read files (FASTQ)
Analysis	ERZ	Derived analysis	Assemblies, alignments
Assembly
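
The prefixes in the table can be used to route an accession to the right query. This is a rough sketch covering only the prefixes listed above, with the assumption that each prefix is followed by digits; `classify_accession` is a hypothetical helper.

```python
import re
from typing import Optional

# Patterns derived from the prefixes in the table above.
_ACCESSION_PATTERNS = [
    ("study", re.compile(r"^(PRJEB|ERP)\d+$")),
    ("sample", re.compile(r"^(ERS|SAMEA)\d+$")),
    ("experiment", re.compile(r"^ERX\d+$")),
    ("run", re.compile(r"^ERR\d+$")),
    ("analysis", re.compile(r"^ERZ\d+$")),
]


def classify_accession(accession: str) -> Optional[str]:
    """Return the hierarchy level for an accession, or None if unrecognised."""
    acc = accession.strip().upper()
    for level, pattern in _ACCESSION_PATTERNS:
        if pattern.match(acc):
            return level
    return None


for acc in ("PRJEB1787", "ERR000589", "M10051"):
    print(acc, classify_accession(acc))
```

Accessions that do not match, such as sequence accessions, return `None` and need separate handling.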
