UniChem Database
Overview
UniChem is a chemical structure cross-referencing service from EMBL-EBI that links compound records across 20+ public chemistry databases using InChI-based identifiers. It maps a single chemical entity to its corresponding IDs in ChEMBL, DrugBank, PubChem, ChEBI, PDB (RCSB and PDBe), SureChEMBL, HMDB, DrugCentral, BindingDB, and others. Access is via a free REST API at https://www.ebi.ac.uk/unichem/api/v1/ - no API key required. Important: every cross-reference query is sent as POST with a JSON body; only the catalogue endpoint GET /sources is implemented as a GET.
When to Use
- Translating a ChEMBL compound ID to a PubChem CID, DrugBank accession, or ChEBI ID for cross-database analysis
- Resolving an InChIKey to all database sources where a compound appears
- Finding all structurally related compounds (same connectivity, different stereochemistry/salts) across databases using connectivity search
- Validating compound identity across sources before merging datasets from multiple databases
- Building a compound cross-reference table for a drug discovery project (linking bioactivity data in ChEMBL to structural data in PDB)
- Checking if a synthesized compound or a vendor compound exists in any public database by InChIKey
- For full bioactivity profiles (IC50, Ki) use chembl-database-bioactivity; UniChem provides only ID cross-references, not experimental data
- For compound property prediction or substructure searching use pubchem-compound-search; UniChem is for identifier translation only
Prerequisites
- Python packages: requests, pandas, matplotlib
- Data requirements: compound InChIKeys (standard 27-character XXXXXXXXXXXXXX-XXXXXXXXXX-X), source-specific IDs (e.g. CHEMBL25), or PubChem CIDs as starting points
- Environment: internet connection; no API key required
- Rate limits: ~10 requests/second; add time.sleep(0.1) between requests in batch loops; no daily quota
pip install requests pandas matplotlib
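The InChIKey layout and the rate-limit guideline listed above can both be enforced before any request is sent. The following is a minimal sketch, not part of UniChem: `is_inchikey` and `throttled_lookup` are hypothetical helper names, and `lookup` is any caller-supplied function. The regex checks only the 14-10-1 uppercase-letter layout, not whether the key corresponds to a real structure.

```python
import re
import time
from typing import Callable, Iterable

# Layout check only: 14 letters, 10 letters, 1 letter (27 characters in total).
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_inchikey(value: str) -> bool:
    """Return True if value has the standard 27-character InChIKey layout."""
    return bool(INCHIKEY_RE.fullmatch(value))

def throttled_lookup(keys: Iterable[str], lookup: Callable[[str], object],
                     delay: float = 0.1) -> dict:
    """Call lookup(key) for each well-formed key, pausing between calls.

    Malformed keys are skipped and mapped to None so the caller can report them.
    """
    results = {}
    first = True
    for key in keys:
        if not is_inchikey(key):
            results[key] = None
            continue
        if not first:
            time.sleep(delay)  # stay near the ~10 requests/second guideline
        first = False
        results[key] = lookup(key)
    return results
```

Any function that takes a single InChIKey and performs the POST request can be passed as `lookup`; keeping the network call outside the helper makes the throttling logic easy to test with a stub.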
Quick Start
The UniChem /compounds endpoint is POST-only; a GET returns 405 Method Not Allowed. Submit a JSON body {"type": "inchikey", "compound": KEY} and read the per-database hits from compounds[0]["sources"]. Each source record carries an id (the numeric UniChem source ID) and a compoundId (the compound's identifier in that database).
import requests

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

def unichem_post(endpoint: str, body: dict) -> dict:
    """POST request to UniChem API; raise on HTTP errors."""
    r = requests.post(f"{UNICHEM_API}/{endpoint}", json=body, timeout=20)
    r.raise_for_status()
    return r.json()

# Find all database sources for aspirin by InChIKey
inchikey = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # aspirin
result = unichem_post("compounds", {"type": "inchikey", "compound": inchikey})
compounds = result.get("compounds", [])
print(f"Found {len(compounds)} compound record(s) for {inchikey}")
if compounds:
    sources = compounds[0].get("sources", [])
    print(f" Present in {len(sources)} database records")
    seen = set()
    for src in sources:
        # Print each source database only once
        if src["id"] in seen:
            continue
        seen.add(src["id"])
        print(f" source id={src['id']:>3} ({src['shortName']:>12}): {src['compoundId']}")
        if len(seen) >= 5:
            break
# Found 1 compound record(s) for BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#  Present in many database records
#  source id=  1 (      chembl): CHEMBL25
#  source id=  2 (    drugbank): DB00945
#  source id=  3 (    rcsb_pdb): AIN
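The response shape described above can also be explored offline. The sketch below groups source records by shortName. The `sample` dictionary is a hand-written illustration that contains only the fields named in this section (id, shortName, compoundId); it is not a captured API response, and `group_sources` is a hypothetical helper.

```python
from collections import defaultdict

def group_sources(response: dict) -> dict:
    """Map each source shortName to the list of compound IDs it holds.

    Expects the /compounds response layout described above: a "compounds"
    list whose first entry has a "sources" list.
    """
    compounds = response.get("compounds", [])
    if not compounds:
        return {}
    grouped = defaultdict(list)
    for src in compounds[0].get("sources", []):
        # Fall back to the numeric source ID if shortName is absent
        name = src.get("shortName", str(src["id"]))
        grouped[name].append(src["compoundId"])
    return dict(grouped)

# Hand-written sample (illustrative only; real responses contain more fields).
sample = {
    "compounds": [{
        "sources": [
            {"id": 1, "shortName": "chembl", "compoundId": "CHEMBL25"},
            {"id": 2, "shortName": "drugbank", "compoundId": "DB00945"},
            {"id": 3, "shortName": "rcsb_pdb", "compoundId": "AIN"},
        ]
    }]
}

print(group_sources(sample))
# {'chembl': ['CHEMBL25'], 'drugbank': ['DB00945'], 'rcsb_pdb': ['AIN']}
```

Grouping into lists rather than single values keeps every record when one database holds several entries for the same structure.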
Core API
Query 1: InChIKey Lookup - All Sources
Search for a compound by its standard InChIKey and retrieve all database records. This is the primary cross-reference method. The endpoint is POST /compounds; if the key is found, the response carries one entry in compounds, with a sources list whose records use id (the source database) and compoundId (the ID in that database).
import requests
import pandas as pd

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

# Common source IDs (verify with the /sources endpoint - see Query 4)
SOURCE_NAMES = {
    1: "ChEMBL", 2: "DrugBank", 3: "RCSB PDB", 4: "GtoPdb", 5: "PDBe",
    7: "ChEBI", 14: "FDA SRS", 15: "SureChEMBL", 18: "HMDB", 22: "PubChem",
    31: "BindingDB", 32: "CompTox", 33: "LIPID MAPS", 34: "DrugCentral",
    37: "BRENDA", 38: "Rhea", 41: "SwissLipids", 49: "Probes-and-Drugs",
}

# Fixed column set so that an empty result still has the expected columns
COLUMNS = ["source_id", "source_name", "compound_id", "url"]

def lookup_by_inchikey(inchikey: str) -> pd.DataFrame:
    """Return all database cross-references for an InChIKey."""
    r = requests.post(f"{UNICHEM_API}/compounds",
                      json={"type": "inchikey", "compound": inchikey}, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return pd.DataFrame(columns=COLUMNS)
    rows = []
    for src in compounds[0].get("sources", []):
        rows.append({
            "source_id": src["id"],
            "source_name": SOURCE_NAMES.get(src["id"], src.get("shortName", "")),
            "compound_id": src["compoundId"],
            "url": src.get("url", ""),
        })
    return pd.DataFrame(rows, columns=COLUMNS).sort_values(["source_id", "compound_id"])

# Triclosan cross-references
df = lookup_by_inchikey("XEFQLINVKFYRCS-UHFFFAOYSA-N")
print(f"Triclosan found in {df['source_id'].nunique()} distinct databases ({len(df)} records):")
print(df[["source_name", "compound_id"]].head(8).to_string(index=False))
# Triclosan found in 16 distinct databases
# ChEMBL CHEMBL849
# DrugBank DB08604
# RCSB PDB TCL
# ChEBI CHEBI:164200
# Extract specific source IDs from cross-reference table
def get_id_for_source(inchikey: str, source_id: int) -> str | None:
    """Return the compound ID in a specific database, or None if not found."""
    r = requests.post(f"{UNICHEM_API}/compounds",
                      json={"type": "inchikey", "compound": inchikey}, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return None
    for src in compounds[0].get("sources", []):
        if src["id"] == source_id:
            return src["compoundId"]
    return None

triclosan = "XEFQLINVKFYRCS-UHFFFAOYSA-N"
chembl_id = get_id_for_source(triclosan, source_id=1)     # ChEMBL
pubchem_id = get_id_for_source(triclosan, source_id=22)   # PubChem
drugbank_id = get_id_for_source(triclosan, source_id=2)   # DrugBank
print(f"Triclosan: ChEMBL={chembl_id}, PubChem={pubchem_id}, DrugBank={drugbank_id}")
# Triclosan: ChEMBL=CHEMBL849, PubChem=5564, DrugBank=DB08604
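Note that get_id_for_source sends one POST per call, so the example above issues three requests for the same compound. When several IDs are needed, it may be cheaper to fetch the sources list once and extract from it locally. A minimal sketch, assuming a hypothetical helper `pick_ids` that works on an already-fetched sources list; the sample records are hand-written and reuse the triclosan IDs shown above.

```python
def pick_ids(sources: list, wanted: dict) -> dict:
    """Return {label: compoundId or None} for each requested UniChem source ID.

    sources: the "sources" list from a /compounds response.
    wanted:  mapping of label -> numeric source ID, e.g. {"ChEMBL": 1}.
    The first record found for a source wins; later duplicates are ignored.
    """
    first_by_source = {}
    for src in sources:
        first_by_source.setdefault(src["id"], src["compoundId"])
    return {label: first_by_source.get(sid) for label, sid in wanted.items()}

# Illustrative records only (not a captured API response).
sample_sources = [
    {"id": 1, "compoundId": "CHEMBL849"},
    {"id": 2, "compoundId": "DB08604"},
    {"id": 22, "compoundId": "5564"},
]
ids = pick_ids(sample_sources, {"ChEMBL": 1, "PubChem": 22, "DrugBank": 2, "ChEBI": 7})
print(ids)
# {'ChEMBL': 'CHEMBL849', 'PubChem': '5564', 'DrugBank': 'DB08604', 'ChEBI': None}
```

Sources that are absent from the list map to None, which mirrors the return convention of get_id_for_source.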
Query 2: Compound Lookup by Source-Specific ID
Given a known compound ID in a specific source database (e.g., a ChEMBL ID), retrieve all of its cross-references. Set type to sourceID, put the compound's ID from that database in compound, and pass the numeric UniChem source ID as sourceID in the body. The response has the same shape as the InChIKey lookup.
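The two request bodies used in this document differ only in the type value and the extra sourceID key. The sketch below builds either one and rejects inconsistent arguments. `build_body` is a hypothetical helper, only the two type values that appear in this document are produced, and the check that source IDs are positive integers is an assumption based on the IDs listed in Query 1.

```python
from typing import Optional

def build_body(compound: str, source_id: Optional[int] = None) -> dict:
    """Build the JSON body for POST /compounds.

    Without source_id, compound is treated as an InChIKey.
    With source_id, compound is treated as the ID used by that source database.
    """
    if not compound:
        raise ValueError("compound must be a non-empty string")
    if source_id is None:
        return {"type": "inchikey", "compound": compound}
    if source_id < 1:
        # Assumption: UniChem source IDs are positive integers
        raise ValueError("source_id must be a positive UniChem source ID")
    return {"type": "sourceID", "compound": compound, "sourceID": source_id}

print(build_body("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
# {'type': 'inchikey', 'compound': 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'}
print(build_body("CHEMBL192", source_id=1))
# {'type': 'sourceID', 'compound': 'CHEMBL192', 'sourceID': 1}
```

The function-based lookup for the sourceID case follows.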
import requests

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

def get_sources_for_compound(compound_id: str, source_id: int) -> list:
    """Get all database cross-references for a compound identified in a specific source.

    Args:
        compound_id: The ID in the source database (e.g., CHEMBL192)
        source_id: UniChem source ID (1=ChEMBL, 2=DrugBank, 22=PubChem, 7=ChEBI)
    """
    body = {"type": "sourceID", "compound": compound_id, "sourceID": source_id}
    r = requests.post(f"{UNICHEM_API}/compounds", json=body, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return []
    return compounds[0].get("sources", [])

# Sildenafil (Viagra): look up starting from ChEMBL ID
sources = get_sources_for_compound("CHEMBL192", source_id=1)
distinct_dbs = {s["id"] for s in sources}
print(f"Sildenafil (CHEMBL192): {len(sources)} source records across {len(distinct_dbs)} databases")
seen = set()
for s in sources:
    if s["id"] in seen:
        continue
    seen.add(s["id"])
    print(f" [{s['id']:>3}] {s['shortName']:>15}: {s['compoundId']}")
    if len(seen) >= 8:
brea