UniChem Database

Overview

UniChem is a chemical structure cross-referencing service from EMBL-EBI that links compound records across 20+ public chemistry databases using InChI-based identifiers. It maps a single chemical entity to its corresponding IDs in ChEMBL, DrugBank, PubChem, ChEBI, PDB (RCSB and PDBe), SureChEMBL, HMDB, DrugCentral, BindingDB, and others. Access is via a free REST API at https://www.ebi.ac.uk/unichem/api/v1/ - no API key required. Important: every cross-reference query is sent as POST with a JSON body; only the catalogue endpoint GET /sources is implemented as a GET.

When to Use

Translating a ChEMBL compound ID to a PubChem CID, DrugBank accession, or ChEBI ID for cross-database analysis
Resolving an InChIKey to all database sources where a compound appears
Finding all structurally related compounds (same connectivity, different stereochemistry/salts) across databases using connectivity search
Validating compound identity across sources before merging datasets from multiple databases
Building a compound cross-reference table for a drug discovery project (linking bioactivity data in ChEMBL to structural data in PDB)
Checking if a synthesized compound or a vendor compound exists in any public database by InChIKey
For full bioactivity profiles (IC50, Ki) use chembl-database-bioactivity; UniChem provides only ID cross-references, not experimental data
For compound property prediction or substructure searching use pubchem-compound-search; UniChem is for identifier translation only
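
As a local complement to the connectivity search mentioned above, the sketch below groups InChIKeys by their first 14-character block, which hashes the connectivity layers of the InChI. It rests on the assumption that a shared first block is a good enough proxy for "same skeleton". It will not link salts or mixtures to their parent compound, which the UniChem service itself is described as handling. The function name group_by_connectivity and the placeholder keys are hypothetical.

```python
from collections import defaultdict

def group_by_connectivity(inchikeys):
    """Group InChIKeys by their first block (the connectivity hash)."""
    groups = defaultdict(list)
    for key in inchikeys:
        # The first hyphen-separated block is the 14-character connectivity hash.
        groups[key.split("-")[0]].append(key)
    return dict(groups)

# Placeholder keys (NOT real compounds) that share a first block,
# plus the aspirin key used elsewhere in this document.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
]
for block, members in group_by_connectivity(keys).items():
    print(block, len(members))
```

Groups with more than one member are candidates for stereo or isotope variants of one skeleton; confirm them against UniChem before merging records.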

Prerequisites

Python packages: requests, pandas, matplotlib
Data requirements: compound InChIKeys (standard 27-character XXXXXXXXXXXXXX-XXXXXXXXXX-X), source-specific IDs (e.g. CHEMBL25), or PubChem CIDs as starting points
Environment: internet connection; no API key required
Rate limits: ~10 requests/second; add time.sleep(0.1) between requests in batch loops; no daily quota

pip install requests pandas matplotlib
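
Because malformed keys waste requests, it can help to check the 27-character layout before calling the API. The sketch below is a format check only: it does not verify that the key was correctly derived from a structure. The helper name is_valid_inchikey is hypothetical.

```python
import re

# Layout from the prerequisites: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_valid_inchikey(key: str) -> bool:
    """Return True if `key` has the 27-character InChIKey layout."""
    return INCHIKEY_PATTERN.fullmatch(key.strip()) is not None

print(is_valid_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))  # aspirin key from this document
print(is_valid_inchikey("CHEMBL25"))                     # a source ID, not an InChIKey
```

The check is deliberately strict about case; normalise input with .upper() first if your data may contain lowercase keys.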

Quick Start

The UniChem /compounds endpoint is POST-only - GET returns 405 Method Not Allowed. Submit a JSON body {"type": "inchikey", "compound": KEY} and read per-database hits from compounds[0]["sources"]. Each source record carries an id (the numeric UniChem source ID of the database), a shortName (the database's short label), and a compoundId (the compound's ID in that database).

import requests

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

def unichem_post(endpoint: str, body: dict) -> dict:
    """POST request to UniChem API; raise on HTTP errors."""
    r = requests.post(f"{UNICHEM_API}/{endpoint}", json=body, timeout=20)
    r.raise_for_status()
    return r.json()

# Find all database sources for aspirin by InChIKey
inchikey = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # aspirin
result = unichem_post("compounds", {"type": "inchikey", "compound": inchikey})
compounds = result.get("compounds", [])
print(f"Found {len(compounds)} compound record(s) for {inchikey}")
if compounds:
    sources = compounds[0].get("sources", [])
    print(f"  Present in {len(sources)} database records")
    seen = set()
    for src in sources:
        if src["id"] in seen:
            continue
        seen.add(src["id"])
        print(f"  source id={src['id']:>3} ({src['shortName']:>12}): {src['compoundId']}")
        if len(seen) >= 5:
            break
# Found 1 compound record(s) for BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#   Present in many database records
#   source id=  1 (      chembl): CHEMBL25
#   source id=  2 (    drugbank): DB00945
#   source id=  3 (    rcsb_pdb): AIN
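
The Quick Start loop prints only the first record per source. When a source lists several records for the same compound, it may be more useful to keep them all. The sketch below is one way to do that, assuming the response shape described above (compounds, sources, shortName, compoundId). The helper name ids_by_source is hypothetical, and the sample dictionary is a hand-written stand-in built from the values in the Quick Start output, not a captured API response.

```python
from collections import defaultdict

def ids_by_source(response: dict) -> dict:
    """Map each source shortName to the sorted list of its compound IDs."""
    grouped = defaultdict(set)
    for compound in response.get("compounds", []):
        for src in compound.get("sources", []):
            grouped[src["shortName"]].add(src["compoundId"])
    return {name: sorted(ids) for name, ids in grouped.items()}

# Hand-written stand-in for a /compounds response (assumed shape).
sample = {"compounds": [{"sources": [
    {"id": 1, "shortName": "chembl", "compoundId": "CHEMBL25"},
    {"id": 2, "shortName": "drugbank", "compoundId": "DB00945"},
    {"id": 3, "shortName": "rcsb_pdb", "compoundId": "AIN"},
]}]}
print(ids_by_source(sample))
```

The function works on an already-parsed response, so it can be applied to the return value of unichem_post without further requests.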

Core API

Query 1: InChIKey Lookup - All Sources

Search for a compound by its standard InChIKey and retrieve all database records. This is the primary cross-reference method. The endpoint is POST /compounds. If the compound is found, the compounds list in the response holds one entry with a sources list, and each source record uses id (the source database) and compoundId (the ID in that database). If nothing matches, compounds is empty.

import requests, pandas as pd

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

# Common source IDs (verify with the /sources endpoint - see Query 4)
SOURCE_NAMES = {
    1: "ChEMBL", 2: "DrugBank", 3: "RCSB PDB", 4: "GtoPdb", 5: "PDBe",
    7: "ChEBI", 14: "FDA SRS", 15: "SureChEMBL", 18: "HMDB", 22: "PubChem",
    31: "BindingDB", 32: "CompTox", 33: "LIPID MAPS", 34: "DrugCentral",
    37: "BRENDA", 38: "Rhea", 41: "SwissLipids", 49: "Probes-and-Drugs",
}

def lookup_by_inchikey(inchikey: str) -> pd.DataFrame:
    """Return all database cross-references for an InChIKey."""
    r = requests.post(f"{UNICHEM_API}/compounds",
                      json={"type": "inchikey", "compound": inchikey}, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return pd.DataFrame()
    rows = []
    for src in compounds[0].get("sources", []):
        rows.append({
            "source_id": src["id"],
            "source_name": SOURCE_NAMES.get(src["id"], src.get("shortName", "")),
            "compound_id": src["compoundId"],
            "url": src.get("url", ""),
        })
    return pd.DataFrame(rows).sort_values(["source_id", "compound_id"])

# Triclosan cross-references
df = lookup_by_inchikey("XEFQLINVKFYRCS-UHFFFAOYSA-N")
print(f"Triclosan found in {df['source_id'].nunique()} distinct databases ({len(df)} records):")
print(df[["source_name", "compound_id"]].head(8).to_string(index=False))
# Triclosan found in 16 distinct databases
#   ChEMBL    CHEMBL849
#   DrugBank  DB08604
#   RCSB PDB  TCL
#   ChEBI     CHEBI:164200

# Extract specific source IDs from cross-reference table
def get_id_for_source(inchikey: str, source_id: int) -> str | None:
    """Return the compound ID in a specific database, or None if not found."""
    r = requests.post(f"{UNICHEM_API}/compounds",
                      json={"type": "inchikey", "compound": inchikey}, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return None
    for src in compounds[0].get("sources", []):
        if src["id"] == source_id:
            return src["compoundId"]
    return None

triclosan = "XEFQLINVKFYRCS-UHFFFAOYSA-N"
chembl_id   = get_id_for_source(triclosan, source_id=1)   # ChEMBL
pubchem_id  = get_id_for_source(triclosan, source_id=22)  # PubChem
drugbank_id = get_id_for_source(triclosan, source_id=2)   # DrugBank
print(f"Triclosan: ChEMBL={chembl_id}, PubChem={pubchem_id}, DrugBank={drugbank_id}")
# Triclosan: ChEMBL=CHEMBL849, PubChem=5564, DrugBank=DB08604
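
Calling get_id_for_source three times issues three identical POST requests. For batches, one request per InChIKey with a pause between keys fits the ~10 requests/second guidance from the prerequisites. The sketch below separates the table-building logic from the network call so it can be tried offline. cross_reference_table and its fetch_sources parameter are hypothetical names, and the stub fetcher returns hand-written records based on the triclosan output above rather than live data.

```python
import time

def cross_reference_table(inchikeys, wanted, fetch_sources, delay=0.1):
    """Return {inchikey: {label: compound_id or None}} using one fetch per key.

    wanted:        {label: UniChem source ID}, e.g. {"chembl": 1}
    fetch_sources: callable taking an InChIKey and returning a list of source records
    delay:         seconds to wait between consecutive fetches
    """
    table = {}
    for i, key in enumerate(inchikeys):
        if i:
            time.sleep(delay)  # throttle between requests, not before the first
        first_hit = {}
        for src in fetch_sources(key):
            # Keep the first compoundId seen for each source ID.
            first_hit.setdefault(src["id"], src["compoundId"])
        table[key] = {label: first_hit.get(sid) for label, sid in wanted.items()}
    return table

def stub_fetch(inchikey):
    """Offline stand-in for the POST /compounds call (hand-written records)."""
    return [
        {"id": 1, "compoundId": "CHEMBL849"},
        {"id": 2, "compoundId": "DB08604"},
        {"id": 22, "compoundId": "5564"},
    ]

wanted = {"chembl": 1, "pubchem": 22, "drugbank": 2}
print(cross_reference_table(["XEFQLINVKFYRCS-UHFFFAOYSA-N"], wanted, stub_fetch))
```

In real use, replace stub_fetch with a function that performs the same POST as get_id_for_source and returns compounds[0]["sources"], or an empty list when nothing matches.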

Query 2: Compound Lookup by Source-Specific ID

Given a known compound ID in a specific source database (e.g., a ChEMBL ID), retrieve all cross-references. Set type to sourceID, pass the compound's ID in that database as compound, and add the numeric UniChem source ID as sourceID in the body. The response has the same shape as the InChIKey lookup.

import requests

UNICHEM_API = "https://www.ebi.ac.uk/unichem/api/v1"

def get_sources_for_compound(compound_id: str, source_id: int) -> list:
    """Get all database cross-references for a compound identified in a specific source.

    Args:
        compound_id: The ID in the source database (e.g., CHEMBL192)
        source_id: UniChem source ID (1=ChEMBL, 2=DrugBank, 22=PubChem, 7=ChEBI)
    """
    body = {"type": "sourceID", "compound": compound_id, "sourceID": source_id}
    r = requests.post(f"{UNICHEM_API}/compounds", json=body, timeout=20)
    r.raise_for_status()
    compounds = r.json().get("compounds", [])
    if not compounds:
        return []
    return compounds[0].get("sources", [])

# Sildenafil (Viagra): look up starting from ChEMBL ID
sources = get_sources_for_compound("CHEMBL192", source_id=1)
distinct_dbs = {s["id"] for s in sources}
print(f"Sildenafil (CHEMBL192): {len(sources)} source records across {len(distinct_dbs)} databases")
seen = set()
for s in sources:
    if s["id"] in seen:
        continue
    seen.add(s["id"])
    print(f"  [{s['id']:>3}] {s['shortName']:>15}: {s['compoundId']}")
    if len(seen) >= 8:
        break
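
Queries 1 and 2 differ only in the JSON body they send. The sketch below builds either body from the same arguments, which keeps the two shapes in one place. build_compounds_body is a hypothetical helper, and the field names are taken from the request bodies shown in this document.

```python
from typing import Optional

def build_compounds_body(compound: str, source_id: Optional[int] = None) -> dict:
    """Build the JSON body for POST /compounds.

    With no source_id, `compound` is treated as an InChIKey (Query 1).
    With a source_id, `compound` is the ID in that source database (Query 2).
    """
    compound = compound.strip()
    if not compound:
        raise ValueError("compound must be a non-empty string")
    if source_id is None:
        return {"type": "inchikey", "compound": compound}
    return {"type": "sourceID", "compound": compound, "sourceID": source_id}

print(build_compounds_body("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(build_compounds_body("CHEMBL192", source_id=1))
```

The returned dictionary can be passed directly as the json argument of requests.post.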
