Integrating NHGRI-EBI GWAS Catalog with ENCODE Regulatory Data
When to Use
- User wants to intersect ENCODE regulatory elements with GWAS-associated variants
- User asks about "GWAS", "genome-wide association", "disease variants", or "trait-associated SNPs"
- User needs to find which GWAS hits overlap enhancers, promoters, or TF binding sites
- User wants to prioritize GWAS loci by functional annotation from ENCODE data
- Example queries: "find GWAS variants in my H3K27ac peaks", "which diabetes GWAS hits overlap pancreas enhancers?", "annotate GWAS loci with ENCODE regulatory marks"
Connect genome-wide association study findings with ENCODE functional annotations to identify which regulatory elements harbor disease-associated variants and prioritize causal mechanisms for non-coding GWAS hits.
Scientific Rationale
The question: "Which of the disease-associated variants from GWAS fall within active regulatory elements, and what can ENCODE tell us about their functional impact?"
The GWAS Catalog (maintained by NHGRI-EBI) contains over 500,000 variant-trait associations from 6,000+ publications. The central challenge of post-GWAS analysis is that >90% of these associations point to non-coding regions of the genome. ENCODE provides the essential functional annotation layer: if a GWAS variant falls within an active enhancer in disease-relevant tissue, that enhancer becomes a candidate causal mechanism.
This was first demonstrated systematically by Maurano et al. (2012, Science), who showed that disease-associated variants are enriched in DNase I hypersensitive sites (DHSs), and that the cell-type specificity of the DHS predicts the relevant disease tissue. This foundational insight drives the entire GWAS-ENCODE integration framework.
Scale of the Problem
- GWAS Catalog: 500,000+ associations, 100,000+ unique variants, 5,000+ traits
- ENCODE cCREs: 926,535 regulatory elements covering 7.9% of the genome
- Overlap expectation: ~8% of random variants would overlap a cCRE by chance
- Observed enrichment: GWAS variants show 2-5x enrichment in regulatory elements (higher for tissue-matched elements)
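The chance expectation and fold enrichment quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `fold_enrichment` and `binomial_upper_tail` are hypothetical helper names, the counts in the usage note are made up, and the binomial model treats variants as independent draws, ignoring linkage disequilibrium and allele-frequency matching.

```python
import math


def fold_enrichment(n_overlap: int, n_variants: int, background_fraction: float) -> float:
    """Observed overlap fraction divided by the fraction expected by chance."""
    observed_fraction = n_overlap / n_variants
    return observed_fraction / background_fraction


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)
```

For example, with a hypothetical set of 1,000 variants of which 200 overlap a cCRE, `fold_enrichment(200, 1000, 0.079)` is roughly 2.5, and `binomial_upper_tail(200, 1000, 0.079)` gives the probability of seeing at least that many overlaps under the simple chance model.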
Key Literature
- Sollis et al. 2023 "The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource" (Nucleic Acids Research). The current GWAS Catalog publication describing the REST API, summary statistics hosting, and expanded annotation pipeline. DOI: 10.1093/nar/gkac1010
- Buniello et al. 2019 "The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019" (Nucleic Acids Research, ~3,500 citations). The widely cited GWAS Catalog reference describing curation standards and the move to EFO ontology for traits. DOI: 10.1093/nar/gky1120
- Maurano et al. 2012 "Systematic localization of common disease-associated variation in regulatory DNA" (Science, ~3,000 citations). The foundational demonstration that GWAS variants concentrate in DNase I hypersensitive sites, with cell-type-specific enrichment predicting disease-relevant tissues. Enabled de novo identification of pathogenic cell types from variant sets. DOI: 10.1126/science.1222794
- ENCODE Project Consortium 2020 (Nature, ~1,656 citations). Registry of 926,535 human cCREs that provides the regulatory annotation layer for GWAS variant interpretation. DOI: 10.1038/s41586-020-2493-4
- Finucane et al. 2015 (Nature Genetics, ~2,253 citations). Stratified LD Score Regression (S-LDSC) for partitioning heritability into ENCODE-defined functional categories. DOI: 10.1038/ng.3404
- Nasser et al. 2021 (Nature, ~468 citations). ABC model linked 5,036 GWAS signals to 2,249 genes using ENCODE data. DOI: 10.1038/s41586-021-03446-x
GWAS Catalog REST API Reference
Base URL: https://www.ebi.ac.uk/gwas/rest/api
No authentication required. Responses are JSON (HAL format).
Key Endpoints
| Endpoint | Purpose | Parameters |
|---|---|---|
| /singleNucleotidePolymorphisms/{rsId} | Get variant details | rsId (e.g., rs7903146) |
| /singleNucleotidePolymorphisms/{rsId}/associations | Get associations for a variant | rsId |
| /associations?pubmedId={pmid} | Get associations from a study | PubMed ID |
| /studies?diseaseTrait={trait} | Find studies by trait name | Trait string |
| /efoTraits/{efoId} | Get trait details by EFO ID | EFO ID |
| /efoTraits/{efoId}/associations | Associations for a trait | EFO ID |
| /studies/{studyId} | Study details | Study accession (GCST...) |
Pagination
All list endpoints support pagination:
?page=0&size=20 (default page size is 20, max is 500)
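A minimal sketch of walking through every page of a list endpoint follows. It assumes the HAL response carries a `page` object with a `totalPages` field and nests results under `_embedded`, as is typical for this style of API. `iter_pages` and `fetch_json` are hypothetical helper names, and the standard-library `urllib` is used here only to keep the sketch dependency-free.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "https://www.ebi.ac.uk/gwas/rest/api"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_pages(
    path: str,
    resource: str,
    size: int = 100,
    fetch: Callable[[str], dict] = fetch_json,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield every record under _embedded[resource], one page at a time."""
    page = 0
    while True:
        sep = "&" if "?" in path else "?"
        payload = fetch(f"{BASE_URL}{path}{sep}page={page}&size={size}")
        yield from payload.get("_embedded", {}).get(resource, [])
        # Assumption: the payload reports the total page count under "page"
        total_pages = payload.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages or (max_pages is not None and page >= max_pages):
            break
```

Usage might look like `for assoc in iter_pages("/efoTraits/EFO_0001360/associations", "associations", size=500): ...`. Passing `max_pages` caps the number of requests while exploring, and the `fetch` argument allows a stub to be swapped in for offline testing.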
Bulk Downloads
For genome-wide analysis, use the GWAS Catalog downloads (faster than API):
- All associations: https://www.ebi.ac.uk/gwas/api/search/downloads/full
- Alternative: https://www.ebi.ac.uk/gwas/docs/file-downloads
- Format: TSV with columns for variant, trait, p-value, OR/beta, study, etc.
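Once the TSV is downloaded, it can be reduced to a list of variants with coordinates and p-values. The sketch below assumes the column names `SNPS`, `CHR_ID`, `CHR_POS`, `DISEASE/TRAIT`, and `P-VALUE`; verify these against the header of the file you actually download. `parse_associations` is a hypothetical helper, the `SAMPLE` rows are synthetic placeholders rather than real catalog entries, and the default threshold of 5e-8 is the conventional genome-wide significance cutoff.

```python
import csv
import io
from typing import Dict, List

# Synthetic rows for illustration only (not real GWAS Catalog data)
SAMPLE = (
    "SNPS\tCHR_ID\tCHR_POS\tDISEASE/TRAIT\tP-VALUE\n"
    "rsTEST1\t10\t1000\tsynthetic trait A\t1E-12\n"
    "rsTEST2\t2\t2000\tsynthetic trait B\t3E-6\n"
    "rsTEST3\t\t\tsynthetic trait C\t1E-20\n"
)


def parse_associations(tsv_text: str, p_threshold: float = 5e-8) -> List[Dict]:
    """Keep rows with a single mapped position and p-value <= threshold."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            record = {
                "rsid": row["SNPS"],
                "chrom": "chr" + row["CHR_ID"],
                "pos": int(row["CHR_POS"]),
                "trait": row["DISEASE/TRAIT"],
                "pvalue": float(row["P-VALUE"]),
            }
        except (KeyError, TypeError, ValueError):
            # Skip rows with missing columns or unparseable coordinates
            continue
        if record["pvalue"] <= p_threshold:
            kept.append(record)
    return kept
```

Calling `parse_associations(SAMPLE)` keeps only the first synthetic row: the second fails the p-value threshold and the third has no mapped position. Rows whose position field holds several values are also skipped by the `int()` conversion, which is a deliberate simplification.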
Step 1: Define the Disease/Trait and Relevant Tissues
Query GWAS Catalog for a Trait
import requests

# Search by trait name
trait = "type 2 diabetes"
url = "https://www.ebi.ac.uk/gwas/rest/api/studies"
params = {"diseaseTrait": trait}
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail early on HTTP errors

# HAL responses nest the result list under "_embedded"
studies = response.json()["_embedded"]["studies"]
print(f"Found {len(studies)} GWAS studies for '{trait}'")
for study in studies[:5]:
    print(f"  {study['accessionId']}: {study['publicationInfo']['title'][:80]}...")
Use EFO IDs for Standardized Trait Queries
The Experimental Factor Ontology (EFO) standardizes trait names:
| Common Trait | EFO ID | EFO Term |
|---|---|---|
| Type 2 diabetes | EFO_0001360 | type II diabetes mellitus |
| Breast cancer | EFO_0000305 | breast carcinoma |
| Alzheimer's disease | MONDO_0004975 | Alzheimer disease |
| Crohn's disease | EFO_0000384 | Crohn's disease |
| Coronary artery disease | EFO_0001645 | coronary artery disease |
| Schizophrenia | EFO_0000692 | schizophrenia |
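# Query by EFO ID (more precise)
efo_id = "EFO_0001360"
url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
params = {"size": 500}  # maximum page size
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
associations = response.json()["_embedded"]["associations"]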
Map Trait to ENCODE Tissues
Following the Maurano et al. (2012) framework, disease-associated variants are enriched in regulatory elements specific to the disease-relevant tissue:
| Disease Category | Expected Enriched ENCODE Tissues |
|---|---|
| Type 2 diabetes | Pancreatic islets, liver, adipose, skeletal muscle |
| Autoimmune diseases | Immune cells (T/B cells, monocytes), thymus |
| Neuropsychiatric | Brain (cortex, hippocampus), neurons |
| Cardiovascular | Heart, blood vessels, blood |
| Liver disease | Liver, hepatocytes (HepG2) |
| Inflammatory bowel | Intestine, colon, immune cells |
| Cancer | Tissue of origin + immune microenvironment |
# Check ENCODE data availability for disease-relevant tissue
encode_get_facets(organ="pancreas")
encode_get_facets(organ="liver")
Step 2: Retrieve GWAS Variants
Get Associations for a Variant
def get_gwas_associations(rs_id):
    """Get all GWAS associations for a variant."""
    url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        # Return an empty list if the response has no embedded associations
        return response.json().get("_embedded", {}).get("associations", [])
    return []

# Example: rs7903146 (strongest T2D variant, in TCF7L2)
associations = get_gwas_associations("rs7903146")
for assoc in associations:
    # efoTraits may be missing or empty, so fall back to "Unknown"
    traits = assoc.get("efoTraits") or []
    trait = traits[0]["trait"] if traits else "Unknown"
    pval = assoc["pvalue"]
    print(f"  Trait: {trait}, p-value: {pval}")
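A well-studied variant can return many associations for the same trait across studies, so it can help to collapse the list to the strongest p-value per trait. The sketch below assumes the same `pvalue` and `efoTraits` fields used in the loop above; `best_pvalue_per_trait` is a hypothetical helper, not part of the GWAS Catalog API.

```python
from typing import Dict, Iterable


def best_pvalue_per_trait(associations: Iterable[dict]) -> Dict[str, float]:
    """Map each trait name to the smallest p-value seen for it."""
    best: Dict[str, float] = {}
    for assoc in associations:
        raw = assoc.get("pvalue")
        if raw is None:
            continue  # nothing to compare without a p-value
        pval = float(raw)
        # Associations without trait annotations are grouped as "Unknown"
        traits = assoc.get("efoTraits") or [{"trait": "Unknown"}]
        for entry in traits:
            name = entry.get("trait", "Unknown")
            if name not in best or pval < best[name]:
                best[name] = pval
    return best
```

The result can be sorted with `sorted(best.items(), key=lambda kv: kv[1])` to list the most strongly associated traits first.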