Query gnomAD for Population Variant Data
When to Use
- User wants to check population allele frequencies for variants in ENCODE regulatory regions
- User asks about "gnomAD", "allele frequency", "population genetics", "gene constraint", or "variant frequency"
- User needs to filter regulatory variants by rarity (common vs rare) using population data
- User wants to assess gene constraint (pLI, LOEUF) for targets identified from ENCODE ChIP-seq
- Example queries: "check gnomAD frequency for variants in my peaks", "is this regulatory variant rare?", "what's the constraint score for CTCF?"
Annotate ENCODE-identified regulatory variants with population allele frequencies and gene constraint scores from the Genome Aggregation Database.
Scientific Rationale
The question: "How common is this variant in the population, and how constrained is the gene it regulates?"
ENCODE identifies regulatory elements and the variants within them, but does not provide population frequency data. gnomAD (v4.1: 807,162 individuals, comprising 730,947 exomes and 76,215 genomes) fills this gap, letting researchers distinguish common regulatory variants (likely benign or of modest effect) from rare variants (potentially pathogenic or high-impact).
Why gnomAD + ENCODE Together
| ENCODE Provides | gnomAD Provides | Combined Insight |
|---|---|---|
| Variant overlaps cCRE (dELS) | AF = 0.0001 (rare) | Rare variant disrupting an enhancer → high priority |
| Variant in TF binding site | AF = 0.15 (common) | Common regulatory variant → likely modest effect or GWAS candidate |
| Target gene identified | LOEUF = 0.12 (highly constrained) | Constrained gene + rare enhancer variant → strong candidate |
| Variant in CRISPR-validated enhancer | Not in gnomAD (absent) | Ultra-rare/de novo → possible pathogenic regulatory variant |
Key Resources
- gnomAD v4.1 (GRCh38): 807,162 individuals, 62.9M SNVs + 6.2M indels
- gnomAD v3.1.2 (GRCh38): Genome-only dataset, 76,156 genomes
- gnomAD v2.1.1 (GRCh37/hg19): Legacy dataset, still widely used
- ExAC (GRCh37): Superseded by gnomAD. All ExAC samples included in gnomAD v2+.
- gnomAD browser: https://gnomad.broadinstitute.org
Literature Support
- Karczewski et al. 2020 (Nature, ~5,000 citations): gnomAD, the mutational constraint spectrum from 141,456 individuals. Defined LOEUF as the primary constraint metric.
- Lek et al. 2016 (Nature, ~7,000 citations): ExAC, an analysis of protein-coding variation in 60,706 humans. Established pLI for gene constraint.
- Maurano et al. 2012 (Science, ~2,800 citations): Disease variants concentrate in DNase hypersensitive sites.
- Finucane et al. 2015 (Nature Genetics, ~2,253 citations): Stratified LD score regression partitioning heritability into ENCODE annotations.
Step 1: Determine the Query Type
| User Has | Query Strategy |
|---|---|
| Specific variant (rs ID or chr-pos-ref-alt) | Single variant lookup |
| List of GWAS/eQTL variants | Batch variant query |
| Gene of interest (ENCODE target) | Gene constraint lookup |
| Genomic region with ENCODE peaks | Region variant query |
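The table above can be sketched as a small dispatcher. This is a minimal illustration, not part of any gnomAD or ENCODE tooling: the function names and the regular expressions are assumptions, and real input (for example gene aliases or indel notation) may need looser rules.

```python
import re

# Illustrative patterns only (assumptions, not an official grammar).
_CHROM = r"(?:chr)?(?:[1-9]|1\d|2[0-2]|X|Y)"
_RSID = re.compile(r"^rs\d+$", re.IGNORECASE)
_VARIANT = re.compile(rf"^{_CHROM}-\d+-[ACGT]+-[ACGT]+$", re.IGNORECASE)
_REGION = re.compile(rf"^{_CHROM}:\d+-\d+$", re.IGNORECASE)
_GENE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")


def classify_token(token):
    """Label one user-supplied string as variant, region, gene, or unknown."""
    token = token.strip()
    if _RSID.match(token) or _VARIANT.match(token):
        return "variant"
    if _REGION.match(token):
        return "region"
    if _GENE.match(token):
        return "gene"
    return "unknown"


def choose_query_strategy(tokens):
    """Map a list of inputs of one kind to a strategy name from the table."""
    kinds = {classify_token(t) for t in tokens}
    if len(kinds) != 1 or "unknown" in kinds:
        raise ValueError(f"cannot pick a single strategy for: {tokens!r}")
    kind = kinds.pop()
    if kind == "variant":
        # One variant is a single lookup; several become a batch.
        return "Single variant lookup" if len(tokens) == 1 else "Batch variant query"
    return {"gene": "Gene constraint lookup", "region": "Region variant query"}[kind]
```

For example, `choose_query_strategy(["CTCF"])` selects the gene constraint lookup, while a list of rs IDs selects the batch variant query. Mixed input raises an error so the caller can split it first.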
Step 2: Query gnomAD via GraphQL API
Endpoint: https://gnomad.broadinstitute.org/api
Method: POST with GraphQL query in JSON body
Authentication: None required
Rate limit: requests are throttled per IP address; keep batch queries to about 1 request/second
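A minimal Python sketch of a client that follows these settings, using only the standard library. The class name `ThrottledClient` and its parameters are hypothetical; the `send`, `clock`, and `sleep` hooks exist only so the pacing logic can be exercised without network access.

```python
import json
import time
import urllib.request

GNOMAD_API = "https://gnomad.broadinstitute.org/api"


def _http_post(body):
    """POST one JSON body (bytes) to the endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        GNOMAD_API,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


class ThrottledClient:
    """Hypothetical helper that spaces requests at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, send=_http_post,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._send = send
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def query(self, graphql, variables=None):
        # Wait out the remainder of the interval since the previous call.
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        body = json.dumps({"query": graphql, "variables": variables or {}})
        reply = self._send(body.encode("utf-8"))
        # GraphQL reports failures in an "errors" list alongside "data".
        if reply.get("errors"):
            raise RuntimeError(f"gnomAD API error: {reply['errors']}")
        return reply["data"]
```

A single shared client instance keeps a batch loop at roughly one request per second without any extra bookkeeping in the caller.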
Single Variant Lookup
curl -X POST https://gnomad.broadinstitute.org/api \
-H "Content-Type: application/json" \
-d '{
"query": "query { variant(variantId: \"1-55517991-C-CAT\", dataset: gnomad_r4) { exome { ac an af } genome { ac an af } joint { ac an af } } }"
}'
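Variant ID format: CHR-POS-REF-ALT (1-based position, no "chr" prefix). The position must be on the reference genome of the chosen dataset: GRCh38 for gnomad_r4 and gnomad_r3, GRCh37 for gnomad_r2_1.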
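Variants often arrive with a "chr" prefix or colon separators, so a small normalizer helps before building the request. The sketch below is an assumption-laden illustration: `to_gnomad_variant_id` and `variant_payload` are hypothetical helper names, only autosomes plus X and Y are accepted, and the query text is the one shown in the curl example.

```python
import json
import re

# Accepts "-" or ":" separators and an optional "chr" prefix (illustrative rule).
_VARIANT = re.compile(
    r"^(?:chr)?([1-9]|1\d|2[0-2]|X|Y)[-:](\d+)[-:]([ACGT]+)[-:]([ACGT]+)$",
    re.IGNORECASE,
)


def to_gnomad_variant_id(text):
    """Normalize chrom/pos/ref/alt text to the CHR-POS-REF-ALT form."""
    match = _VARIANT.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom-pos-ref-alt variant: {text!r}")
    chrom, pos, ref, alt = match.groups()
    if int(pos) < 1:
        raise ValueError("position must be 1-based")
    return f"{chrom.upper()}-{int(pos)}-{ref.upper()}-{alt.upper()}"


def variant_payload(text, dataset="gnomad_r4"):
    """Build the JSON body for the single-variant query used in this section."""
    variant_id = to_gnomad_variant_id(text)
    query = (
        "query { variant(variantId: %s, dataset: %s) "
        "{ exome { ac an af } genome { ac an af } joint { ac an af } } }"
        % (json.dumps(variant_id), dataset)  # json.dumps adds the quotes
    )
    return {"query": query}
```

The normalizer does not lift coordinates between genome builds; it only changes the spelling of the identifier.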
Gene Constraint Lookup
curl -X POST https://gnomad.broadinstitute.org/api \
-H "Content-Type: application/json" \
-d '{
"query": "query { gene(gene_symbol: \"BRCA2\", reference_genome: GRCh38) { symbol gene_id gnomad_constraint { pLI oe_lof oe_lof_lower oe_lof_upper oe_mis oe_mis_lower oe_mis_upper } } }"
}'
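Once the response is parsed as JSON, the constraint fields can be pulled out as below. This sketch assumes the standard GraphQL response shape, with the result nested under `data.gene` and mirroring the fields requested above; `extract_constraint` is a hypothetical helper. LOEUF is read from `oe_lof_upper`, the upper bound of the observed/expected LoF confidence interval.

```python
def extract_constraint(response):
    """Return the constraint metrics for one gene from a parsed API response."""
    gene = (response.get("data") or {}).get("gene")
    if gene is None:
        raise LookupError("no gene in response")
    # Some genes may carry no constraint record; fall back to empty values.
    constraint = gene.get("gnomad_constraint") or {}
    return {
        "symbol": gene.get("symbol"),
        "gene_id": gene.get("gene_id"),
        "loeuf": constraint.get("oe_lof_upper"),
        "pLI": constraint.get("pLI"),
        "oe_lof": constraint.get("oe_lof"),
        "oe_mis": constraint.get("oe_mis"),
    }
```

Returning `None` for missing metrics, rather than raising, lets a batch run continue and report which genes lack constraint data.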
Region Variant Query
curl -X POST https://gnomad.broadinstitute.org/api \
-H "Content-Type: application/json" \
-d '{
"query": "query { region(chrom: \"1\", start: 55505222, stop: 55530526, reference_genome: GRCh38) { variants(dataset: gnomad_r4) { variant_id pos ref alt exome { ac af } genome { ac af } } } }"
}'
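ENCODE peak files are BED, which is 0-based and half-open, so peak coordinates need shifting before they go into a region query. The sketch below assumes the API's `start` and `stop` are 1-based and inclusive, matching the 1-based variant positions used elsewhere in this section; `bed_to_region_query` is a hypothetical helper that reuses the query text from the curl example.

```python
def bed_to_region_query(chrom, bed_start, bed_end,
                        dataset="gnomad_r4", reference_genome="GRCh38"):
    """Build a region query body from one BED interval."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= bed_start < bed_end")
    # Drop the "chr" prefix used in ENCODE BED files.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    start = bed_start + 1  # 0-based start becomes 1-based
    stop = bed_end         # half-open end equals the inclusive 1-based end
    query = (
        'query { region(chrom: "%s", start: %d, stop: %d, reference_genome: %s) '
        "{ variants(dataset: %s) { variant_id pos ref alt "
        "exome { ac af } genome { ac af } } } }"
        % (chrom, start, stop, reference_genome, dataset)
    )
    return {"query": query}
```

The peak file and the dataset must share a genome build; this helper does not check or convert between builds.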
Step 3: Interpret Constraint Scores
Gene Constraint (Karczewski et al. 2020)
| Metric | Definition | Interpretation |
|---|---|---|
| LOEUF | Upper bound of the 90% CI of the loss-of-function observed/expected ratio | <0.35 = highly constrained (v2); <0.6 = constrained (v4) |
| pLI | Probability of being loss-of-function intolerant | >0.9 = LoF-intolerant (legacy metric from ExAC) |
| oe_lof | Observed/expected loss-of-function ratio | <0.2 = highly constrained |
| oe_mis | Observed/expected missense ratio | <0.6 = missense constrained |
| Z_syn | Synonymous Z-score | Near 0 expected; deviation suggests selection |
LOEUF is preferred over pLI in gnomAD v2.1 and later, including v4. LOEUF is continuous and better calibrated.
Applying Constraint to ENCODE Analysis
For genes identified as targets of ENCODE regulatory elements:
| LOEUF | Interpretation | Implication for Regulatory Variants |
|---|---|---|
| <0.35 | Highly constrained (likely haploinsufficient) | Regulatory variants more likely to be pathogenic; even modest expression changes may be deleterious |
| 0.35-0.6 | Moderately constrained | Regulatory variants worth investigating |
| >0.6 | Tolerant of LoF | Expression changes likely tolerated; regulatory variants less likely pathogenic |
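The tiers in the table above translate directly into a lookup function. This is a minimal sketch: `loeuf_tier` is a hypothetical name, the cut-offs are the ones in the table, and a value of exactly 0.6 is treated as moderately constrained because the table only calls values above 0.6 tolerant.

```python
def loeuf_tier(loeuf):
    """Map a LOEUF value to the interpretation tiers used in this section."""
    if loeuf is None:
        return "unknown"  # no constraint record for the gene
    if loeuf < 0.35:
        return "highly constrained"
    if loeuf <= 0.6:
        return "moderately constrained"
    return "tolerant of LoF"
```

With the example value from the first table in this document, `loeuf_tier(0.12)` falls in the highly constrained tier.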
Step 4: Allele Frequency Filtering
Standard Frequency Thresholds
| Category | Allele Frequency | Use Case |
|---|---|---|
| Ultra-rare | AF < 0.0001 (1 in 10,000) | Mendelian disease candidates |
| Rare | AF < 0.01 (1%) | Rare disease, high-penetrance |
| Low-frequency | 0.01-0.05 | eQTL fine-mapping |
| Common | AF > 0.05 (5%) | GWAS, population-level effects |
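These thresholds can be applied with a short classifier. The sketch below computes AF from allele count and allele number so that it does not depend on the API returning a precomputed frequency; the function names are hypothetical, and boundary values follow the table (exactly 0.0001 counts as rare, exactly 0.05 as low-frequency).

```python
def allele_frequency(ac, an):
    """Return AC/AN, or None when no alleles were called at the site."""
    if not an:
        return None
    return ac / an


def classify_af(af):
    """Assign an allele frequency to one of the categories in the table."""
    if af is None:
        return "no data"
    if af < 0.0001:
        return "ultra-rare"
    if af < 0.01:
        return "rare"
    if af <= 0.05:
        return "low-frequency"
    return "common"
```

Keeping "no data" separate from "ultra-rare" matters for interpretation: a variant absent from gnomAD is not the same as one observed at very low frequency.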
Population-Specific Frequencies
gnomAD provides frequencies for genetic ancestry groups:
- African/African American (afr)
- Admixed American/Latino (amr)
- Ashkenazi Jewish (asj)
- East Asian (eas)
- European (Finnish) (fin)
- European (non-Finnish) (nfe)
- Middle Eastern (mid)
- South Asian (sas)
Critical: A variant "rare" globally may be common in one population. Always check population-specific frequencies when interpreting regulatory variants in disease context.
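The check described above can be sketched as follows. The input is assumed to be a list of per-group records with `id`, `ac`, and `an` keys; this shape is an assumption about how the caller has collected population counts, not a documented response format. The group identifiers are the eight listed above, and both function names are hypothetical.

```python
ANCESTRY_GROUPS = ("afr", "amr", "asj", "eas", "fin", "nfe", "mid", "sas")


def population_max_af(populations, min_an=1):
    """Return (group_id, af) for the group with the highest AF, or None."""
    best = None
    for pop in populations:
        # Skip subgroups outside the list above and groups with too few alleles.
        if pop["id"] not in ANCESTRY_GROUPS or pop["an"] < max(min_an, 1):
            continue
        af = pop["ac"] / pop["an"]
        if best is None or af > best[1]:
            best = (pop["id"], af)
    return best


def rare_globally_but_common_somewhere(global_af, populations,
                                       rare=0.01, common=0.05):
    """Flag variants that pass a global rarity filter yet are common in one group."""
    top = population_max_af(populations)
    return global_af < rare and top is not None and top[1] > common
```

Raising `min_an` guards against inflated frequencies in groups with very few called alleles; the right floor depends on the dataset and is left to the caller.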
Step 5: Integrated ENCODE + gnomAD Workflow
1. Identify regulatory variants from ENCODE:
encode_search_files(output_type="IDR thresholded peaks", organ="pancreas", file_format="bed")
→ Intersect peaks with GWAS/eQTL variants using bedtools
2. Get allele frequencies from gnomAD:
→ GraphQL query for each variant
→ Filter by desired AF threshold
3. Check constraint of target genes:
→ For each variant-to-gene link (from ABC model, ENCODE enhancer-gene maps)
→ Query gnomAD gene constraint (LOEUF)
4. Prioritize:
→ Rare variant (AF < 0.01) + active enhancer + constrained gene (LOEUF < 0.35) = HIGH PRIORITY
→ Common variant (AF > 0.05) + active enhancer = potential GWAS mechanism
→ Absent from gnomAD + CRISPR-validated enhancer = potential de novo pathogenic
5. Track provenance:
encode_log_derived_file(
file_path="/path/to/prioritized_variants.tsv",
source_accessions=["ENCSR...", "gnomAD_v4.1"],
description="ENCODE regulatory variants filtered by gnomAD AF and constraint",
tool_used="bedtools int