Aggregate DNA Methylation Data Across Studies
When to Use
- User wants to build a tissue-level DNA methylation landscape from multiple WGBS experiments
- User asks "where is DNA methylated in brain?" or "find hypomethylated regions across donors"
- User needs to identify HMRs (hypomethylated regions), UMRs, or PMDs from aggregated WGBS data
- User wants per-CpG weighted methylation averages from multiple experiments
- Example queries: "aggregate WGBS data for liver", "build methylation map across donors", "find unmethylated CpG islands in pancreas"
Build a comprehensive methylation landscape for a tissue/cell type by merging WGBS bedMethyl files from multiple ENCODE experiments.
Scientific Rationale
The question: "What is the DNA methylation state across the genome in my tissue?"
DNA methylation is fundamentally different from histone marks and accessibility:
| Property | Histone/Accessibility | DNA Methylation |
|---|---|---|
| Signal type | Binary (bound/open or not) | Continuous (0-100% methylated) |
| Default state | Unmarked | ~70-80% methylated (CpG context) |
| Biology of interest | Where marks ARE present | Where methylation is ABSENT or REDUCED |
| Aggregation approach | Union of peak calls | Average/median of methylation levels per CpG |
The key insight: Unlike histone ChIP-seq where we want the union of all peaks, for methylation we want the average methylation level per CpG site across individuals. Methylation is a quantitative, continuous signal measured at every CpG dinucleotide.
However, for identifying regulatory regions, we focus on hypomethylated regions (HMRs) — stretches of low methylation that mark active regulatory elements. HMRs can be treated more like peaks for union-style aggregation.
Literature Support
- Roadmap Epigenomics (Schultz et al. 2015, Nature, 2,900+ citations): Established that tissue-specific HMRs mark active regulatory elements; demonstrated per-CpG averaging across biological replicates as standard approach
- DMRcate (Peters et al. 2021, Nucleic Acids Research, 65 citations): Method for calling differentially methylated regions from multiple WGBS samples; uses kernel smoothing across CpG sites
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated methylation data with histone marks and accessibility to define chromatin states
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter. DOI
- Zhou et al. 2020 (Nature Genetics): Tissue-specific methylation patterns: ~80% of CpGs are constitutively methylated, ~10% constitutively unmethylated (CpG islands/promoters), ~10% tissue-variable
- Liu et al. 2024 (Briefings in Bioinformatics): Cross-platform comparison (NovaSeq vs DNBSEQ) showing WGBS is gold standard; coverage depth critically affects accuracy; platform differences exist in GC-rich regions
- Ortega-Recalde et al. 2021 (Methods in Molecular Biology): Demonstrated that even low-coverage WGBS can accurately estimate global methylation levels, with bootstrap methods to quantify uncertainty
Two-Level Analysis
- Per-CpG level: Average methylation at each CpG site across samples (quantitative map)
- Region level: Identify HMRs, PMDs, and UMRs from the averaged profile (union of regulatory regions)
Step 1: Find All Available WGBS Data
encode_search_experiments(
assay_title="WGBS",
organ="pancreas", # user's tissue of interest
biosample_type="tissue",
limit=100
)
Present a summary to the user:
- Total WGBS experiments
- Labs represented
- Unique donors/biosamples
- Genome coverage per experiment
Use encode_get_facets to check availability:
encode_get_facets(assay_title="WGBS", organ="pancreas")
Note: WGBS is expensive to generate. Typical tissues have 2-5 experiments. Even 2 biological replicates are valuable for identifying consistent methylation patterns.
Step 2: Quality-Gate Each Experiment
encode_get_experiment(accession="ENCSR...")
WGBS Quality Checks
- Audit status: no ERROR flags
- Bisulfite conversion rate: >=98% (measured by lambda spike-in or non-CpG methylation)
- Genome coverage: >=10x mean CpG coverage for reliable per-site estimates
- Mapping rate: >=50% (bisulfite-converted reads are harder to map)
- Duplication rate: <30%
- Has
methylation state at CpGoutput files (bedMethyl format)
Include if:
- Bisulfite conversion >=98%
- Mean CpG coverage >=10x
- Has bedMethyl output files for GRCh38
Exclude if:
- ERROR audit flags
- Conversion rate <98% (unconverted reads create false methylation calls)
- Very low coverage (<5x mean) — individual CpG estimates unreliable
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download bedMethyl Files
For each experiment:
encode_list_files(
experiment_accession="ENCSR...",
output_type="methylation state at CpG",
assembly="GRCh38"
)
bedMethyl format (ENCODE standard):
chr start end name score strand thickStart thickEnd color coverage percentMethylated
- Column 10: read coverage at this CpG
- Column 11: percent methylation (0-100)
Prefer preferred_default=True files:
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/methylation",
organize_by="flat"
)
Step 4: Per-Sample Quality Filtering
4a. Coverage Filtering (CRITICAL)
Low-coverage CpGs have unreliable methylation estimates. Filter per-sample:
# Keep only CpGs with >= 5x coverage (column 10)
# More stringent: >= 10x for quantitative analysis
awk '$10 >= 5' sample.bedMethyl > sample.covfiltered.bedMethyl
Coverage thresholds by use case:
| Threshold | Use Case | Typical CpGs Retained |
|---|---|---|
| >=3x | Exploratory / maximum retention | ~90% of CpGs |
| >=5x | Standard analysis | ~80% of CpGs |
| >=10x | High-confidence quantitative | ~60% of CpGs |
4b. ENCODE Blocklist Filtering (Amemiya et al. 2019)
# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.covfiltered.bedMethyl -b hg38-blacklist.v2.bed -v > sample.filtered.bedMethyl
4c. Strand Merging (Optional but Recommended)
CpG methylation is typically symmetric (same on both strands). Merge strand-specific calls to increase per-CpG coverage. Caveat: Some ENCODE bedMethyl files may already be strand-merged — check if both strands are present before applying this step:
# Group CpGs by position (forward and reverse strand of same CpG)
# Sum coverage, calculate weighted average methylation
awk 'BEGIN{OFS="\t"} {
# CpG position (use the C position as canonical)
if ($6 == "+") pos = $2
else pos = $2 - 1
key = $1"\t"pos
cov[key] += $10
meth[key] += ($11/100) * $10
} END {
for (k in cov) {
split(k, a, "\t")
avg_meth = (meth[k] / cov[k]) * 100
print a[1], a[2], a[2]+2, "CpG", 0, ".", a[2], a[2]+2, "0,0,0", cov[k], avg_meth
}
}' sample.filtered.bedMethyl | sort -k1,1 -k2,2n > sample.merged_strands.bedMethyl
Step 5: Cross-Sample Aggregation (Per-CpG Averaging)
5a. Create a Unified CpG Matrix
# Step 1: Find CpGs covered in at least M of N samples
# Extract positions from each sample
for f in sample*.merged_strands.bedMethyl; do
awk 'BEGIN{OFS="\t"} {print $1, $2, $3}' "$f"
done | sort -k1,1 -k2,2n | uniq -c | \
awk -v M=2 '$1 >= M {print $2, $3, $4}' OFS="\t" > shared_cpgs.bed
# Step 2: For each sample, extract methylation at shared CpGs
for f in sample*.merged_strands.bedMethyl; do
bedtools intersect -a shared_cpgs.bed -b "$f" -wa -wb | \
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $NF, $(NF-1)}' > "${f%.bedMethyl}.shar