Aggregate DNA Methylation Data Across Studies

When to Use

User wants to build a tissue-level DNA methylation landscape from multiple WGBS experiments
User asks "where is DNA methylated in brain?" or "find hypomethylated regions across donors"
User needs to identify HMRs (hypomethylated regions), UMRs, or PMDs from aggregated WGBS data
User wants per-CpG weighted methylation averages from multiple experiments
Example queries: "aggregate WGBS data for liver", "build methylation map across donors", "find unmethylated CpG islands in pancreas"

Build a comprehensive methylation landscape for a tissue/cell type by merging WGBS bedMethyl files from multiple ENCODE experiments.

Scientific Rationale

The question: "What is the DNA methylation state across the genome in my tissue?"

DNA methylation is fundamentally different from histone marks and accessibility:

Property	Histone/Accessibility	DNA Methylation
Signal type	Binary (bound/open or not)	Continuous (0-100% methylated)
Default state	Unmarked	~70-80% methylated (CpG context)
Biology of interest	Where marks ARE present	Where methylation is ABSENT or REDUCED
Aggregation approach	Union of peak calls	Average/median of methylation levels per CpG

The key insight: Unlike histone ChIP-seq where we want the union of all peaks, for methylation we want the average methylation level per CpG site across individuals. Methylation is a quantitative, continuous signal measured at every CpG dinucleotide.

However, for identifying regulatory regions, we focus on hypomethylated regions (HMRs) — stretches of low methylation that mark active regulatory elements. HMRs can be treated more like peaks for union-style aggregation.

Literature Support

Roadmap Epigenomics (Schultz et al. 2015, Nature, 2,900+ citations): Established that tissue-specific HMRs mark active regulatory elements; demonstrated per-CpG averaging across biological replicates as standard approach
DMRcate (Peters et al. 2021, Nucleic Acids Research, 65 citations): Method for calling differentially methylated regions from multiple WGBS samples; uses kernel smoothing across CpG sites
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated methylation data with histone marks and accessibility to define chromatin states
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter. DOI
Zhou et al. 2020 (Nature Genetics): Tissue-specific methylation patterns: ~80% of CpGs are constitutively methylated, ~10% constitutively unmethylated (CpG islands/promoters), ~10% tissue-variable
Liu et al. 2024 (Briefings in Bioinformatics): Cross-platform comparison (NovaSeq vs DNBSEQ) showing WGBS is gold standard; coverage depth critically affects accuracy; platform differences exist in GC-rich regions
Ortega-Recalde et al. 2021 (Methods in Molecular Biology): Demonstrated that even low-coverage WGBS can accurately estimate global methylation levels, with bootstrap methods to quantify uncertainty

Two-Level Analysis

Per-CpG level: Average methylation at each CpG site across samples (quantitative map)
Region level: Identify HMRs, PMDs, and UMRs from the averaged profile (union of regulatory regions)

Step 1: Find All Available WGBS Data

encode_search_experiments(
    assay_title="WGBS",
    organ="pancreas",           # user's tissue of interest
    biosample_type="tissue",
    limit=100
)

Present a summary to the user:

Total WGBS experiments
Labs represented
Unique donors/biosamples
Genome coverage per experiment

Use encode_get_facets to check availability:

encode_get_facets(assay_title="WGBS", organ="pancreas")

Note: WGBS is expensive to generate. Typical tissues have 2-5 experiments. Even 2 biological replicates are valuable for identifying consistent methylation patterns.

Step 2: Quality-Gate Each Experiment

encode_get_experiment(accession="ENCSR...")

WGBS Quality Checks

Audit status: no ERROR flags
Bisulfite conversion rate: >=98% (measured by lambda spike-in or non-CpG methylation)
Genome coverage: >=10x mean CpG coverage for reliable per-site estimates
Mapping rate: >=50% (bisulfite-converted reads are harder to map)
Duplication rate: <30%
Has methylation state at CpG output files (bedMethyl format)

Include if:

Bisulfite conversion >=98%
Mean CpG coverage >=10x
Has bedMethyl output files for GRCh38

Exclude if:

ERROR audit flags
Conversion rate <98% (unconverted reads create false methylation calls)
Very low coverage (<5x mean) — individual CpG estimates unreliable

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download bedMethyl Files

For each experiment:

encode_list_files(
    experiment_accession="ENCSR...",
    output_type="methylation state at CpG",
    assembly="GRCh38"
)

bedMethyl format (ENCODE standard):

chr  start  end  name  score  strand  thickStart  thickEnd  color  coverage  percentMethylated

Column 10: read coverage at this CpG
Column 11: percent methylation (0-100)

Prefer preferred_default=True files:

encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/methylation",
    organize_by="flat"
)

Step 4: Per-Sample Quality Filtering

4a. Coverage Filtering (CRITICAL)

Low-coverage CpGs have unreliable methylation estimates. Filter per-sample:

# Keep only CpGs with >= 5x coverage (column 10)
# More stringent: >= 10x for quantitative analysis
awk '$10 >= 5' sample.bedMethyl > sample.covfiltered.bedMethyl

Coverage thresholds by use case:

Threshold	Use Case	Typical CpGs Retained
>=3x	Exploratory / maximum retention	~90% of CpGs
>=5x	Standard analysis	~80% of CpGs
>=10x	High-confidence quantitative	~60% of CpGs

4b. ENCODE Blocklist Filtering (Amemiya et al. 2019)

# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.covfiltered.bedMethyl -b hg38-blacklist.v2.bed -v > sample.filtered.bedMethyl

4c. Strand Merging (Optional but Recommended)

CpG methylation is typically symmetric (same on both strands). Merge strand-specific calls to increase per-CpG coverage. Caveat: Some ENCODE bedMethyl files may already be strand-merged — check if both strands are present before applying this step:

# Group CpGs by position (forward and reverse strand of same CpG)
# Sum coverage, calculate weighted average methylation
awk 'BEGIN{OFS="\t"} {
    # CpG position (use the C position as canonical)
    if ($6 == "+") pos = $2
    else pos = $2 - 1
    key = $1"\t"pos
    cov[key] += $10
    meth[key] += ($11/100) * $10
} END {
    for (k in cov) {
        split(k, a, "\t")
        avg_meth = (meth[k] / cov[k]) * 100
        print a[1], a[2], a[2]+2, "CpG", 0, ".", a[2], a[2]+2, "0,0,0", cov[k], avg_meth
    }
}' sample.filtered.bedMethyl | sort -k1,1 -k2,2n > sample.merged_strands.bedMethyl

Step 5: Cross-Sample Aggregation (Per-CpG Averaging)

5a. Create a Unified CpG Matrix

# Step 1: Find CpGs covered in at least M of N samples
# Extract positions from each sample
for f in sample*.merged_strands.bedMethyl; do
    awk 'BEGIN{OFS="\t"} {print $1, $2, $3}' "$f"
done | sort -k1,1 -k2,2n | uniq -c | \
awk -v M=2 '$1 >= M {print $2, $3, $4}' OFS="\t" > shared_cpgs.bed

# Step 2: For each sample, extract methylation at shared CpGs
for f in sample*.merged_strands.bedMethyl; do
    bedtools intersect -a shared_cpgs.bed -b "$f" -wa -wb | \
    awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $NF, $(NF-1)}' > "${f%.bedMethyl}.shar

methylation-aggregation

Como adicionar

Cole no README do seu repo

Skills relacionadas

xlsx

how-it-works

mem-search

weekly-digests

Receba novas skills de Dados e Análise toda segunda