Aggregate Hi-C Chromatin Contacts Across Studies

When to Use

User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
User asks "what regions are in 3D contact in my tissue?" or "aggregate loop calls across donors"
User needs a union catalog of BEDPE loops with resolution-aware anchor matching
User wants to identify high-confidence loops supported by multiple experiments
Example queries: "aggregate Hi-C loops for K562", "combine chromatin contacts across labs", "find consensus TAD boundaries in liver"

Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.

Scientific Rationale

The question: "What regions are in 3D physical contact in my tissue?"

Like histone marks and accessibility, chromatin loops are a detection question. If a loop between Region A and Region B is detected in one donor but not another, the contact is most likely still real: individual variation, sequencing depth, and computational resolution can each explain its absence. We therefore want the union of all detected contacts.
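As an illustration of the union idea, here is a minimal Python sketch that merges loop calls from several experiments and records which experiments support each loop. The function name `union_loops`, the input layout, and the example coordinates are assumptions made for this sketch. It matches loops by exact anchor coordinates only, so it presumes all inputs are already at one resolution.

```python
from collections import defaultdict

def union_loops(experiments):
    """Union loop calls across experiments and track per-loop support.

    `experiments` maps a label to an iterable of loops, each loop a tuple
    (chrom1, start1, end1, chrom2, start2, end2). Hypothetical layout.
    """
    support = defaultdict(set)
    for label, loops in experiments.items():
        for c1, s1, e1, c2, s2, e2 in loops:
            a, b = (c1, s1, e1), (c2, s2, e2)
            # Order the two anchors so that A-B and B-A count as the same loop
            key = (a, b) if a <= b else (b, a)
            support[key].add(label)
    return dict(support)

# Made-up example calls from two donors
calls = {
    "donor1": [("chr1", 10000, 20000, "chr1", 500000, 510000)],
    "donor2": [("chr1", 500000, 510000, "chr1", 10000, 20000),
               ("chr2", 30000, 40000, "chr2", 90000, 100000)],
}
catalog = union_loops(calls)
for (a, b), labels in sorted(catalog.items()):
    print(a, b, "supported by", len(labels), "experiment(s)")
```

The support count per loop gives a simple way to flag high-confidence loops (seen in two or more experiments) without discarding loops seen only once.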

Key Concepts

Hi-C data measures pairwise chromatin interactions genome-wide. After processing:

Contact matrix (.hic file): Genome-wide interaction frequencies at multiple resolutions
Loop calls (BEDPE): Statistically significant point interactions (loops) identified by callers such as HiCCUPS (part of the Juicer tools) or Mustache
TAD boundaries: Topologically associating domain boundaries
Compartments: A/B compartment assignments

BEDPE format (Paired-End BED):

chr1  start1  end1  chr2  start2  end2  name  score  strand1  strand2

Each row represents a contact between two genomic anchor regions.
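A short sketch of reading one BEDPE row into a typed record, assuming tab-separated fields in the column order shown above. The `Loop` type and `parse_bedpe_line` helper are hypothetical names for this example. Lines starting with `#` are treated as headers, which is an assumption about how a given caller writes its output.

```python
from typing import NamedTuple, Optional

class Loop(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str = "."
    score: Optional[float] = None

def parse_bedpe_line(line):
    """Return a Loop, or None for blank lines and '#' header/comment lines."""
    text = line.rstrip("\n")
    if not text.strip() or text.startswith("#"):
        return None
    fields = text.split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 tab-separated fields, got {len(fields)}")
    name = fields[6] if len(fields) > 6 else "."
    score = None
    if len(fields) > 7:
        try:
            score = float(fields[7])
        except ValueError:
            score = None  # non-numeric placeholder such as "."
    return Loop(fields[0], int(fields[1]), int(fields[2]),
                fields[3], int(fields[4]), int(fields[5]), name, score)

row = "chr1\t10000\t20000\tchr1\t500000\t510000\tloop1\t12.5\t.\t."
loop = parse_bedpe_line(row)
print(loop)
```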

Literature Support

Loop Catalog (Reyna et al. 2025, Nucleic Acids Research): Created a union catalog of 4.19M unique loops across 1,089 Hi-C datasets. Demonstrated that union approach captures tissue-specific and constitutive loops. Used resolution-aware merging at 5kb, 10kb, and 25kb bins.
AQuA Tools (Chakraborty et al. 2025): Toolkit for BEDPE intersection, union, and annotation. Handles paired-region arithmetic.
mariner (Flores et al. 2024, Bioinformatics): R/Bioconductor package for BEDPE manipulation including merging loops across experiments with configurable anchor tolerance.
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated Hi-C data across tissues to define regulatory loops connecting enhancers to promoters.
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter from loop anchors.
Mustache (Roayaei Ardakany et al. 2020, Genome Biology, 165 citations): Multi-scale loop caller that recovers more validated loops than HICCUPS. Different callers produce discordant loop sets.
Wolff et al. 2022 (GigaScience): Benchmark showing that loop callers overlap by at most ~50%, which is critical context for why a union approach is necessary.

Step 1: Find All Available Hi-C Data

encode_search_experiments(
    assay_title="Hi-C",
    organ="pancreas",           # user's tissue of interest
    biosample_type="tissue",
    limit=100
)

Present a summary to the user:

Total Hi-C experiments
Labs represented
Unique donors/biosamples
Resolution(s) available (check experiment metadata)

Use encode_get_facets to check availability:

encode_get_facets(assay_title="Hi-C", organ="pancreas")

Note: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.

Step 2: Quality-Gate Each Experiment

encode_get_experiment(accession="ENCSR...")

Hi-C Quality Checks

Audit status: no ERROR flags
Sequencing depth: 400M+ valid read pairs for loop calling (ENCODE standard)
Cis/trans ratio: >60% cis contacts expected (low cis suggests noisy library)
Hi-C-specific QC: Library complexity, PCR duplicate rate
Has loop calls (BEDPE output) — not all Hi-C experiments have called loops
Resolution: at least 5-10kb resolution for loop detection

Include if:

Has BEDPE loop calls at consistent resolution
Passes ENCODE audit (no ERROR flags)
Adequate sequencing depth for loop resolution

Exclude if:

ERROR audit flags
Only contact matrices without loop calls
Very low sequencing depth (<200M valid pairs — insufficient for loop calling)
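The include/exclude rules above can be sketched as a small gate function. The metadata keys (`audit_errors`, `has_loop_calls`, `valid_pairs`, `cis_fraction`) are hypothetical and would need to be mapped from whatever `encode_get_experiment` actually returns. The "review" outcome for experiments between the 200M and 400M thresholds is an assumption of this sketch, not a rule stated above.

```python
def quality_gate(meta):
    """Return (decision, reasons) for one Hi-C experiment.

    `meta` is a plain dict with hypothetical keys; see the lead-in.
    """
    reasons = []
    if meta.get("audit_errors", 0) > 0:
        reasons.append("ERROR audit flags")
    if not meta.get("has_loop_calls", False):
        reasons.append("contact matrices only, no loop calls")
    pairs = meta.get("valid_pairs")
    if pairs is not None and pairs < 200_000_000:
        reasons.append("fewer than 200M valid pairs")
    if reasons:
        return "exclude", reasons

    # Soft checks: passing experiments that still deserve a closer look
    warnings = []
    if pairs is None or pairs < 400_000_000:
        warnings.append("below the 400M valid-pair guideline or depth unknown")
    cis = meta.get("cis_fraction")
    if cis is not None and cis <= 0.60:
        warnings.append("cis fraction at or below 60%")
    return ("review" if warnings else "include"), warnings

good = quality_gate({"audit_errors": 0, "has_loop_calls": True,
                     "valid_pairs": 450_000_000, "cis_fraction": 0.72})
shallow = quality_gate({"audit_errors": 0, "has_loop_calls": True,
                        "valid_pairs": 150_000_000, "cis_fraction": 0.70})
print(good, shallow)
```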

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download Loop Call Files

For each experiment, get BEDPE loop calls:

# Search for loop/interaction files
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bedpe",
    assembly="GRCh38"
)

# Also search by output type, in case loop files are not tagged with the bedpe format
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="chromatin interactions",
    assembly="GRCh38"
)

# Or contact domains
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="contact domains",
    assembly="GRCh38"
)

File selection priority:

Chromatin interactions (loop calls from HICCUPS or similar)
Contact domains (TADs — different analysis, handle separately)
Replicated loops (if available)

Prefer preferred_default=True files when available.
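One way to apply this preference, sketched under the assumption that the file listing can be treated as a list of dicts with `accession`, `output_type`, and `preferred_default` keys. The actual return shape of `encode_list_files` is not shown here, and the helper name and placeholder accessions are hypothetical.

```python
def pick_loop_files(files, output_type="chromatin interactions"):
    """Select files of one output type, preferring preferred_default ones.

    Falls back to all files of that output type when none is marked preferred.
    """
    candidates = [f for f in files if f.get("output_type") == output_type]
    preferred = [f for f in candidates if f.get("preferred_default")]
    return preferred or candidates

# Made-up listing with placeholder accessions
listing = [
    {"accession": "FILE_A", "output_type": "chromatin interactions", "preferred_default": False},
    {"accession": "FILE_B", "output_type": "chromatin interactions", "preferred_default": True},
    {"accession": "FILE_C", "output_type": "contact domains", "preferred_default": True},
]
chosen = pick_loop_files(listing)
print([f["accession"] for f in chosen])
```

Contact domains are selected with a separate call (`output_type="contact domains"`), which keeps TADs out of the loop catalog.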

encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/hic_loops",
    organize_by="flat"
)

Step 4: Understanding Hi-C Resolution and Anchors

Critical: Resolution-Aware Processing

Hi-C loop anchors are binned regions, not precise positions. The resolution determines anchor size:

Resolution	Anchor Width	Best For	Typical Loop Count
5 kb	5,000 bp	Fine-scale promoter-enhancer loops	More loops
10 kb	10,000 bp	Standard analysis	Moderate
25 kb	25,000 bp	Large-scale domain contacts	Fewer loops

All loops being merged must be at the same resolution, or anchors must be harmonized to a common resolution.
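Before merging, it can help to check which anchor widths are actually present across the downloaded files. The sketch below is one possible check. The helper names and example loops are made up for illustration, and it assumes each loop is a 6-tuple of anchor coordinates.

```python
from collections import Counter

def anchor_width_counts(loops):
    """Count anchor widths (end - start) over both anchors of every loop."""
    widths = Counter()
    for _c1, s1, e1, _c2, s2, e2 in loops:
        widths[e1 - s1] += 1
        widths[e2 - s2] += 1
    return widths

def needs_harmonizing(files):
    """True if more than one anchor width appears across all files.

    `files` maps a label to a list of loops.
    """
    seen = set()
    for loops in files.values():
        seen.update(anchor_width_counts(loops))
    return len(seen) > 1

files = {
    "expA": [("chr1", 10000, 15000, "chr1", 500000, 505000)],   # 5 kb anchors
    "expB": [("chr1", 10000, 20000, "chr1", 500000, 510000)],   # 10 kb anchors
}
print(anchor_width_counts(files["expA"]), needs_harmonizing(files))
```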

Harmonizing Resolution

If experiments have loops called at different resolutions:

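# Coarsen finer-resolution (e.g. 5 kb) anchors to 10 kb bins.
# Several fine-resolution loops can land in the same bin pair, so keep one row per pair.
awk -v res=10000 'BEGIN{OFS="\t"} /^#/ {next} {
    # Snap anchor 1 to the res-sized bin containing its start
    bin1_start = int($2/res) * res
    bin1_end = bin1_start + res
    # Snap anchor 2 the same way
    bin2_start = int($5/res) * res
    bin2_end = bin2_start + res
    print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10
}' fine_res_loops.bedpe |
    sort -k1,1 -k2,2n -k4,4 -k5,5n -u > harmonized_loops.bedpe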

Step 5: Per-Sample Filtering

5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

Remove loops with anchors in artifact-prone regions. Download the hg38 list from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz and decompress it; the commands below assume it is saved as ENCODE_blocklist.bed:

# Filter loops where EITHER anchor overlaps a blocklist region
# First, extract anchor 1 and anchor 2 as separate BED files
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed
awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed

# Find anchor rows NOT in blocklist
bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt
bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt

# Keep only rows where BOTH anchors pass
comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt
awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe
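For small files, or where bedtools is not available, the same both-anchors-must-pass rule can be sketched in plain Python. This is an illustrative alternative and not a replacement for the bedtools commands. The helper names and the blocklist layout (chromosome mapped to a list of half-open intervals) are assumptions, and the linear scan is only suitable for modest inputs.

```python
def overlaps_blocklist(chrom, start, end, blocklist):
    """True if the half-open interval [start, end) overlaps any blocklist region."""
    return any(start < b_end and b_start < end
               for b_start, b_end in blocklist.get(chrom, ()))

def filter_loops(loops, blocklist):
    """Keep loops whose two anchors both avoid the blocklist."""
    return [lp for lp in loops
            if not overlaps_blocklist(lp[0], lp[1], lp[2], blocklist)
            and not overlaps_blocklist(lp[3], lp[4], lp[5], blocklist)]

# Made-up blocklist region and loops
blocklist = {"chr1": [(120000, 130000)]}
loops = [
    ("chr1", 10000, 20000, "chr1", 500000, 510000),   # both anchors clean
    ("chr1", 125000, 135000, "chr1", 700000, 710000), # anchor 1 overlaps
    ("chr1", 130000, 140000, "chr1", 800000, 810000), # touches the end only, no overlap
]
kept = filter_loops(loops, blocklist)
print(len(kept), "of", len(loops), "loops kept")
```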

5b. Score Filtering

Filter by interaction score/significance:

# If BEDPE has a score column (col 8), filter to significant interactions
# Keep top 75% by score (true distribution quantile, not range-based)
TOTAL=$(wc -l < sample.filtered.bedpe)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k8,8n sample.filtered.bedpe |
