Aggregate Hi-C Chromatin Contacts Across Studies
When to Use
- User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
- User asks "what regions are in 3D contact in my tissue?" or "aggregate loop calls across donors"
- User needs a union catalog of BEDPE loops with resolution-aware anchor matching
- User wants to identify high-confidence loops supported by multiple experiments
- Example queries: "aggregate Hi-C loops for K562", "combine chromatin contacts across labs", "find consensus TAD boundaries in liver"
Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.
Scientific Rationale
The question: "What regions are in 3D physical contact in my tissue?"
Like histone marks and accessibility, chromatin loops are a detection question. If a loop between Region A and Region B is detected in one donor but not in another, the contact is still treated as real, because individual variation, sequencing depth, and computational resolution can each explain its absence. We therefore want the union of all detected contacts.
Key Concepts
Hi-C data measures pairwise chromatin interactions genome-wide. After processing:
- Contact matrix (.hic file): Genome-wide interaction frequencies at multiple resolutions
- Loop calls (BEDPE): Statistically significant point interactions (loops) identified by algorithms such as HICCUPS (distributed with Juicer)
- TAD boundaries: Topologically associating domain boundaries
- Compartments: A/B compartment assignments
BEDPE format (Paired-End BED):
chr1 start1 end1 chr2 start2 end2 name score strand1 strand2
Each row represents a contact between two genomic anchor regions.
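As a minimal sketch of how one BEDPE row maps onto a pair of anchors, the parser below reads the first eight columns described above. The `Loop` type and `parse_bedpe_line` function are illustrative names, not part of any ENCODE tool, and the sketch assumes tab-separated input whose header or comment lines start with `#`, `track`, or `browser`.

```python
from typing import NamedTuple, Optional


class Loop(NamedTuple):
    """One chromatin contact: two anchors plus optional name and score."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: Optional[float]


def parse_bedpe_line(line: str) -> Optional[Loop]:
    """Parse one BEDPE row; return None for blank, comment, or header lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(("#", "track", "browser")):
        return None
    fields = line.split("\t")
    if len(fields) < 6:
        raise ValueError(f"BEDPE row needs at least 6 columns, got {len(fields)}")
    name = fields[6] if len(fields) > 6 else "."
    score: Optional[float] = None
    if len(fields) > 7:
        try:
            score = float(fields[7])
        except ValueError:
            score = None  # non-numeric placeholder such as "."
    return Loop(fields[0], int(fields[1]), int(fields[2]),
                fields[3], int(fields[4]), int(fields[5]), name, score)
```

Rows with only the six coordinate columns are accepted, with `name` set to `.` and `score` set to `None`.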
Literature Support
- Loop Catalog (Reyna et al. 2025, Nucleic Acids Research): Created a union catalog of 4.19M unique loops across 1,089 Hi-C datasets. Demonstrated that a union approach captures both tissue-specific and constitutive loops. Used resolution-aware merging at 5kb, 10kb, and 25kb bins.
- AQuA Tools (Chakraborty et al. 2025): Toolkit for BEDPE intersection, union, and annotation. Handles paired-region arithmetic.
- mariner (Flores et al. 2024, Bioinformatics): R/Bioconductor package for BEDPE manipulation including merging loops across experiments with configurable anchor tolerance.
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated Hi-C data across tissues to define regulatory loops connecting enhancers to promoters.
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter from loop anchors.
- Mustache (Roayaei Ardakany et al. 2020, Genome Biology, 165 citations): Multi-scale loop caller that recovers more validated loops than HICCUPS. Different callers produce discordant loop sets.
- Wolff et al. 2022 (GigaScience): Benchmark showing that loop sets from different callers overlap by ~50% at most, which is critical context for why a union approach is necessary.
Step 1: Find All Available Hi-C Data
encode_search_experiments(
assay_title="Hi-C",
organ="pancreas", # user's tissue of interest
biosample_type="tissue",
limit=100
)
Present a summary to the user:
- Total Hi-C experiments
- Labs represented
- Unique donors/biosamples
- Resolution(s) available (check experiment metadata)
Use encode_get_facets to check availability:
encode_get_facets(assay_title="Hi-C", organ="pancreas")
Note: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.
Step 2: Quality-Gate Each Experiment
encode_get_experiment(accession="ENCSR...")
Hi-C Quality Checks
- Audit status: no ERROR flags
- Sequencing depth: 400M+ valid read pairs for loop calling (ENCODE standard)
- Cis/trans ratio: >60% cis contacts expected (low cis suggests noisy library)
- Hi-C-specific QC: Library complexity, PCR duplicate rate
- Has loop calls (BEDPE output) — not all Hi-C experiments have called loops
- Resolution: 5-10 kb or finer for loop detection
Include if:
- Has BEDPE loop calls at consistent resolution
- Passes ENCODE audit (no ERROR flags)
- Adequate sequencing depth for loop resolution
Exclude if:
- ERROR audit flags
- Only contact matrices without loop calls
- Very low sequencing depth (<200M valid pairs — insufficient for loop calling)
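The include/exclude rules above could be encoded roughly as follows. This is only a sketch: `HiCExperimentQC` is a hypothetical record that you would fill in by hand from the experiment metadata, and it does not mirror the actual fields returned by `encode_get_experiment`. Depth between 200M and 400M valid pairs and a cis fraction at or below 60% are treated as warnings rather than exclusions, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_VALID_PAIRS = 200_000_000     # below this: exclude
TARGET_VALID_PAIRS = 400_000_000  # ENCODE standard for loop calling
MIN_CIS_FRACTION = 0.60           # expected lower bound on cis contacts


@dataclass
class HiCExperimentQC:
    """Hypothetical per-experiment QC summary (not an ENCODE API object)."""
    accession: str
    has_error_audit: bool
    valid_pairs: int
    cis_fraction: float
    has_loop_calls: bool


def quality_gate(qc: HiCExperimentQC) -> Tuple[bool, List[str], List[str]]:
    """Return (include, exclusion_reasons, warnings) for one experiment."""
    reasons: List[str] = []
    warnings: List[str] = []
    if qc.has_error_audit:
        reasons.append("ERROR audit flag")
    if not qc.has_loop_calls:
        reasons.append("contact matrices only, no loop calls")
    if qc.valid_pairs < MIN_VALID_PAIRS:
        reasons.append("fewer than 200M valid pairs")
    elif qc.valid_pairs < TARGET_VALID_PAIRS:
        warnings.append("below the 400M valid-pair standard")
    if qc.cis_fraction <= MIN_CIS_FRACTION:
        warnings.append("cis fraction at or below 60%")
    return (not reasons, reasons, warnings)
```

Returning the reasons alongside the decision makes it easier to report to the user why an experiment was dropped.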
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download Loop Call Files
For each experiment, get BEDPE loop calls:
# Search for loop/interaction files
encode_list_files(
experiment_accession="ENCSR...",
file_format="bedpe",
assembly="GRCh38"
)
# Also check for BED-formatted loop files
encode_list_files(
experiment_accession="ENCSR...",
output_type="chromatin interactions",
assembly="GRCh38"
)
# Or contact domains
encode_list_files(
experiment_accession="ENCSR...",
output_type="contact domains",
assembly="GRCh38"
)
File selection priority:
- Chromatin interactions (loop calls from HICCUPS or similar)
- Contact domains (TADs — different analysis, handle separately)
- Replicated loops (if available)
Prefer preferred_default=True files when available.
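One way to apply this preference is sketched below. The dictionary keys (`accession`, `output_type`, `preferred_default`) are assumed for illustration and may not match what `encode_list_files` actually returns, so adapt them to the real response.

```python
from typing import Dict, List


def select_loop_files(files: List[Dict],
                      output_type: str = "chromatin interactions") -> List[Dict]:
    """Keep files of the requested output type, preferring preferred_default ones.

    `files` is a list of plain dicts with hypothetical keys
    'accession', 'output_type', and 'preferred_default'.
    """
    candidates = [f for f in files if f.get("output_type") == output_type]
    preferred = [f for f in candidates if f.get("preferred_default")]
    # Fall back to every candidate when nothing is flagged as preferred
    return preferred or candidates
```

Contact domain files are deliberately left out of the default selection, since they are handled separately from loops.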
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/hic_loops",
organize_by="flat"
)
Step 4: Understanding Hi-C Resolution and Anchors
Critical: Resolution-Aware Processing
Hi-C loop anchors are binned regions, not precise positions. The resolution determines anchor size:
| Resolution | Anchor Width | Best For | Typical Loop Count |
|---|---|---|---|
| 5 kb | 5,000 bp | Fine-scale promoter-enhancer loops | More loops |
| 10 kb | 10,000 bp | Standard analysis | Moderate |
| 25 kb | 25,000 bp | Large-scale domain contacts | Fewer loops |
All loops being merged must be at the same resolution, or anchors must be harmonized to a common resolution.
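Before merging, it can help to check which resolution each file was called at. The sketch below infers it from anchor widths; the function names are illustrative, and loops are assumed to be tuples whose first six elements are the BEDPE coordinates. Some callers report loops at several resolutions in a single file, so the most common width is only a rough summary.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def dominant_resolution(loops: Iterable[Sequence]) -> int:
    """Return the most common anchor width (in bp) across both anchors."""
    widths: Counter = Counter()
    for loop in loops:
        _, start1, end1, _, start2, end2 = loop[:6]
        widths[end1 - start1] += 1
        widths[end2 - start2] += 1
    if not widths:
        raise ValueError("no loops supplied")
    return widths.most_common(1)[0][0]


def needs_harmonizing(samples: Dict[str, Sequence[Sequence]]) -> bool:
    """True when the samples do not all share one dominant resolution."""
    resolutions = {dominant_resolution(loops) for loops in samples.values()}
    return len(resolutions) > 1
```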
Harmonizing Resolution
If experiments have loops called at different resolutions:
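# Snap finer-resolution (e.g., 5kb) anchors onto 10kb bins
# Lines starting with "#" (headers/comments) are skipped
awk -v res=10000 'BEGIN{OFS="\t"} /^#/ {next} {
    # Bin anchor 1 by its start coordinate
    bin1_start = int($2/res) * res
    bin1_end = bin1_start + res
    # Bin anchor 2 by its start coordinate
    bin2_start = int($5/res) * res
    bin2_end = bin2_start + res
    print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10
}' fine_res_loops.bedpe > harmonized_loops.bedpe
# Note: distinct fine-resolution loops can land in the same bin pair,
# so the output may contain duplicate anchor pairs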
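Re-binning can map several fine-resolution loops onto the same pair of coarse bins. The sketch below applies the same start-coordinate binning in Python and collapses such duplicates, keeping the first loop seen per bin pair. That tie-breaking rule is an assumption of this sketch; keeping the highest-scoring loop would be an equally reasonable choice.

```python
from typing import Iterable, List, Sequence, Tuple


def rebin_loops(loops: Iterable[Sequence], res: int = 10_000) -> List[Tuple]:
    """Snap both anchors onto `res`-sized bins and drop duplicate bin pairs.

    Each loop is (chrom1, start1, end1, chrom2, start2, end2, *extra_columns).
    """
    seen = {}
    for chrom1, start1, _end1, chrom2, start2, _end2, *rest in loops:
        bin1 = (start1 // res) * res
        bin2 = (start2 // res) * res
        key = (chrom1, bin1, chrom2, bin2)
        if key not in seen:  # keep the first loop for each bin pair
            seen[key] = (chrom1, bin1, bin1 + res,
                         chrom2, bin2, bin2 + res, *rest)
    return list(seen.values())
```

Because dictionaries preserve insertion order, the output follows the order in which bin pairs first appear in the input.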
Step 5: Per-Sample Filtering
5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
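Remove loops with anchors in artifact-prone regions. Download hg38-blacklist.v2.bed.gz from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz, decompress it, and save it as ENCODE_blocklist.bed, the file name used in the commands below: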
# Filter loops where EITHER anchor overlaps a blocklist region
# First, extract anchor 1 and anchor 2 as separate BED files
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed
awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed
# Find anchor rows NOT in blocklist
bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt
bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt
# Keep only rows where BOTH anchors pass
comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt
awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe
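The same both-anchors-must-pass logic can be sketched in pure Python, which is handy for unit-testing the rule without bedtools. Intervals are treated as half-open BED coordinates, so regions that merely touch do not count as overlapping. The function names are illustrative, and the linear scan per chromosome assumes a small blocklist.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def index_blocklist(regions: Iterable[Tuple[str, int, int]]) -> Dict[str, List[Tuple[int, int]]]:
    """Group blocklist intervals by chromosome."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    return by_chrom


def overlaps(chrom: str, start: int, end: int, by_chrom) -> bool:
    """True if [start, end) shares at least 1 bp with any blocklist interval."""
    return any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, ()))


def filter_blocklist(loops: Iterable[Sequence], regions) -> List[Sequence]:
    """Keep loops where neither anchor overlaps a blocklist region."""
    by_chrom = index_blocklist(regions)
    return [loop for loop in loops
            if not overlaps(loop[0], loop[1], loop[2], by_chrom)
            and not overlaps(loop[3], loop[4], loop[5], by_chrom)]
```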
5b. Score Filtering
Filter by interaction score/significance:
# If BEDPE has a score column (col 8), filter to significant interactions
# Keep top 75% by score (true distribution quantile, not range-based)
TOTAL=$(wc -l < sample.filtered.bedpe)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k8,8n sample.filtered.bedpe |