ENCODE Hi-C Pipeline: FASTQ to Contact Matrices and Loops
When to Use
- User wants to run a Hi-C processing pipeline from FASTQ to contact matrices and loop calls
- User asks about "Hi-C pipeline", "contact matrix", "loop calling", "Juicer", "HiCCUPS", or "TAD detection"
- User needs to process Hi-C data for 3D genome structure analysis
- Example queries: "process my Hi-C FASTQs", "generate contact matrices from Hi-C", "call chromatin loops with HiCCUPS"
Execute the ENCODE Hi-C pipeline for chromatin conformation capture data, producing multi-resolution contact matrices, loop calls, and compartment annotations.
Pipeline Overview
    FASTQ -> Trim -> BWA (per-mate) -> pairtools parse -> dedup -> .pairs
                                                                     |
                                                        +------------+------------+
                                                        |                         |
                                               Juicer pre -> .hic         cooler -> .mcool
                                                        |                         |
                                                  HiCCUPS loops             Compartments
ENCODE Repository
- GitHub: ENCODE-DCC/hic-pipeline
- Container: encodedcc/hic-pipeline
- WDL: Available for Cromwell execution
- This skill: Nextflow DSL2 reimplementation for portability
Core Tools and Versions
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| BWA-MEM | 0.7.17 | Alignment (per-mate) | Li & Durbin 2009 |
| pairtools | 1.0.3 | Pair classification, dedup | Open2C |
| Juicer tools | 2.20.00 | .hic generation, HiCCUPS | Durand et al. 2016 |
| cooler | 0.9.3 | .cool/.mcool generation | Abdennur & Mirny 2020 |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Key Literature
- Rao et al. 2014 - "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping" (Cell, ~5,000 citations) DOI: 10.1016/j.cell.2014.11.021
- Lieberman-Aiden et al. 2009 - "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome" (Science, ~6,000 citations) DOI: 10.1126/science.1181369
- Durand et al. 2016 - "Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments" (Cell Systems, ~2,000 citations) DOI: 10.1016/j.cels.2016.07.002
- Abdennur & Mirny 2020 - "Cooler: scalable storage for Hi-C data and other genomically labeled arrays" (Bioinformatics) DOI: 10.1093/bioinformatics/btz540
- Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Execution
Quick Start (Local)
nextflow run main.nf \
-profile local \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bwa_index '/ref/bwa_index/genome.fa' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir results/ \
-resume
SLURM HPC
nextflow run main.nf \
-profile slurm \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bwa_index '/ref/bwa_index/genome.fa' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir results/ \
-resume
Cloud (GCP / AWS)
nextflow run main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
--bwa_index 'gs://bucket/ref/genome.fa' \
--chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir 'gs://bucket/results/' \
-resume
Resource Requirements
| Step | CPUs | RAM | Time (2B contacts) |
|---|---|---|---|
| BWA alignment | 8 | 16 GB | 4-6 hours |
| pairtools parse | 4 | 8 GB | 2-3 hours |
| pairtools dedup | 4 | 16 GB | 1-2 hours |
| Juicer pre + hic | 4 | 64 GB | 2-4 hours |
| HiCCUPS | 4 | 16 GB (+ GPU optional) | 1-2 hours |
| Total | 8 | 64 GB | 8-16 hours |
Pipeline Parameters
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --bwa_index | required | Path to BWA genome index (.fa with .bwt etc.) |
| --chrom_sizes | required | Chromosome sizes file |
| --restriction_site | GATC | Restriction enzyme site (GATC for MboI/DpnII) |
| --outdir | ./results | Output directory |
| --resolutions | 1000,5000,10000,25000,50000,100000,250000,500000,1000000 | Matrix resolutions |
| --min_mapq | 30 | Minimum MAPQ for pair filtering |
| --assembly | hg38 | Genome assembly name for .hic header |
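The `--resolutions` value is a comma-separated list of bin sizes in base pairs. The sketch below shows one way such a string could be validated before a run. The function name `parse_resolutions` is hypothetical and not part of the pipeline. The multiple-of-base check is an assumption based on how matrix coarsening generally works (coarser bins are built by merging whole finer bins), not a documented rule of this pipeline.

```python
def parse_resolutions(spec: str) -> list[int]:
    """Parse a comma-separated resolution string into sorted, unique bin sizes (bp).

    Hypothetical helper: raises ValueError if the list is empty, contains a
    non-positive value, or contains a value that is not a multiple of the
    finest resolution.
    """
    values = sorted({int(token) for token in spec.split(",") if token.strip()})
    if not values:
        raise ValueError("no resolutions given")
    if values[0] <= 0:
        raise ValueError("resolutions must be positive integers")
    base = values[0]
    not_multiples = [v for v in values if v % base != 0]
    if not_multiples:
        raise ValueError(f"not multiples of base resolution {base}: {not_multiples}")
    return values


default = "1000,5000,10000,25000,50000,100000,250000,500000,1000000"
print(parse_resolutions(default))
```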
Output Files
results/
    fastqc/                               # Raw read quality
    alignment/
        {sample}.R1.bam                   # Per-mate alignments
        {sample}.R2.bam
    pairs/
        {sample}.pairs.gz                 # Classified, deduplicated pairs
        {sample}.dedup_stats.txt          # Duplication metrics
        {sample}.pair_stats.txt           # Pair type classification
    matrices/
        {sample}.hic                      # Juicer .hic file (primary output)
        {sample}.mcool                    # Cooler multi-resolution matrix
    loops/
        {sample}.hiccups_loops.bedpe      # Called loops (HiCCUPS)
    qc/
        {sample}.contact_stats.txt        # Contact statistics
    multiqc/
        multiqc_report.html
.hic File Format
The .hic format (Juicer) stores multi-resolution contact matrices together
with their normalization vectors. It can be visualized in Juicebox and
loaded with hic-straw in Python or R.
.mcool File Format
The .mcool format (cooler) is an HDF5-based container holding the contact matrix at multiple resolutions.
It is widely supported by cooler, cooltools, HiGlass, and FAN-C.
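As a rough illustration of how one resolution of an .mcool file might be read in Python, the sketch below builds the cooler URI for a given bin size and fetches a balanced matrix. It assumes the third-party `cooler` package is installed and that the file was balanced. The helper names `mcool_uri` and `fetch_balanced` and the file path are hypothetical.

```python
def mcool_uri(path: str, resolution: int) -> str:
    """Build the URI of one resolution inside a multi-resolution cooler file."""
    return f"{path}::/resolutions/{int(resolution)}"


def fetch_balanced(path: str, resolution: int, region: str):
    """Return the balanced contact matrix for a region as a NumPy array.

    Imports cooler lazily so the URI helper stays usable without it.
    """
    import cooler  # third-party dependency, assumed to be installed

    clr = cooler.Cooler(mcool_uri(path, resolution))
    return clr.matrix(balance=True).fetch(region)


# Hypothetical file name; only the URI is built here, so no file is needed.
print(mcool_uri("results/matrices/sample.mcool", 10000))
```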
QC Thresholds (ENCODE Standards)
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Valid pair fraction | >40% | 25-40% | <25% |
| Cis contacts (>20kb) | >40% | 25-40% | <25% |
| Cis/trans ratio | >1.5 | 1.0-1.5 | <1.0 |
| Library complexity (unique/total) | >0.7 | 0.5-0.7 | <0.5 |
| Contacts per resolution | See below | - | - |
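The thresholds above can be applied mechanically. The following is a minimal sketch that assumes the metrics are already available as fractions or ratios. The metric keys and the `classify` function are illustrative names, not pipeline outputs.

```python
# (pass_above, warning_from) per metric, taken from the QC table above.
THRESHOLDS = {
    "valid_pair_fraction": (0.40, 0.25),
    "cis_long_fraction": (0.40, 0.25),   # cis contacts > 20 kb
    "cis_trans_ratio": (1.5, 1.0),
    "library_complexity": (0.7, 0.5),    # unique / total
}


def classify(metric: str, value: float) -> str:
    """Return 'pass', 'warning', or 'fail' for one QC metric."""
    pass_above, warning_from = THRESHOLDS[metric]
    if value > pass_above:
        return "pass"
    if value >= warning_from:
        return "warning"
    return "fail"


# Example values are made up for illustration.
observed = {
    "valid_pair_fraction": 0.52,
    "cis_long_fraction": 0.31,
    "cis_trans_ratio": 0.9,
    "library_complexity": 0.74,
}
for name, value in observed.items():
    print(f"{name}: {value} -> {classify(name, value)}")
```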
Resolution vs Depth Requirements
| Resolution | Minimum Contacts Needed | Typical Depth |
|---|---|---|
| 1 kb | >2 billion | Very deep |
| 5 kb | >500 million | Deep |
| 10 kb | >200 million | Standard |
| 25 kb | >50 million | Moderate |
| 100 kb | >10 million | Low |
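To make the table actionable, a small lookup can report the finest resolution a library supports. This sketch encodes only the rows listed above. The function name is hypothetical, and contact counts below the last row return `None`.

```python
# (resolution in bp, minimum contacts) from the table above, finest first.
MIN_CONTACTS = [
    (1_000, 2_000_000_000),
    (5_000, 500_000_000),
    (10_000, 200_000_000),
    (25_000, 50_000_000),
    (100_000, 10_000_000),
]


def finest_supported_resolution(n_contacts: int):
    """Return the finest tabulated resolution (bp) that n_contacts supports."""
    for resolution, minimum in MIN_CONTACTS:
        if n_contacts > minimum:
            return resolution
    return None  # too shallow for any tabulated resolution


print(finest_supported_resolution(100_000_000))
```

With 100 million contacts the lookup lands on the 25 kb row, which matches the warning elsewhere in this document against calling 1 kb features at that depth.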
Pair Classification
pairtools classifies read pairs into categories:
| Category | Description | Use |
|---|---|---|
| UU | Both uniquely mapped | Valid contact |
| UR/RU | One unique, one rescued | Valid (rescued) |
| NU/UN | One unique, one unmapped (null) | Not used |
| DD | Pair marked as a duplicate | Removed |
| WW | Walk pair (both sides chimeric, not rescuable) | Not used; indicates a multi-ligation product |
| NR | Null/rescued pair (one side unmapped, the other rescued) | Not used |
Only UU pairs (and optionally UR/RU) are used for contact matrices.
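The selection rule can be sketched in plain Python over the text form of a .pairs file. In practice `pairtools select` is the usual tool for this step. The code below only illustrates the logic and assumes the pair type is in a column named `pair_type`, falling back to the eighth column if no `#columns:` header is present. The function name `filter_pairs` is hypothetical.

```python
KEEP = {"UU", "UR", "RU"}


def filter_pairs(lines, keep=KEEP):
    """Yield header lines unchanged and only the records whose pair type is kept."""
    pair_type_idx = 7  # assumed default position of pair_type
    for line in lines:
        if line.startswith("#"):
            if line.startswith("#columns:"):
                columns = line.split()[1:]
                pair_type_idx = columns.index("pair_type")
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[pair_type_idx] in keep:
            yield line


# Tiny made-up example with one valid pair and one duplicate.
example = [
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n",
    "r1\tchr1\t100\tchr1\t5000\t+\t-\tUU\n",
    "r2\tchr1\t200\tchr2\t300\t+\t+\tDD\n",
]
print("".join(filter_pairs(example)), end="")
```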
Critical Pitfalls
Restriction Enzyme Choice
The restriction enzyme determines fragment size and resolution:
- MboI/DpnII (GATC): 4-cutter, ~256 bp average fragment -- higher resolution
- HindIII (AAGCTT): 6-cutter, ~4 kb average fragment -- lower resolution
- Arima (proprietary): Two enzymes, ~160 bp average -- highest resolution
- Always verify which enzyme was used before processing
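The average fragment sizes quoted for the single-enzyme cases follow from the length of the recognition site: under the simplifying assumption of uniform, independent base composition, a site of length n occurs about once every 4^n bp. The sketch below shows that arithmetic and a naive site scan. Both function names are illustrative.

```python
def expected_fragment_length(site: str) -> int:
    """Expected spacing (bp) between sites, assuming uniform random sequence."""
    return 4 ** len(site)


def site_positions(sequence: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of site in sequence."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


print(expected_fragment_length("GATC"))    # MboI/DpnII: 256
print(expected_fragment_length("AAGCTT"))  # HindIII: 4096
print(site_positions("AAGATCTTGATCGG", "GATC"))
```

Real genomes are not uniform, so observed fragment lengths deviate from these estimates.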
Normalization Method
Different normalization methods yield different results:
- KR (Knight-Ruiz): Default in Juicer, balanced normalization
- ICE (Imakaev et al.): Used by cooler/cooltools, iterative correction
- VC (Vanilla Coverage): Simple coverage normalization
- ENCODE standard: KR normalization. Always document which was used.
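To give a feel for what balancing does, here is a toy matrix-balancing loop in pure Python. It is a simplified, damped variant (each step divides by the square root of the relative row sum) written for illustration. It is not the implementation used by Juicer, cooler, or cooltools, which also handle sparse storage, low-coverage bin masking, and diagonal filtering. The function name is hypothetical.

```python
def iterative_correction(matrix, n_iter=100):
    """Balance a small symmetric matrix so that all non-empty rows sum equally.

    Returns (balanced, bias) with matrix[i][j] ~= bias[i] * bias[j] * balanced[i][j].
    """
    n = len(matrix)
    balanced = [[float(x) for x in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in balanced]
        covered = [s for s in sums if s > 0]
        if not covered:
            break
        mean = sum(covered) / len(covered)
        # Damped update: square root of the relative coverage of each bin.
        step = [(s / mean) ** 0.5 if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
    return balanced, bias


raw = [[10, 2, 1], [2, 20, 4], [1, 4, 5]]
balanced, bias = iterative_correction(raw)
print([round(sum(row), 6) for row in balanced])
```

Vanilla Coverage corresponds roughly to a single undamped pass of this kind, which is why it is cheaper but less thorough than KR or ICE.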
Resolution Depends on Depth
Do not call features at resolutions unsupported by sequencing depth:
- Calling 1 kb loops from 100M contacts will produce noise
- Check the Juicer resolution QC to determine achievable resolution
- Loop calling (HiCCUPS) typical