ENCODE Hi-C Pipeline: FASTQ to Contact Matrices and Loops

When to Use

User wants to run a Hi-C processing pipeline from FASTQ to contact matrices and loop calls
User asks about "Hi-C pipeline", "contact matrix", "loop calling", "Juicer", "HiCCUPS", or "TAD detection"
User needs to process Hi-C data for 3D genome structure analysis
Example queries: "process my Hi-C FASTQs", "generate contact matrices from Hi-C", "call chromatin loops with HiCCUPS"

Execute the ENCODE Hi-C pipeline for chromatin conformation capture data, producing multi-resolution contact matrices, loop calls, and compartment annotations.

Pipeline Overview

FASTQ -> Trim -> BWA (per-mate) -> pairtools parse -> dedup -> .pairs
                                                                 |
                                                    +------------+------------+
                                                    |                         |
                                              Juicer pre -> .hic        cooler -> .mcool
                                                    |                         |
                                              HiCCUPS loops              Compartments

ENCODE Repository

GitHub: ENCODE-DCC/hic-pipeline
Container: encodedcc/hic-pipeline
WDL: Available for Cromwell execution
This skill: Nextflow DSL2 reimplementation for portability

Core Tools and Versions

Tool	Version	Purpose	Citation
BWA-MEM	0.7.17	Alignment (per-mate)	Li 2013 (arXiv:1303.3997)
pairtools	1.0.3	Pair classification, dedup	Open2C
Juicer tools	2.20.00	.hic generation, HiCCUPS	Durand et al. 2016
cooler	0.9.3	.cool/.mcool generation	Abdennur & Mirny 2020
samtools	1.19	BAM operations	Li et al. 2009
FastQC	0.12.1	Read quality	Andrews (Babraham)
MultiQC	1.21	Aggregated QC	Ewels et al. 2016

Key Literature

Rao et al. 2014 - "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping" (Cell, ~5,000 citations) DOI: 10.1016/j.cell.2014.11.021
Lieberman-Aiden et al. 2009 - "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome" (Science, ~6,000 citations) DOI: 10.1126/science.1181369
Durand et al. 2016 - "Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments" (Cell Systems, ~2,000 citations) DOI: 10.1016/j.cels.2016.07.002
Abdennur & Mirny 2020 - "Cooler: scalable storage for Hi-C data and other genomically labeled arrays" (Bioinformatics) DOI: 10.1093/bioinformatics/btz540
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z

Execution

Quick Start (Local)

nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --restriction_site 'GATC' \
    --outdir results/ \
    -resume

SLURM HPC

nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --restriction_site 'GATC' \
    --outdir results/ \
    -resume

Cloud (GCP / AWS)

nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index 'gs://bucket/ref/genome.fa' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --restriction_site 'GATC' \
    --outdir 'gs://bucket/results/' \
    -resume

Resource Requirements

Step	CPUs	RAM	Time (2B contacts)
BWA alignment	8	16 GB	4-6 hours
pairtools parse	4	8 GB	2-3 hours
pairtools dedup	4	16 GB	1-2 hours
Juicer pre + hic	4	64 GB	2-4 hours
HiCCUPS	4	16 GB (+ GPU optional)	1-2 hours
Total (peak CPUs/RAM)	8	64 GB	8-16 hours

Pipeline Parameters

Parameter	Default	Description
`--reads`	required	Glob pattern to paired FASTQ files
`--bwa_index`	required	Path prefix of the BWA index (the .fa path, with .bwt and the other index files alongside it)
`--chrom_sizes`	required	Chromosome sizes file
`--restriction_site`	`GATC`	Restriction enzyme site (GATC for MboI/DpnII)
`--outdir`	`./results`	Output directory
`--resolutions`	`1000,5000,10000,25000,50000,100000,250000,500000,1000000`	Matrix resolutions
`--min_mapq`	`30`	Minimum MAPQ for pair filtering
`--assembly`	`hg38`	Genome assembly name for .hic header

Output Files

results/
  fastqc/                         # Raw read quality
  alignment/
    {sample}.R1.bam               # Per-mate alignments
    {sample}.R2.bam
  pairs/
    {sample}.pairs.gz             # Classified, deduplicated pairs
    {sample}.dedup_stats.txt      # Duplication metrics
    {sample}.pair_stats.txt       # Pair type classification
  matrices/
    {sample}.hic                  # Juicer .hic file (primary output)
    {sample}.mcool                # Cooler multi-resolution matrix
  loops/
    {sample}.hiccups_loops.bedpe  # Called loops (HiCCUPS)
  qc/
    {sample}.contact_stats.txt    # Contact statistics
  multiqc/
    multiqc_report.html

.hic File Format

The .hic format (Juicer) stores multi-resolution contact matrices together with their normalization vectors. Files can be visualized in Juicebox and loaded with hic-straw in Python/R.

.mcool File Format

The .mcool format (cooler) is an HDF5-based container holding the contact matrix at multiple resolutions. It is widely supported by cooler, cooltools, HiGlass, and FAN-C.
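
Each resolution in an .mcool file is addressed with a cooler URI of the form `path::/resolutions/<binsize>`. As a minimal sketch, the hypothetical helper `mcool_uri` below builds that string; the file path in the usage note is illustrative only.

```python
def mcool_uri(path, resolution):
    """Build a cooler URI selecting one resolution inside an .mcool file."""
    if resolution <= 0:
        raise ValueError("resolution must be a positive bin size in bp")
    # Multi-resolution coolers keep each matrix under /resolutions/<binsize>.
    return f"{path}::/resolutions/{int(resolution)}"
```

With the cooler package installed, the resulting string can be passed to `cooler.Cooler(...)`, for instance `cooler.Cooler(mcool_uri("results/matrices/sample.mcool", 10000))`, provided that resolution exists in the file.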

QC Thresholds (ENCODE Standards)

Metric	Pass	Warning	Fail
Valid pair fraction	>40%	25-40%	<25%
Cis contacts (>20kb)	>40%	25-40%	<25%
Cis/trans ratio	>1.5	1.0-1.5	<1.0
Library complexity (unique/total)	>0.7	0.5-0.7	<0.5
Contacts per resolution	See below	-	-
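
The thresholds in the table above can be applied mechanically. The sketch below is one possible encoding, not the pipeline's own QC code: the metric key names are hypothetical, fractions are expressed as 0-1 values, and values that fall exactly on a boundary are treated as "warning" because the table uses strict inequalities for pass and fail.

```python
# (pass_above, fail_below) per metric, transcribed from the QC table.
# Key names are hypothetical; fractions are expressed on a 0-1 scale.
QC_THRESHOLDS = {
    "valid_pair_fraction": (0.40, 0.25),
    "cis_long_range_fraction": (0.40, 0.25),  # cis contacts > 20 kb
    "cis_trans_ratio": (1.5, 1.0),
    "library_complexity": (0.7, 0.5),  # unique / total
}


def classify_metric(name, value):
    """Return 'pass', 'warning', or 'fail' for a single QC metric."""
    pass_above, fail_below = QC_THRESHOLDS[name]
    if value > pass_above:
        return "pass"
    if value < fail_below:
        return "fail"
    return "warning"


def classify_sample(metrics):
    """Classify every metric in a {name: value} dict."""
    return {name: classify_metric(name, value) for name, value in metrics.items()}
```

Calling `classify_sample({"valid_pair_fraction": 0.45, "cis_trans_ratio": 0.8})` labels the first metric as a pass and the second as a fail.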

Resolution vs Depth Requirements

Resolution	Minimum Contacts Needed	Typical Depth
1 kb	>2 billion	Very deep
5 kb	>500 million	Deep
10 kb	>200 million	Standard
25 kb	>50 million	Moderate
100 kb	>10 million	Low
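
The table above can be turned into a simple lookup that reports the finest resolution a library supports. This is a sketch under the table's own numbers; `finest_supported_resolution` is a hypothetical helper, and a depth exactly equal to a minimum is treated as insufficient because the table states strict lower bounds.

```python
# (resolution in bp, minimum contacts), ordered from finest to coarsest,
# transcribed from the table above.
MIN_CONTACTS = [
    (1_000, 2_000_000_000),
    (5_000, 500_000_000),
    (10_000, 200_000_000),
    (25_000, 50_000_000),
    (100_000, 10_000_000),
]


def finest_supported_resolution(n_contacts):
    """Return the finest resolution (bp) whose minimum depth is exceeded.

    Returns None when the library is too shallow even for the coarsest row.
    """
    for resolution, minimum in MIN_CONTACTS:
        if n_contacts > minimum:
            return resolution
    return None
```

A library with 100 million contacts maps to 25 kb under this lookup, which is why loop calls at 1 kb from such a library should not be trusted.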

Pair Classification

pairtools classifies read pairs into categories:

Category	Description	Use
UU	Both uniquely mapped	Valid contact
UR/RU	One unique, one rescued	Valid (rescued)
NU/UN	One unique, one unmapped (null)	Not used
DD	Pair marked as a duplicate	Removed
WW	Walk pair (multiple ligations that could not be rescued)	Indicates ligation artifact
NR	Null-rescued pair (one side unmapped, one rescued)	Not used

Only UU pairs (and optionally UR/RU) are used for contact matrices.
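
The selection rule above can be sketched as a filter over the text of a .pairs file. This is an illustration rather than the pipeline's code: `filter_pairs` is a hypothetical helper, the records are toy data, and the pair type is assumed to sit in the eighth tab-separated column (index 7), so check the `#columns:` header of real files.

```python
from collections import Counter

VALID_TYPES = {"UU"}
RESCUED_TYPES = {"UR", "RU"}

# Toy records for illustration only; real files come from pairtools.
EXAMPLE_PAIRS = [
    "## pairs format v1.0",
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type",
    "\t".join(["r1", "chr1", "1000", "chr1", "50000", "+", "-", "UU"]),
    "\t".join(["r2", "chr1", "2000", "chr2", "7000", "+", "+", "UR"]),
    "\t".join(["r3", "chr1", "3000", "chr1", "3200", "-", "+", "DD"]),
    "\t".join(["r4", "chr2", "4000", "chr3", "9000", "+", "-", "WW"]),
]


def filter_pairs(lines, include_rescued=False, pair_type_col=7):
    """Keep UU (and optionally UR/RU) records; also count every pair type.

    Header lines (starting with '#') and blank lines are skipped.
    Returns (kept_lines, counts_by_pair_type).
    """
    keep = VALID_TYPES | (RESCUED_TYPES if include_rescued else set())
    kept, counts = [], Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = line.rstrip("\n")
        pair_type = record.split("\t")[pair_type_col]
        counts[pair_type] += 1
        if pair_type in keep:
            kept.append(record)
    return kept, counts
```

On the toy records, the default call keeps only `r1`, and `include_rescued=True` additionally keeps `r2`.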

Critical Pitfalls

Restriction Enzyme Choice

The restriction enzyme determines fragment size and resolution:

MboI/DpnII (GATC): 4-cutter, ~256 bp average fragment -- higher resolution
HindIII (AAGCTT): 6-cutter, ~4 kb average fragment -- lower resolution
Arima (proprietary): Two enzymes, ~160 bp average -- highest resolution
Always verify which enzyme was used before processing
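
The average fragment sizes quoted above for single-enzyme digests follow from the length of the recognition site: in a sequence with equal, independent base frequencies, an n-base site occurs about once every 4^n bases. The sketch below shows that arithmetic plus a toy in-silico digest. Both helpers are hypothetical, real genomes deviate from the uniform model, and the digest splits at the start of each site rather than at the enzyme's true cut position.

```python
def expected_fragment_length(site):
    """Expected spacing (bp) between occurrences of a site, assuming
    equal and independent base frequencies."""
    return 4 ** len(site)


def fragment_lengths(sequence, site):
    """Lengths of the fragments obtained by splitting a sequence at the
    start of every occurrence of site (a simplification of real digestion)."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments, e.g. when the sequence begins with the site.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
```

`expected_fragment_length("GATC")` gives 256 and `expected_fragment_length("AAGCTT")` gives 4096, matching the ~256 bp and ~4 kb figures in the list.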

Normalization Method

Different normalization methods yield different results:

KR (Knight-Ruiz): Default in Juicer, balanced normalization
ICE (Imakaev et al.): Used by cooler/cooltools, iterative correction
VC (Vanilla Coverage): Simple coverage normalization
ENCODE standard: KR normalization. Always document which was used.
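
To make the idea of balancing concrete, the sketch below rescales a small symmetric matrix until all rows have equal sums, which is the goal shared by ICE-style iterative correction and KR. It is a toy illustration only: the `balance` function is hypothetical, it uses a damped square-root update chosen here for stable convergence on tiny dense matrices, and it is not the update rule used by cooler or Juicer.

```python
def balance(matrix, n_iter=200, tol=1e-9):
    """Iteratively rescale a symmetric matrix so that row sums become equal.

    Returns (balanced, bias) with balanced[i][j] = matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    balanced = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        row_sums = [sum(row) for row in balanced]
        covered = [s for s in row_sums if s > 0]
        if not covered:
            break
        mean = sum(covered) / len(covered)
        # Damped (square-root) update; bins without coverage are left untouched.
        scale = [(s / mean) ** 0.5 if s > 0 else 1.0 for s in row_sums]
        if max(abs(s - 1.0) for s in scale) < tol:
            break  # row sums are already (nearly) equal
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return balanced, bias


# Toy symmetric contact matrix, for illustration only.
TOY_MATRIX = [
    [10, 4, 2],
    [4, 8, 1],
    [2, 1, 6],
]
```

After `balanced, bias = balance(TOY_MATRIX)`, every row of `balanced` sums to the same value, and multiplying an entry by the two corresponding bias factors recovers the raw count. Because different methods produce different bias vectors, matrices normalized with KR, ICE, and VC are not directly interchangeable.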

Resolution Depends on Depth

Do not call features at resolutions unsupported by sequencing depth:

Calling 1 kb loops from 100M contacts will produce noise
Check the Juicer resolution QC to determine achievable resolution
Loop calling (HiCCUPS) typical
