When to Use
- User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue
- User asks "where is chromatin accessible in my tissue?" or "build an open chromatin map"
- User needs to merge accessibility data from different labs, donors, or platforms (ATAC vs DNase)
- User wants a comprehensive set of open chromatin regions for regulatory element discovery
- Example queries: "aggregate ATAC-seq peaks for pancreas", "combine DNase-seq across donors", "find all accessible regions in liver"
Aggregate Chromatin Accessibility Peaks Across Studies
Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments.
Scientific Rationale
The question: "Where is chromatin accessible in my tissue?"
Like histone marks, chromatin accessibility is a detection question. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the union of all detections.
ATAC-seq vs DNase-seq
Both measure open chromatin but with different biases:
| Property | ATAC-seq | DNase-seq |
|---|---|---|
| Method | Tn5 transposase insertion | DNase I hypersensitivity |
| Input required | ~50K cells | ~1M cells |
| Resolution | High | High |
| GC bias | Moderate (Tn5 preference) | Low |
| Mitochondrial reads | High (filter needed) | None |
| ENCODE availability | Newer experiments | Extensive historical catalog |
| Comparability | Generally comparable at open regions |
Literature Support
- Corces et al. 2017 (Nature Methods, 733 citations): Established that ATAC-seq and DNase-seq identify largely overlapping accessible regions, with ATAC capturing ~75% of DNase sites. Both are valid for union maps.
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Comprehensive set of problematic genomic regions to filter. Essential for all functional genomics analyses. DOI
- F-Seq2 (Zhao & Boyle 2020, NAR Genomics): Improved peak caller for DNase-seq and ATAC-seq with proper test statistics for IDR compatibility.
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated accessibility data with histone marks across tissues for chromatin state annotation.
Recommendation: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms.
Step 1: Find All Available Accessibility Data
# ATAC-seq
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
biosample_type="tissue",
limit=100
)
# DNase-seq
encode_search_experiments(
assay_title="DNase-seq",
organ="pancreas",
biosample_type="tissue",
limit=100
)
Present a summary to the user:
- Total ATAC-seq experiments
- Total DNase-seq experiments
- Labs represented
- Whether to use one or both assay types
Combining ATAC + DNase?
Ask the user:
- Same assay only (purest comparison, no cross-platform effects)
- Both assays combined (maximum coverage, slight platform variation)
For a comprehensive accessibility catalog, combining both is scientifically justified.
Step 2: Quality-Gate Each Experiment
encode_get_experiment(accession="ENCSR...")
ATAC-seq Quality Checks
- Audit status: no ERROR flags
- Has IDR thresholded peaks
- Low mitochondrial read fraction (ENCODE pipeline removes these)
- Good TSS enrichment score
- Nucleosome-free fragment enrichment visible
DNase-seq Quality Checks
- Audit status: no ERROR flags
- Has Hotspot2 peaks or IDR thresholded peaks
- Adequate sequencing depth (20M+ mapped reads)
- Signal-to-noise ratio
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download Peak Files
For each experiment:
# ATAC-seq — IDR thresholded peaks
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
# DNase-seq — may use different output types
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="peaks",
assembly="GRCh38"
)
Prefer preferred_default=True files.
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/accessibility",
organize_by="flat"
)
Step 4: Per-Sample Noise Filtering
4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak
4b. SignalValue Filtering (Perna et al. 2024)
Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7):
# Per-sample: remove bottom 25% by signalValue (true distribution quantile)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak
4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional)
For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts:
awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak
Step 5: Union Merge
Accessibility peaks are narrow/point-source (like H3K4me3). Use default merge (overlap only, no gap tolerance).
CRITICAL: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks:
# Tag each sample's peaks with a unique sample ID
awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed
awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed
# ... repeat for all samples
# Concatenate all tagged peaks (ATAC + DNase combined or separate)
cat *.tagged.bed > all_accessibility.bed
# Sort
bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed
# Union merge — count UNIQUE SAMPLES (not peaks)
bedtools merge \
-i all_accessibility.sorted.bed \
-c 4,7,9 \
-o count_distinct,max,max \
> union_accessible_regions.bed
# Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue
If Tracking Assay Source
To annotate whether peaks came from ATAC, DNase, or both:
# Add assay tag to each peak before concatenation
awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed
awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed
# After merge, use bedtools multiIntersect to track sources
bedtools multiIntersect \
-i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \
-header \
-names ATAC_1 ATAC_2 DNase_1 ... \
> multi_intersect.bed
Step 6: Confidence Annotation
Same logic as histone aggregation. Given N total samples:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | ≥50% of samples | Constitutive accessible region |
| Supported | 2+ samples | Likely real, some variation |
| Singleton | 1 sample only | Keep — may be individual-specific or condition-specific |
awk -v N=8 '{
if ($4 >= N*0.5) conf="HIGH";
else if ($4 >= 2) conf="SUPPORTED";
else conf="SINGLETON";
print $0"\t"conf"\t"$4"/"N
}' union_accessible_regions.bed > union_accessible_regions.annotated.bed
Step 7: Log Provenance
encode_log_derived_file(
file_path="/path/to/union_accessible_regions.annotated.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
descriptio