When to Use

User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue
User asks "where is chromatin accessible in my tissue?" or "build an open chromatin map"
User needs to merge accessibility data from different labs, donors, or platforms (ATAC vs DNase)
User wants a comprehensive set of open chromatin regions for regulatory element discovery
Example queries: "aggregate ATAC-seq peaks for pancreas", "combine DNase-seq across donors", "find all accessible regions in liver"

Aggregate Chromatin Accessibility Peaks Across Studies

Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments.

Scientific Rationale

The question: "Where is chromatin accessible in my tissue?"

Like histone marks, chromatin accessibility is a detection question. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the union of all detections.

ATAC-seq vs DNase-seq

Both measure open chromatin but with different biases:

Property	ATAC-seq	DNase-seq
Method	Tn5 transposase insertion	DNase I hypersensitivity
Input required	~50K cells	~1M cells
Resolution	High	High
GC bias	Moderate (Tn5 preference)	Low
Mitochondrial reads	High (filter needed)	None
ENCODE availability	Newer experiments	Extensive historical catalog
Comparability	Generally comparable at open regions

Literature Support

Corces et al. 2017 (Nature Methods, 733 citations): Established that ATAC-seq and DNase-seq identify largely overlapping accessible regions, with ATAC capturing ~75% of DNase sites. Both are valid for union maps.
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Comprehensive set of problematic genomic regions to filter. Essential for all functional genomics analyses. DOI
F-Seq2 (Zhao & Boyle 2020, NAR Genomics): Improved peak caller for DNase-seq and ATAC-seq with proper test statistics for IDR compatibility.
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated accessibility data with histone marks across tissues for chromatin state annotation.

Recommendation: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms.

Step 1: Find All Available Accessibility Data

# ATAC-seq
encode_search_experiments(
    assay_title="ATAC-seq",
    organ="pancreas",
    biosample_type="tissue",
    limit=100
)

# DNase-seq
encode_search_experiments(
    assay_title="DNase-seq",
    organ="pancreas",
    biosample_type="tissue",
    limit=100
)

Present a summary to the user:

Total ATAC-seq experiments
Total DNase-seq experiments
Labs represented
Whether to use one or both assay types

Combining ATAC + DNase?

Ask the user:

Same assay only (purest comparison, no cross-platform effects)
Both assays combined (maximum coverage, slight platform variation)

For a comprehensive accessibility catalog, combining both is scientifically justified.

Step 2: Quality-Gate Each Experiment

encode_get_experiment(accession="ENCSR...")

ATAC-seq Quality Checks

Audit status: no ERROR flags
Has IDR thresholded peaks
Low mitochondrial read fraction (ENCODE pipeline removes these)
Good TSS enrichment score
Nucleosome-free fragment enrichment visible

DNase-seq Quality Checks

Audit status: no ERROR flags
Has Hotspot2 peaks or IDR thresholded peaks
Adequate sequencing depth (20M+ mapped reads)
Signal-to-noise ratio

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download Peak Files

For each experiment:

# ATAC-seq — IDR thresholded peaks
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38"
)

# DNase-seq — may use different output types
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="peaks",
    assembly="GRCh38"
)

Prefer preferred_default=True files.

encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/accessibility",
    organize_by="flat"
)

Step 4: Per-Sample Noise Filtering

4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak

4b. SignalValue Filtering (Perna et al. 2024)

Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7):

# Per-sample: remove bottom 25% by signalValue (true distribution quantile)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak

4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional)

For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts:

awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak

Step 5: Union Merge

Accessibility peaks are narrow/point-source (like H3K4me3). Use default merge (overlap only, no gap tolerance).

CRITICAL: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks:

# Tag each sample's peaks with a unique sample ID
awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed
awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed
# ... repeat for all samples

# Concatenate all tagged peaks (ATAC + DNase combined or separate)
cat *.tagged.bed > all_accessibility.bed

# Sort
bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed

# Union merge — count UNIQUE SAMPLES (not peaks)
bedtools merge \
    -i all_accessibility.sorted.bed \
    -c 4,7,9 \
    -o count_distinct,max,max \
    > union_accessible_regions.bed
# Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue

If Tracking Assay Source

To annotate whether peaks came from ATAC, DNase, or both:

# Add assay tag to each peak before concatenation
awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed
awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed

# After merge, use bedtools multiIntersect to track sources
bedtools multiIntersect \
    -i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \
    -header \
    -names ATAC_1 ATAC_2 DNase_1 ... \
    > multi_intersect.bed

Step 6: Confidence Annotation

Same logic as histone aggregation. Given N total samples:

Confidence	Criteria	Interpretation
High	≥50% of samples	Constitutive accessible region
Supported	2+ samples	Likely real, some variation
Singleton	1 sample only	Keep — may be individual-specific or condition-specific

awk -v N=8 '{
    if ($4 >= N*0.5) conf="HIGH";
    else if ($4 >= 2) conf="SUPPORTED";
    else conf="SINGLETON";
    print $0"\t"conf"\t"$4"/"N
}' union_accessible_regions.bed > union_accessible_regions.annotated.bed

Step 7: Log Provenance

encode_log_derived_file(
    file_path="/path/to/union_accessible_regions.annotated.bed",
    source_accessions=["ENCSR...", "ENCSR...", ...],
    descriptio

accessibility-aggregation

How to add

Drop this on your repo README

Related skills

xlsx

mem-search

weekly-digests

how-it-works

Get new Dados e Análise skills every Monday