Peak Annotation of ENCODE Data
When to Use
- User wants to annotate genomic peaks with nearby genes, regulatory features, or functional categories
- User asks about "peak annotation", "ChIPseeker", "gene assignment", or "peak-to-gene mapping"
- User needs to classify peaks as promoter, enhancer, intronic, intergenic, etc.
- User wants to run GO/pathway enrichment on genes near their peaks
- Example queries: "annotate my H3K27ac peaks with nearby genes", "what genes are near these ATAC-seq peaks?", "run pathway analysis on peak-associated genes"
Help the user annotate ENCODE peak calls with genomic features and functional enrichment. Peak annotation bridges the gap between regulatory elements (peaks) and biological function (genes, pathways). This skill covers two complementary approaches: ChIPseeker for genomic feature annotation and visualization, and GREAT for functional enrichment analysis of non-coding regions.
Literature Foundation
| Reference | Journal | Key Contribution | DOI | Citations |
|---|---|---|---|---|
| Yu et al. (2015) | Bioinformatics | ChIPseeker: R/Bioconductor package for ChIP peak annotation, comparison, and visualization | 10.1093/bioinformatics/btv145 | ~3,200 |
| McLean et al. (2010) | Nature Biotechnology | GREAT: Genomic Regions Enrichment of Annotations Tool; assigns biological meaning to cis-regulatory regions using basal+extension gene association | 10.1038/nbt.1630 | ~2,800 |
| Zhu et al. (2010) | BMC Bioinformatics | ChIPpeakAnno: Bioconductor package for ChIP-seq/ChIP-chip annotation; pioneered peak-gene association | 10.1186/1471-2105-11-237 | ~1,200 |
| Tanigawa et al. (2022) | PLOS Computational Biology | rGREAT: R/Bioconductor interface to GREAT; enables programmatic enrichment analysis | 10.1371/journal.pcbi.1010378 | ~80 |
| Amemiya et al. (2019) | Scientific Reports | ENCODE Blacklist: artifact regions to exclude before annotation to avoid spurious gene associations | 10.1038/s41598-019-45839-z | ~1,372 |
Prerequisites: Obtaining ENCODE Peaks
Search for and download peak files before annotation:
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K27ac",
organ="pancreas",
biosample_type="tissue"
)
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38",
preferred_default=True
)
encode_download_files(
file_accessions=["ENCFF..."],
download_dir="/data/peak_annotation/"
)
Pre-annotation filtering: Always remove blacklisted regions before annotation:
bedtools intersect -a peaks.narrowPeak -b hg38-blacklist.v2.bed -v > peaks_clean.narrowPeak
Part 1: ChIPseeker Annotation
ChIPseeker (Yu et al. 2015) annotates peaks with genomic features (promoter, UTR, exon, intron, intergenic) and provides publication-ready visualizations of the annotation distribution.
1a. Basic Annotation Workflow
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(clusterProfiler)
# Load peak file
peaks <- readPeakFile("H3K27ac_peaks_clean.narrowPeak")
# Set transcript database (must match genome assembly)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# Annotate peaks with genomic features
peakAnno <- annotatePeak(
peaks,
TxDb = txdb,
annoDb = "org.Hs.eg.db",
level = "gene",
tssRegion = c(-3000, 3000)
)
# View annotation summary
peakAnno
1b. Understanding Annotation Categories
ChIPseeker classifies each peak into one genomic feature based on priority:
| Priority | Feature | Definition |
|---|---|---|
| 1 | Promoter | Within TSS region (default: TSS +/- 3kb) |
| 2 | 5' UTR | Overlapping 5' untranslated region |
| 3 | 3' UTR | Overlapping 3' untranslated region |
| 4 | Exon | Overlapping exonic sequence |
| 5 | Intron | Within intronic sequence |
| 6 | Downstream | Within 3kb downstream of gene end |
| 7 | Distal Intergenic | Everything else (>3kb from any gene) |
Each peak receives exactly one annotation based on the highest-priority feature it overlaps.
Promoter sub-categories: ChIPseeker can further subdivide promoter peaks:
peakAnno <- annotatePeak(
peaks,
TxDb = txdb,
annoDb = "org.Hs.eg.db",
level = "gene",
tssRegion = c(-3000, 3000),
# Subdivide promoter into bins
genomicAnnotationPriority = c(
"Promoter", "5UTR", "3UTR", "Exon", "Intron", "Downstream", "Intergenic"
)
)
1c. Visualization Functions
ChIPseeker provides several publication-ready visualization functions:
Genomic Feature Distribution Bar Plot:
plotAnnoBar(peakAnno) +
theme_minimal(base_size = 14) +
ggtitle("H3K27ac Genomic Feature Distribution")
ggsave("annotation_barplot.pdf", width = 10, height = 5)
Genomic Feature Distribution Pie Chart:
plotAnnoPie(peakAnno)
Distance to TSS Distribution:
plotDistToTSS(
peakAnno,
title = "H3K27ac Distance to Nearest TSS"
) + theme_minimal(base_size = 14)
ggsave("tss_distance.pdf", width = 8, height = 5)
Peak Coverage Across Chromosomes:
covplot(peaks, weightCol = "V5") +
ggtitle("H3K27ac Peak Coverage")
ggsave("chromosome_coverage.pdf", width = 12, height = 6)
TSS-Centered Heatmap:
# Tag matrix around TSS
tagMatrix <- getTagMatrix(peaks, windows = promoter)
tagHeatmap(tagMatrix, xlim = c(-3000, 3000), color = "#FF8000")
1d. Extracting Annotated Gene Lists
# Convert annotation to data frame
anno_df <- as.data.frame(peakAnno)
# Get genes associated with promoter peaks
promoter_genes <- anno_df[grepl("Promoter", anno_df$annotation), "SYMBOL"]
promoter_genes <- unique(promoter_genes[!is.na(promoter_genes)])
# Get all annotated genes (any feature overlap)
all_genes <- unique(anno_df$SYMBOL[!is.na(anno_df$SYMBOL)])
# Export for downstream analysis
write.csv(anno_df, "peak_annotations.csv", row.names = FALSE)
write.table(promoter_genes, "promoter_genes.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
1e. Gene Ontology Enrichment with clusterProfiler
After extracting gene lists, perform GO enrichment:
# Convert gene symbols to Entrez IDs
gene_ids <- bitr(promoter_genes, fromType = "SYMBOL",
toType = "ENTREZID", OrgDb = org.Hs.eg.db)
# GO Biological Process enrichment
ego <- enrichGO(
gene = gene_ids$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE
)
# Visualize enrichment
dotplot(ego, showCategory = 20) +
ggtitle("GO Biological Process Enrichment")
ggsave("go_enrichment.pdf", width = 10, height = 8)
# KEGG pathway enrichment
ekegg <- enrichKEGG(
gene = gene_ids$ENTREZID,
organism = "hsa",
pvalueCutoff = 0.05
)
Part 2: GREAT Enrichment
GREAT (McLean et al. 2010) assigns biological meaning to sets of non-coding genomic regions. Unlike ChIPseeker (which annotates individual peaks to nearest genes), GREAT uses a sophisticated gene association rule that accounts for gene density variation across the genome.
2a. GREAT Gene Association Rules
GREAT associates genomic regions to genes using a "basal plus extension" rule:
- Basal domain: Each gene gets a default regulatory domain of 5kb upstream and 1kb downstream of the TSS
- Extension: The basal domain extends up to 1Mb in both directions until it encounters another gene's basal domain
- Curated domains: Known regulatory domains from the literature override the default rules
This approach is superior to simple nearest-gene assignment because it accounts for the fact that genes in gene-dense regions h