Peak Annotation of ENCODE Data

When to Use

User wants to annotate genomic peaks with nearby genes, regulatory features, or functional categories
User asks about "peak annotation", "ChIPseeker", "gene assignment", or "peak-to-gene mapping"
User needs to classify peaks as promoter, enhancer, intronic, intergenic, etc.
User wants to run GO/pathway enrichment on genes near their peaks
Example queries: "annotate my H3K27ac peaks with nearby genes", "what genes are near these ATAC-seq peaks?", "run pathway analysis on peak-associated genes"

Help the user annotate ENCODE peak calls with genomic features and functional enrichment. Peak annotation bridges the gap between regulatory elements (peaks) and biological function (genes, pathways). This skill covers two complementary approaches: ChIPseeker for genomic feature annotation and visualization, and GREAT for functional enrichment analysis of non-coding regions.

Literature Foundation

Reference	Journal	Key Contribution	DOI	Citations
Yu et al. (2015)	Bioinformatics	ChIPseeker: R/Bioconductor package for ChIP peak annotation, comparison, and visualization	10.1093/bioinformatics/btv145	~3,200
McLean et al. (2010)	Nature Biotechnology	GREAT: Genomic Regions Enrichment of Annotations Tool; assigns biological meaning to cis-regulatory regions using basal+extension gene association	10.1038/nbt.1630	~2,800
Zhu et al. (2010)	BMC Bioinformatics	ChIPpeakAnno: Bioconductor package for ChIP-seq/ChIP-chip annotation; pioneered peak-gene association	10.1186/1471-2105-11-237	~1,200
Tanigawa et al. (2022)	PLOS Computational Biology	rGREAT: R/Bioconductor interface to GREAT; enables programmatic enrichment analysis	10.1371/journal.pcbi.1010378	~80
Amemiya et al. (2019)	Scientific Reports	ENCODE Blacklist: artifact regions to exclude before annotation to avoid spurious gene associations	10.1038/s41598-019-45839-z	~1,372

Prerequisites: Obtaining ENCODE Peaks

Search for and download peak files before annotation:

encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    biosample_type="tissue"
)

encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38",
    preferred_default=True
)

encode_download_files(
    file_accessions=["ENCFF..."],
    download_dir="/data/peak_annotation/"
)

Pre-annotation filtering: Always remove blacklisted regions before annotation:

bedtools intersect -a peaks.narrowPeak -b hg38-blacklist.v2.bed -v > peaks_clean.narrowPeak

Part 1: ChIPseeker Annotation

ChIPseeker (Yu et al. 2015) annotates peaks with genomic features (promoter, UTR, exon, intron, intergenic) and provides publication-ready visualizations of the annotation distribution.

1a. Basic Annotation Workflow

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(clusterProfiler)

# Load peak file
peaks <- readPeakFile("H3K27ac_peaks_clean.narrowPeak")

# Set transcript database (must match genome assembly)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Annotate peaks with genomic features
peakAnno <- annotatePeak(
    peaks,
    TxDb = txdb,
    annoDb = "org.Hs.eg.db",
    level = "gene",
    tssRegion = c(-3000, 3000)
)

# View annotation summary
peakAnno

1b. Understanding Annotation Categories

ChIPseeker classifies each peak into one genomic feature based on priority:

Priority	Feature	Definition
1	Promoter	Within TSS region (default: TSS +/- 3kb)
2	5' UTR	Overlapping 5' untranslated region
3	3' UTR	Overlapping 3' untranslated region
4	Exon	Overlapping exonic sequence
5	Intron	Within intronic sequence
6	Downstream	Within 3kb downstream of gene end
7	Distal Intergenic	Everything else (>3kb from any gene)

Each peak receives exactly one annotation based on the highest-priority feature it overlaps.

Promoter sub-categories: ChIPseeker can further subdivide promoter peaks:

peakAnno <- annotatePeak(
    peaks,
    TxDb = txdb,
    annoDb = "org.Hs.eg.db",
    level = "gene",
    tssRegion = c(-3000, 3000),
    # Subdivide promoter into bins
    genomicAnnotationPriority = c(
        "Promoter", "5UTR", "3UTR", "Exon", "Intron", "Downstream", "Intergenic"
    )
)

1c. Visualization Functions

ChIPseeker provides several publication-ready visualization functions:

Genomic Feature Distribution Bar Plot:

plotAnnoBar(peakAnno) +
    theme_minimal(base_size = 14) +
    ggtitle("H3K27ac Genomic Feature Distribution")
ggsave("annotation_barplot.pdf", width = 10, height = 5)

Genomic Feature Distribution Pie Chart:

plotAnnoPie(peakAnno)

Distance to TSS Distribution:

plotDistToTSS(
    peakAnno,
    title = "H3K27ac Distance to Nearest TSS"
) + theme_minimal(base_size = 14)
ggsave("tss_distance.pdf", width = 8, height = 5)

Peak Coverage Across Chromosomes:

covplot(peaks, weightCol = "V5") +
    ggtitle("H3K27ac Peak Coverage")
ggsave("chromosome_coverage.pdf", width = 12, height = 6)

TSS-Centered Heatmap:

# Tag matrix around TSS
tagMatrix <- getTagMatrix(peaks, windows = promoter)
tagHeatmap(tagMatrix, xlim = c(-3000, 3000), color = "#FF8000")

1d. Extracting Annotated Gene Lists

# Convert annotation to data frame
anno_df <- as.data.frame(peakAnno)

# Get genes associated with promoter peaks
promoter_genes <- anno_df[grepl("Promoter", anno_df$annotation), "SYMBOL"]
promoter_genes <- unique(promoter_genes[!is.na(promoter_genes)])

# Get all annotated genes (any feature overlap)
all_genes <- unique(anno_df$SYMBOL[!is.na(anno_df$SYMBOL)])

# Export for downstream analysis
write.csv(anno_df, "peak_annotations.csv", row.names = FALSE)
write.table(promoter_genes, "promoter_genes.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

1e. Gene Ontology Enrichment with clusterProfiler

After extracting gene lists, perform GO enrichment:

# Convert gene symbols to Entrez IDs
gene_ids <- bitr(promoter_genes, fromType = "SYMBOL",
                 toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# GO Biological Process enrichment
ego <- enrichGO(
    gene = gene_ids$ENTREZID,
    OrgDb = org.Hs.eg.db,
    ont = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.05,
    readable = TRUE
)

# Visualize enrichment
dotplot(ego, showCategory = 20) +
    ggtitle("GO Biological Process Enrichment")
ggsave("go_enrichment.pdf", width = 10, height = 8)

# KEGG pathway enrichment
ekegg <- enrichKEGG(
    gene = gene_ids$ENTREZID,
    organism = "hsa",
    pvalueCutoff = 0.05
)

Part 2: GREAT Enrichment

GREAT (McLean et al. 2010) assigns biological meaning to sets of non-coding genomic regions. Unlike ChIPseeker (which annotates individual peaks to nearest genes), GREAT uses a sophisticated gene association rule that accounts for gene density variation across the genome.

2a. GREAT Gene Association Rules

GREAT associates genomic regions to genes using a "basal plus extension" rule:

Basal domain: Each gene gets a default regulatory domain of 5kb upstream and 1kb downstream of the TSS
Extension: The basal domain extends up to 1Mb in both directions until it encounters another gene's basal domain
Curated domains: Known regulatory domains from the literature override the default rules

This approach is superior to simple nearest-gene assignment because it accounts for the fact that genes in gene-dense regions h

peak-annotation

How to add

Drop this on your repo README

Related skills

xlsx

mem-search

weekly-digests

how-it-works

Get new Dados e Análise skills every Monday