Motif Analysis of ENCODE Peak Data
When to Use
- User wants to discover transcription factor binding motifs in ChIP-seq or ATAC-seq peaks
- User asks about "motif enrichment", "HOMER", "MEME", or "de novo motif discovery"
- User needs to validate ChIP-seq targets by checking if the expected motif is enriched
- User wants to find co-binding partners or co-factor motifs in peak regions
- Example queries: "find motifs in my CTCF peaks", "run HOMER on ATAC-seq peaks", "what TFs co-bind with p300 in liver?"
Help the user perform de novo and known motif enrichment analysis on ENCODE ChIP-seq and ATAC-seq peaks. Motif analysis serves two critical purposes: (1) validating that ChIP-seq experiments pulled down the expected transcription factor, and (2) discovering co-regulatory partners that co-bind with the target factor. This skill covers the two major tool suites -- HOMER and MEME Suite -- from input preparation through result interpretation.
Literature Foundation
| Reference | Journal | Key Contribution | DOI | Citations |
|---|---|---|---|---|
| Heinz et al. (2010) | Molecular Cell | HOMER: Simple combinations of lineage-determining TFs prime cis-regulatory elements; introduced findMotifsGenome.pl for ChIP-seq motif analysis | 10.1016/j.molcel.2010.05.004 | ~6,000 |
| Bailey et al. (2009) | Nucleic Acids Research | MEME Suite: comprehensive tools for motif discovery (MEME), enrichment (AME), scanning (FIMO), and spacing (SpaMo) | 10.1093/nar/gkp335 | ~2,500 |
| Bailey & Elkan (1994) | ISMB | Foundational MEME algorithm: expectation maximization for discovering ungapped motifs in biopolymers | PMID: 7584402 | ~4,000 |
| Machanick & Bailey (2011) | Bioinformatics | MEME-ChIP: all-in-one motif analysis pipeline optimized for large ChIP-seq datasets | 10.1093/bioinformatics/btr189 | ~1,800 |
| Fornes et al. (2020) | Nucleic Acids Research | JASPAR 2020: curated, non-redundant TF binding profile database; standard reference for known motifs | 10.1093/nar/gkz1001 | ~2,200 |
| Amemiya et al. (2019) | Scientific Reports | ENCODE Blacklist: regions producing artifact signal that can generate spurious motif hits | 10.1038/s41598-019-45839-z | ~1,372 |
Prerequisites: Input Preparation
Obtaining ENCODE Peaks
Search for and download ChIP-seq or ATAC-seq peaks:
encode_search_experiments(
assay_title="TF ChIP-seq",
target="CTCF",
organ="pancreas",
biosample_type="tissue"
)
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38",
preferred_default=True
)
encode_download_files(
file_accessions=["ENCFF..."],
download_dir="/data/motif_analysis/"
)
Preparing Sequences from Peaks
Motif analysis requires DNA sequences, not just genomic coordinates. Extract sequences centered on peak summits:
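# For TF ChIP-seq: extract summit +/- 100bp (200bp window)
# Keep strand ($6) and signalValue ($7) so later steps can still sort on column 7;
# clamp the start at 0 for summits within 100bp of a chromosome start
awk 'BEGIN{OFS="\t"} {summit=$2+$10; s=summit-100; if (s<0) s=0; print $1, s, summit+100, $4, $5, $6, $7}' \
peaks.narrowPeak > summits_200bp.bed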
# Remove blacklisted regions (Amemiya et al. 2019)
bedtools intersect -a summits_200bp.bed \
-b hg38-blacklist.v2.bed -v > summits_clean.bed
# Extract FASTA sequences (requires genome FASTA)
bedtools getfasta -fi hg38.fa -bed summits_clean.bed -fo summits.fa
# For ATAC-seq: use full peak regions (typically 200-500bp)
bedtools getfasta -fi hg38.fa -bed atac_peaks_clean.bed -fo atac_peaks.fa
Critical: For TF ChIP-seq, always center on the summit and use a narrow window (150-250bp). In narrowPeak format, column 10 holds the summit as a 0-based offset from the peak start (column 2), so the summit coordinate is column 2 + column 10. Using the full peak region dilutes the motif signal because TF binding sites are concentrated near the summit. For histone ChIP-seq, use the full peak or a broader window because histone marks cover larger domains.
Subsampling Large Peak Sets
For peak sets larger than 50,000 peaks, keep only the strongest peaks by signal strength:
# Sort by signalValue (column 7) descending, take top 10,000
sort -k7,7nr summits_clean.bed | head -10000 > top10k_summits.bed
bedtools getfasta -fi hg38.fa -bed top10k_summits.bed -fo top10k_summits.fa
This speeds up the run with little loss of sensitivity for the primary motif, because the strongest peaks tend to contain the most consistent motif instances.
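The same top-N selection can be expressed in Python when the peak lines are already in memory. This is a sketch that assumes signalValue sits in column 7, as it does in narrowPeak. The function name `top_peaks_by_signal` and the sample peaks are hypothetical.

```python
def top_peaks_by_signal(bed_lines, n=10000, signal_col=7):
    """Return the n lines with the highest value in the 1-based signal column."""
    rows = [ln.rstrip("\n") for ln in bed_lines if ln.strip()]
    # Sort descending on the numeric signal column, mirroring `sort -k7,7nr`.
    rows.sort(key=lambda ln: float(ln.split("\t")[signal_col - 1]), reverse=True)
    return rows[:n]


# Made-up peaks with signalValue in column 7.
peaks = [
    "chr1\t100\t300\tpeak_a\t500\t.\t12.0",
    "chr2\t400\t600\tpeak_b\t900\t.\t48.5",
    "chr3\t700\t900\tpeak_c\t700\t.\t30.1",
]
top2 = top_peaks_by_signal(peaks, n=2)
print([row.split("\t")[3] for row in top2])
```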
Part 1: HOMER findMotifsGenome
HOMER (Heinz et al. 2010) performs both de novo motif discovery and known motif enrichment in a single command. It is one of the most widely used tools for ChIP-seq motif analysis.
1a. Basic Usage
findMotifsGenome.pl peaks.bed hg38 homer_output/ \
-size 200 \
-mask \
-p 8 \
-preparsedDir /data/homer_preparsed/
Key parameters:
| Parameter | Value | Rationale |
|---|---|---|
| -size | 200 (TF ChIP-seq) | Window around the peak center (the midpoint of each input region, so pass summit-centered intervals); 200bp captures a typical TF binding site plus flanking context |
| -size | given (histone ChIP-seq) | Use actual peak boundaries for broad marks |
| -mask | always include | Mask repeat sequences to avoid spurious repeat-derived motifs |
| -p | 8 (or available cores) | Parallel threads for speed |
| -preparsedDir | reusable directory | Cache parsed genome for repeated runs |
| -bg | background.bed (optional) | Custom background regions; by default HOMER selects GC-matched regions from the genome |
| -mknown | motifs.motif (optional) | Test the known motifs in this file for enrichment, in place of the default database |
| -len | 8,10,12 (default) | Motif lengths to search; default covers most TF motifs |
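When HOMER is launched from a script, the parameter choices in the table can be captured in a small helper. This is a sketch only: `homer_command` and its `assay` argument are hypothetical names, and it uses only the options listed above. Running the result requires HOMER on the PATH with the hg38 genome package installed.

```python
import shlex


def homer_command(peaks_bed, genome, out_dir, assay="tf", threads=8, preparsed_dir=None):
    """Build an argument list for findMotifsGenome.pl from the options tabled above."""
    # Fixed 200bp window for TF ChIP-seq; actual region boundaries otherwise.
    size = "200" if assay == "tf" else "given"
    cmd = ["findMotifsGenome.pl", peaks_bed, genome, out_dir,
           "-size", size, "-mask", "-p", str(threads)]
    if preparsed_dir:
        cmd += ["-preparsedDir", preparsed_dir]
    return cmd


cmd = homer_command("summits_clean.bed", "hg38", "homer_output/",
                    preparsed_dir="/data/homer_preparsed/")
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.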
1b. Output Structure
HOMER produces a structured output directory:
homer_output/
    homerResults.html        # De novo motif results (interactive HTML)
    knownResults.html        # Known motif enrichment results
    homerResults/
        motif1.motif         # Position weight matrix for each de novo motif
        motif2.motif
        ...
    knownResults/
        known1.motif         # Matched known motif PWMs
        ...
    homerMotifs.all.motifs   # All de novo motifs in one file
    seq.autonorm.tsv         # Normalization statistics
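The .motif files are plain text: a tab-separated header line starting with ">" (consensus first, then the motif name), followed by one row of A, C, G, T probabilities per position. A minimal parser might look like the sketch below. The function names and the example record are made up for illustration, and only the first two header fields are interpreted.

```python
def parse_homer_motif(text):
    """Parse one HOMER .motif record into (consensus, name, matrix).

    matrix is a list of {"A": p, "C": p, "G": p, "T": p} dicts, one per position.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")
    header = lines[0][1:].split("\t")
    consensus, name = header[0], header[1]
    matrix = [dict(zip("ACGT", map(float, ln.split()))) for ln in lines[1:]]
    return consensus, name, matrix


def best_bases(matrix):
    """Return the most probable base at each position as a string."""
    return "".join(max(row, key=row.get) for row in matrix)


# Made-up four-position motif, not taken from a real HOMER run.
example = (
    ">CCAG\tmotif1-example\t6.5\n"
    "0.05\t0.85\t0.05\t0.05\n"
    "0.10\t0.70\t0.10\t0.10\n"
    "0.80\t0.05\t0.10\t0.05\n"
    "0.05\t0.30\t0.60\t0.05\n"
)
consensus, name, matrix = parse_homer_motif(example)
print(name, consensus, best_bases(matrix))
```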
1c. HOMER for ATAC-seq Peaks
ATAC-seq peaks represent accessible chromatin, not specific TF binding. Motif analysis on ATAC peaks suggests which TFs are likely to occupy accessible regions; enrichment alone does not demonstrate binding:
findMotifsGenome.pl atac_peaks.bed hg38 homer_atac_output/ \
-size given \
-mask \
-p 8
Use -size given for ATAC-seq to analyze the full accessible region rather than a fixed window.
Part 2: MEME Suite
The MEME Suite (Bailey et al. 2009) provides a complementary approach with different algorithms and additional capabilities for motif spacing analysis and scanning.
2a. MEME-ChIP: All-in-One Pipeline
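MEME-ChIP chains several MEME Suite tools in one run. In recent releases these are MEME (de novo discovery), STREME (short motif discovery; older releases used DREME), CentriMo (central enrichment of discovered and known motifs), Tomtom (comparison of discovered motifs against the -db database), and SpaMo (motif spacing). AME (known motif enrichment) is a separate tool that is run on its own.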
meme-chip \
-meme-maxw 30 \
-meme-nmotifs 10 \
-meme-minw 6 \
-db JASPAR2024_CORE_vertebrates.meme \
-o memechip_output/ \
summits.fa
Required input: FASTA file of peak sequences (not BED coordinates -- MEME Suite works on sequences, not genomic intervals).
Key parameters:
| Parameter | Value | Purpose |
|---|---|---|
| -meme-maxw | 30 | Maximum motif width; 30 covers most TF motifs |
| -meme-minw | 6 | Minimum motif width |
| -meme-nmotifs | 10 | Number of de novo motifs to find |
| -db | JASPAR file | Known motif database for enrichment testing |
| -o | output directory | Results directory |
| -meme-mod | zoops (default) | Zero or one occurrence per sequence; appropriate for ChIP-seq |
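For scripted runs, the MEME-ChIP options in the table can be assembled and sanity-checked before launching. The helper below is a hypothetical sketch (`meme_chip_command` is not part of the MEME Suite) and uses only the options listed above.

```python
def meme_chip_command(fasta, motif_db, out_dir, minw=6, maxw=30, nmotifs=10):
    """Build an argument list for meme-chip from the options tabled above."""
    if minw > maxw:
        raise ValueError("minimum motif width cannot exceed maximum motif width")
    return ["meme-chip",
            "-meme-minw", str(minw),
            "-meme-maxw", str(maxw),
            "-meme-nmotifs", str(nmotifs),
            "-db", motif_db,
            "-o", out_dir,
            fasta]  # the FASTA input goes last, after all options


meme_cmd = meme_chip_command("summits.fa",
                             "JASPAR2024_CORE_vertebrates.meme",
                             "memechip_output/")
print(" ".join(meme_cmd))
```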
2b. Individual MEME Suite Tools
AME (Analysis of Motif Enrichment): Known motif enrichment testing, analogous to HOMER's known motif enrichment