Motif Analysis of ENCODE Peak Data

When to Use

User wants to discover transcription factor binding motifs in ChIP-seq or ATAC-seq peaks
User asks about "motif enrichment", "HOMER", "MEME", or "de novo motif discovery"
User needs to validate ChIP-seq targets by checking if the expected motif is enriched
User wants to find co-binding partners or co-factor motifs in peak regions
Example queries: "find motifs in my CTCF peaks", "run HOMER on ATAC-seq peaks", "what TFs co-bind with p300 in liver?"

Help the user perform de novo and known motif enrichment analysis on ENCODE ChIP-seq and ATAC-seq peaks. Motif analysis serves two critical purposes: (1) validating that ChIP-seq experiments pulled down the expected transcription factor, and (2) discovering co-regulatory partners that co-bind with the target factor. This skill covers the two major tool suites -- HOMER and MEME Suite -- from input preparation through result interpretation.

Literature Foundation

Reference	Journal	Key Contribution	DOI	Citations
Heinz et al. (2010)	Molecular Cell	HOMER: Simple combinations of lineage-determining TFs prime cis-regulatory elements; introduced findMotifsGenome.pl for ChIP-seq motif analysis	10.1016/j.molcel.2010.05.004	~6,000
Bailey et al. (2009)	Nucleic Acids Research	MEME Suite: comprehensive tools for motif discovery (MEME), enrichment (AME), scanning (FIMO), and spacing (SpaMo)	10.1093/nar/gkp335	~2,500
Bailey & Elkan (1994)	ISMB	Foundational MEME algorithm: expectation maximization for discovering ungapped motifs in biopolymers	PMID: 7584402	~4,000
Machanick & Bailey (2011)	Bioinformatics	MEME-ChIP: all-in-one motif analysis pipeline optimized for large ChIP-seq datasets	10.1093/bioinformatics/btr189	~1,800
Fornes et al. (2020)	Nucleic Acids Research	JASPAR 2020: curated, non-redundant TF binding profile database; standard reference for known motifs	10.1093/nar/gkz1001	~2,200
Amemiya et al. (2019)	Scientific Reports	ENCODE Blacklist: regions producing artifact signal that can generate spurious motif hits	10.1038/s41598-019-45839-z	~1,372

Prerequisites: Input Preparation

Obtaining ENCODE Peaks

Search for and download ChIP-seq or ATAC-seq peaks:

encode_search_experiments(
    assay_title="TF ChIP-seq",
    target="CTCF",
    organ="pancreas",
    biosample_type="tissue"
)

encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38",
    preferred_default=True
)

encode_download_files(
    file_accessions=["ENCFF..."],
    download_dir="/data/motif_analysis/"
)

Preparing Sequences from Peaks

Motif analysis requires DNA sequences, not just genomic coordinates. Extract sequences centered on peak summits:

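# For TF ChIP-seq: extract summit +/- 100bp (200bp window)
# The start is clamped at 0 so peaks near a chromosome start do not produce negative coordinates
awk 'BEGIN{OFS="\t"} {summit=$2+$10; start=summit-100; if (start<0) start=0; print $1, start, summit+100, $4, $5}' \
    peaks.narrowPeak > summits_200bp.bed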

# Remove blacklisted regions (Amemiya et al. 2019)
bedtools intersect -a summits_200bp.bed \
    -b hg38-blacklist.v2.bed -v > summits_clean.bed

# Extract FASTA sequences (requires genome FASTA)
bedtools getfasta -fi hg38.fa -bed summits_clean.bed -fo summits.fa

# For ATAC-seq: use full peak regions (typically 200-500bp), blacklist-filtered the same way
bedtools getfasta -fi hg38.fa -bed atac_peaks_clean.bed -fo atac_peaks.fa

Critical: For TF ChIP-seq, always center on the summit and use a narrow window (150-250bp). In narrowPeak format, column 10 holds the summit as a 0-based offset from the peak start (column 2), so the summit coordinate is start + offset. Using the full peak region dilutes the motif signal because TF binding sites are concentrated at the summit. For histone ChIP-seq, use the full peak or a broader window because histone marks cover larger domains.
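
The summit arithmetic can be sketched in Python as follows. This is a minimal illustration, not part of any ENCODE tooling: `summit_window` is a hypothetical helper, the example record is invented, and the code assumes standard 10-column narrowPeak input in which column 10 is the summit offset from the start coordinate and -1 means no summit was called (the sketch then falls back to the peak midpoint).

```python
def summit_window(narrowpeak_line, half_width=100):
    """Return (chrom, start, end, name, score) for a window centered on the summit.

    Hypothetical helper for illustration; assumes 10-column narrowPeak input.
    """
    fields = narrowpeak_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name, score = fields[3], fields[4]
    offset = int(fields[9])  # column 10: summit offset from chromStart
    if offset < 0:
        # -1 means no summit was called; fall back to the peak midpoint
        offset = (end - start) // 2
    summit = start + offset
    # Clamp at 0 so peaks near a chromosome start do not yield negative coordinates
    return (chrom, max(0, summit - half_width), summit + half_width, name, score)


# Invented example record: peak at chr1:1000-1400 with the summit 150 bp into the peak
example = "chr1\t1000\t1400\tpeak1\t850\t.\t12.5\t-1\t3.2\t150"
print(summit_window(example))
```

Changing `half_width` to 75 or 125 gives the 150bp and 250bp windows at either end of the recommended range.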

Subsampling Large Peak Sets

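For peak sets larger than 50,000 peaks, keep only the top-ranked peaks:

# summits_clean.bed has five columns (chrom, start, end, name, score);
# signalValue (narrowPeak column 7) is not carried into this file, so rank on the score in column 5
sort -k5,5nr summits_clean.bed | head -10000 > top10k_summits.bed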
bedtools getfasta -fi hg38.fa -bed top10k_summits.bed -fo top10k_summits.fa

This speeds up the analysis, usually with little loss of sensitivity, because the strongest peaks tend to contain the most consistent motif instances.
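
If ranking by signalValue specifically is wanted, it has to happen on the original narrowPeak records, where signalValue is column 7, before the summit windows are built. The following is a small sketch of that selection step; `top_peaks_by_signal` is a hypothetical helper and the three records are invented.

```python
def top_peaks_by_signal(narrowpeak_lines, n=10000):
    """Return the n records with the highest signalValue (narrowPeak column 7).

    Hypothetical helper; each record is returned as a list of column strings.
    """
    records = [line.rstrip("\n").split("\t") for line in narrowpeak_lines if line.strip()]
    # Column 7 (index 6) is signalValue; sort strongest first
    records.sort(key=lambda fields: float(fields[6]), reverse=True)
    return records[:n]


# Invented example records
peaks = [
    "chr1\t100\t400\tpeakA\t500\t.\t8.2\t-1\t2.1\t150",
    "chr1\t900\t1300\tpeakB\t900\t.\t31.7\t-1\t4.0\t210",
    "chr2\t50\t350\tpeakC\t700\t.\t15.0\t-1\t3.3\t120",
]
top2 = top_peaks_by_signal(peaks, n=2)
print([fields[3] for fields in top2])
```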

Part 1: HOMER findMotifsGenome

HOMER (Heinz et al. 2010) performs both de novo motif discovery and known motif enrichment in a single command. It is one of the most widely used tools for ChIP-seq motif analysis.

1a. Basic Usage

findMotifsGenome.pl peaks.bed hg38 homer_output/ \
    -size 200 \
    -mask \
    -p 8 \
    -preparsedDir /data/homer_preparsed/

Key parameters:

Parameter	Value	Rationale
`-size`	200 (TF ChIP-seq)	Window around peak center; 200bp captures typical TF binding site + flanking context
`-size`	given (histone ChIP-seq)	Use actual peak boundaries for broad marks
`-mask`	always include	Mask repeat sequences to avoid spurious repeat-derived motifs
`-p`	8 (or available cores)	Parallel threads for speed
`-preparsedDir`	reusable directory	Cache parsed genome for repeated runs
`-bg`	background.bed (optional)	Custom background regions; default uses matched GC regions from genome
`-mknown`	motifs.motif (optional)	Known motifs to test for enrichment; the supplied file is used in place of HOMER's default known-motif set
`-len`	8,10,12 (default)	Motif lengths to search; default covers most TF motifs
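
When the same analysis is run across many peak files, it can be convenient to assemble the command programmatically. The sketch below only builds the argument list from the flags in the table above; `homer_command` is a hypothetical wrapper, not something HOMER ships, and actually running the command assumes `findMotifsGenome.pl` is on the PATH with the hg38 genome installed.

```python
import shlex


def homer_command(peaks_bed, genome, out_dir, size="200", threads=8,
                  preparsed_dir=None, background_bed=None):
    """Build a findMotifsGenome.pl argument list (hypothetical wrapper)."""
    cmd = ["findMotifsGenome.pl", peaks_bed, genome, out_dir,
           "-size", str(size), "-mask", "-p", str(threads)]
    if preparsed_dir:
        cmd += ["-preparsedDir", preparsed_dir]
    if background_bed:
        cmd += ["-bg", background_bed]
    return cmd


cmd = homer_command("peaks.bed", "hg38", "homer_output/",
                    preparsed_dir="/data/homer_preparsed/")
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing `size="given"` reproduces the setting recommended for histone and ATAC-seq peaks.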

1b. Output Structure

HOMER produces a structured output directory:

homer_output/
    homerResults.html         # De novo motif results (interactive HTML)
    knownResults.html         # Known motif enrichment results
    homerResults/
        motif1.motif          # Position weight matrix for each de novo motif
        motif2.motif
        ...
    knownResults/
        known1.motif          # Matched known motif PWMs
        ...
    homerMotifs.all.motifs    # All de novo motifs in one file
    seq.autonorm.tsv          # Normalization statistics
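
The .motif files are plain text and can be read without HOMER. The sketch below assumes the usual HOMER motif layout: a tab-separated header line starting with ">" whose first two fields are the consensus and the motif name, followed by one row per position giving A, C, G, T frequencies. `parse_homer_motifs` is a hypothetical helper, and the sample motif is invented rather than taken from real output.

```python
def parse_homer_motifs(text):
    """Parse HOMER-style motif text into a list of dicts (hypothetical helper)."""
    motifs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split("\t")
            motifs.append({
                "consensus": header[0],
                "name": header[1] if len(header) > 1 else "",
                "matrix": [],  # one [A, C, G, T] row per motif position
            })
        elif motifs:
            motifs[-1]["matrix"].append([float(x) for x in line.split()])
    return motifs


# Invented two-position motif, for illustration only
sample = (
    ">GC\t1-GC\t8.5\n"
    "0.1\t0.1\t0.7\t0.1\n"
    "0.1\t0.7\t0.1\t0.1\n"
)
for motif in parse_homer_motifs(sample):
    print(motif["name"], motif["consensus"], len(motif["matrix"]))
```

The same function works on homerMotifs.all.motifs, since that file concatenates several motifs in this layout.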

1c. HOMER for ATAC-seq Peaks

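ATAC-seq peaks represent accessible chromatin, not specific TF binding. Motif analysis on ATAC peaks shows which TF motifs are enriched in accessible regions, and therefore which TFs are candidates for occupying them; it does not by itself demonstrate binding: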

findMotifsGenome.pl atac_peaks.bed hg38 homer_atac_output/ \
    -size given \
    -mask \
    -p 8

Use -size given for ATAC-seq to analyze the full accessible region rather than a fixed window.

Part 2: MEME Suite

The MEME Suite (Bailey et al. 2009) provides a complementary approach with different algorithms and additional capabilities for motif spacing analysis and scanning.

2a. MEME-ChIP: All-in-One Pipeline

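MEME-ChIP runs several tools sequentially: MEME (de novo discovery), DREME (short motif discovery), CentriMo (motif centrality), AME (known motif enrichment), and SpaMo (motif spacing). The exact set depends on the installed MEME Suite version; newer releases, for example, use STREME in place of DREME, so check the tool list reported in the output for your installation.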

meme-chip \
    -meme-maxw 30 \
    -meme-nmotifs 10 \
    -meme-minw 6 \
    -db JASPAR2024_CORE_vertebrates.meme \
    -o memechip_output/ \
    summits.fa

Required input: FASTA file of peak sequences (not BED coordinates -- MEME Suite works on sequences, not genomic intervals).
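
Before launching meme-chip, it can help to confirm that the FASTA holds the expected number of sequences and that the summit-centered windows all came out the same length. The following is a quick sanity check, written as a sketch: `fasta_length_summary` is a hypothetical helper and the sequences are invented.

```python
from collections import Counter


def fasta_length_summary(fasta_text):
    """Return (number of sequences, Counter of sequence lengths). Hypothetical helper."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)  # sequences may wrap over several lines
    if current is not None:
        lengths.append(current)
    return len(lengths), Counter(lengths)


# Invented example in the header style written by bedtools getfasta
fasta = ">chr1:100-108\nACGTACGT\n>chr1:200-208\nACGT\nACGT\n>chr2:5-9\nACGT\n"
count, length_counts = fasta_length_summary(fasta)
print(count, dict(length_counts))
```

More than one distinct length in a TF ChIP-seq summit file usually points to windows that were truncated at a chromosome end or to peaks that were not re-centered.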

Key parameters:

Parameter	Value	Purpose
`-meme-maxw`	30	Maximum motif width; 30 covers most TF motifs
`-meme-minw`	6	Minimum motif width
`-meme-nmotifs`	10	Number of de novo motifs to find
`-db`	JASPAR file	Known motif database for enrichment testing
`-o`	output directory	Results directory
`-meme-mod`	zoops (default)	Zero or one occurrence per sequence; appropriate for ChIP-seq

2b. Individual MEME Suite Tools

AME (Analysis of Motif Enrichment): tests known motifs for enrichment in a sequence set, analogous to HOMER's known motif enrichment
