Batch Analysis of ENCODE Experiments

When to Use

User wants to process, compare, or QC multiple ENCODE experiments simultaneously
User asks about "batch analysis", "bulk processing", "experiment comparison table", or "multi-sample QC"
User needs to screen 5+ experiments for quality before analysis
User wants a summary report or comparison table across many experiments
Example queries: "QC all H3K27ac experiments in liver", "compare quality across 10 ChIP-seq datasets", "batch download and summarize my experiment collection"

Help the user perform systematic batch operations across multiple ENCODE experiments. With 5 or more experiments, as is common in cross-tissue comparisons, multi-mark epigenomic profiling, or large-scale data collection, handling experiments one at a time becomes impractical and error-prone. This skill covers batch discovery, quality screening, download management, pairwise comparison, and report generation using the ENCODE MCP tools.

Literature Foundation

Reference	Journal	Key Contribution	DOI	Citations
ENCODE Project Consortium (2020)	Nature	Expanded encyclopedia of 926,535 candidate cis-regulatory elements across 1,698 cell types; framework for large-scale integrative analysis	10.1038/s41586-020-2493-4	~2,000
Hitz et al. (2023)	Nucleic Acids Research	The ENCODE Uniform Processing Pipelines: standardized processing enables large-scale batch comparisons	10.1093/nar/gkac1067	~50
Landt et al. (2012)	Genome Research	ChIP-seq guidelines of ENCODE/modENCODE: QC metrics (FRiP, NSC, RSC, NRF) for batch quality assessment	10.1101/gr.136184.111	~4,000
Leek et al. (2010)	Nature Reviews Genetics	Tackling batch effects: detection via PCA, correction via ComBat/SVA; essential for multi-lab analyses	10.1038/nrg2825	~1,200
Amemiya et al. (2019)	Scientific Reports	ENCODE Blacklist: artifact regions to exclude across all experiments in batch analyses	10.1038/s41598-019-45839-z	~1,372

Part 1: Batch Discovery and QC Screening

1a. Systematic Experiment Discovery

Start with encode_get_facets to understand the scope of available data before committing to a batch:

encode_get_facets(
    assay_title="Histone ChIP-seq",
    organ="pancreas"
)

This returns counts by target, biosample type, lab, and status. Use facets to estimate how many experiments match your criteria and identify potential batch variables (multiple labs, multiple biosample types).

Then search for all candidate experiments:

results = encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    biosample_type="tissue",
    organism="Homo sapiens",
    limit=100
)

1b. Building the Experiment Table

Create a structured table of all candidate experiments for review:

For each experiment in search results:
    encode_get_experiment(accession="ENCSR...")

Collect into table:
| Accession | Target | Biosample | Lab | Replicates | Audit Status | Date Released |

Key fields to extract:

Accession
Assay title
Target (for ChIP-seq)
Biosample term name
Biosample type (tissue, cell line, primary cell)
Lab
Number of biological replicates
Audit level (ERROR, NOT_COMPLIANT, WARNING)
Assembly
Date released
Pipeline version
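The collection step above can be sketched in Python, assuming each `encode_get_experiment` result has already been converted to a dict. The key names below (`bio_replicate_count`, `audit_level`, and so on) are hypothetical and are not the tool's confirmed output schema, so adapt them to the actual response.

```python
import csv
import io

# Hypothetical column names; the real tool output may use different keys.
COLUMNS = [
    "accession", "assay_title", "target", "biosample_term_name",
    "biosample_type", "lab", "bio_replicate_count", "audit_level",
    "assembly", "date_released", "pipeline_version",
]


def build_experiment_table(records):
    """Return one row per experiment with a fixed set of columns.

    Missing fields become empty strings so gaps are visible during review.
    """
    return [{col: rec.get(col, "") for col in COLUMNS} for rec in records]


def table_to_tsv(rows):
    """Serialize rows as tab-separated text for a report or spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping a fixed column list means every experiment gets the same row shape, which makes missing metadata easy to spot before quality screening.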

1c. Quality Screening Criteria

Apply the ENCODE quality standards (Landt et al. 2012) to filter experiments:

Mandatory exclusion (remove from batch):

Criterion	Threshold	Rationale
Audit level = ERROR	Exclude	Fundamental data quality failure
Assembly mismatch	Exclude if mixed	Cannot combine GRCh38 with hg19
0 replicates	Exclude	No biological replication

Quality flags (include with notation):

Criterion	Threshold	Action
Audit level = NOT_COMPLIANT	Flag	Include but note in report
Single replicate	Flag	Reduced statistical power; note
FRiP < 1% (ChIP-seq)	Flag	Low enrichment; may lack signal
NRF < 0.8	Flag	Low library complexity
NSC < 1.05	Flag	Low signal-to-noise
RSC < 0.8	Flag	Low relative strand correlation
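The flag thresholds in the table above can be applied mechanically. The following is a minimal sketch, assuming the QC metrics have already been extracted as plain numbers and that FRiP is expressed as a fraction (0.01 = 1%); the function name and argument names are illustrative, not part of the ENCODE tools.

```python
def qc_flags(audit_levels=(), replicates=None, frip=None, nrf=None, nsc=None, rsc=None):
    """Return a list of quality flags; an empty list means nothing was flagged.

    Metrics passed as None are skipped, since not every assay reports all of them.
    """
    flags = []
    if "NOT_COMPLIANT" in audit_levels:
        flags.append("NOT_COMPLIANT audit")
    if replicates == 1:
        flags.append("single replicate")
    # (metric name, observed value, threshold below which the metric is flagged)
    checks = [
        ("FRiP", frip, 0.01),  # assumes FRiP given as a fraction, so 1% = 0.01
        ("NRF", nrf, 0.8),
        ("NSC", nsc, 1.05),
        ("RSC", rsc, 0.8),
    ]
    for name, value, threshold in checks:
        if value is not None and value < threshold:
            flags.append(f"{name} < {threshold}")
    return flags
```

Flagged experiments stay in the batch; the returned list is meant to be carried into the report as notation.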

Quality tiers for batch analysis:

Tier	Criteria	Use Case
Tier 1	No audits, 2+ replicates, all QC pass	Gold standard; use for primary analysis
Tier 2	WARNING audits only, 2+ replicates	Acceptable; include with documentation
Tier 3	NOT_COMPLIANT audits or 1 replicate	Use only if Tier 1/2 insufficient; flag heavily
Exclude	ERROR audits or 0 replicates	Never include
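The tier table can be expressed as a small decision function. This is a sketch under stated assumptions: `audit_levels` is a list of audit level strings, `replicates` is the biological replicate count, and `qc_pass` summarizes the metric checks. The table does not say where an experiment with no audits but a failed QC metric belongs; placing it in Tier 2 here is an assumption.

```python
def assign_tier(audit_levels, replicates, qc_pass=True):
    """Map audit levels and replicate count to a quality tier, worst case first."""
    audits = set(audit_levels)
    if "ERROR" in audits or replicates == 0:
        return "Exclude"
    if "NOT_COMPLIANT" in audits or replicates == 1:
        return "Tier 3"
    if "WARNING" in audits:
        return "Tier 2"
    if qc_pass:
        return "Tier 1"
    # Assumption: no audits and 2+ replicates but a failed QC metric is
    # treated as Tier 2; the source table leaves this case undefined.
    return "Tier 2"
```

Checking the most severe condition first ensures that an experiment carrying both WARNING and ERROR audits is excluded rather than placed in Tier 2.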

1d. Identifying Batch Variables

Before proceeding, identify potential confounders across the experiment collection:

Group experiments by:
    - Lab (different labs = potential batch effect)
    - Date released (>1 year gap = potential processing differences)
    - Pipeline version (different versions = different peak calls)
    - Sequencing platform (Illumina vs other)
    - Library prep method

If all experiments for one condition come from one lab and all experiments for another condition come from a different lab, condition and lab are perfectly confounded: any difference between conditions could equally be a lab effect. This cannot be corrected computationally (Leek et al. 2010). Document this limitation.
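A check for this confounded design might look like the sketch below. It assumes each experiment is a dict with a `condition` field (for example the tissue being compared) and a `lab` field; both key names are hypothetical. It detects only the fully confounded case described above, not partial imbalance.

```python
from collections import defaultdict


def is_perfectly_confounded(experiments, condition_key="condition", batch_key="lab"):
    """Return True when each condition maps to exactly one batch and vice versa."""
    batches_by_condition = defaultdict(set)
    conditions_by_batch = defaultdict(set)
    for exp in experiments:
        batches_by_condition[exp[condition_key]].add(exp[batch_key])
        conditions_by_batch[exp[batch_key]].add(exp[condition_key])
    if len(batches_by_condition) < 2:
        return False  # a single condition cannot be confounded with batch
    one_batch_per_condition = all(len(b) == 1 for b in batches_by_condition.values())
    one_condition_per_batch = all(len(c) == 1 for c in conditions_by_batch.values())
    return one_batch_per_condition and one_condition_per_batch
```

The same function can be reused for other batch variables by passing a different `batch_key`, such as a pipeline version field.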

Part 2: Batch Download

2a. Dry Run First

Always preview downloads before committing:

encode_batch_download(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    file_format="bigWig",
    output_type="fold change over control",
    assembly="GRCh38",
    download_dir="/data/encode_batch/",
    preferred_default=True,
    dry_run=True,
    limit=100
)

The dry run returns:

Number of files that would be downloaded
Total estimated size
File list with accessions and sizes

Review before proceeding: Check that the total size is manageable and that no unexpected files are included.
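The review step can be partly automated. The sketch below assumes the dry-run file list has been converted to dicts with `accession`, `size_bytes`, and `file_format` keys; these key names are hypothetical, and the actual dry-run output may be structured differently.

```python
def review_dry_run(files, max_total_bytes, expected_format="bigWig"):
    """Summarize a dry-run file list and report anything that needs a second look."""
    total = sum(f["size_bytes"] for f in files)
    # Files whose format differs from what was requested are listed by accession.
    unexpected = [f["accession"] for f in files if f["file_format"] != expected_format]
    return {
        "file_count": len(files),
        "total_bytes": total,
        "within_budget": total <= max_total_bytes,
        "unexpected": unexpected,
    }
```

Proceeding only when `within_budget` is true and `unexpected` is empty gives a simple gate before setting `dry_run=False`.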

2b. Organizing Downloads

Choose an organization strategy based on your analysis plan:

organize_by	Directory Structure	Best For
`flat`	All files in one directory	Small batches (<20 files)
`experiment`	`ENCSR.../filename`	Per-experiment analysis workflows
`format`	`bigWig/filename`	Downstream tools that expect format-grouped input
`experiment_format`	`ENCSR.../bigWig/filename`	Large multi-format batches

encode_batch_download(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    file_format="bigWig",
    output_type="fold change over control",
    assembly="GRCh38",
    download_dir="/data/encode_batch/",
    organize_by="experiment",
    preferred_default=True,
    verify_md5=True,
    dry_run=False,
    limit=100
)
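To make the four `organize_by` layouts concrete, the sketch below computes where a file would land under each option, following the directory structures in the table above. It illustrates the layout only and is not the tool's implementation; the function name and the placeholder accessions in the usage note are hypothetical.

```python
from pathlib import PurePosixPath


def target_path(download_dir, organize_by, experiment, file_format, filename):
    """Return the destination path for one file under a given organize_by mode."""
    base = PurePosixPath(download_dir)
    if organize_by == "flat":
        return base / filename
    if organize_by == "experiment":
        return base / experiment / filename
    if organize_by == "format":
        return base / file_format / filename
    if organize_by == "experiment_format":
        return base / experiment / file_format / filename
    raise ValueError(f"unknown organize_by value: {organize_by!r}")
```

For example, `target_path("/data/encode_batch/", "experiment", "ENCSR000AAA", "bigWig", "ENCFF000AAA.bigWig")` yields `/data/encode_batch/ENCSR000AAA/ENCFF000AAA.bigWig`.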

2c. Downloading Multiple File Types

For comprehensive analysis, download multiple file types per experiment:

# Signal tracks for visualization and correlation
encode_batch_download(
    ...,
    file_format="bigWig",
    output_type="fold change over control",
    download_dir="/data/encode_batch/signal/",
    dry_run=False
)

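# Peak calls for overlap and annotation
# Note: "IDR thresholded peaks" is the TF ChIP-seq output type; histone ChIP-seq
# experiments typically provide "replicated peaks" or "pseudo-replicated peaks" instead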
encode_batch_download(
    ...,
    file_format="bed",
    output_type="IDR thresholded peaks",
    download_dir="/data/encode_batch/peaks/",
    dry_run=False
)

2d. Handling Download Failures

For large batches, some downloads may fail due to network issues or temporary server errors. The download results report success/failure per file.

Strategy for failures:
1. Note failed file accessions from download results
2. Wait 5 minutes (transient server issues)
3. Retry failed files individually:
    encode_download_fil
