Batch Analysis of ENCODE Experiments
When to Use
- User wants to process, compare, or QC multiple ENCODE experiments simultaneously
- User asks about "batch analysis", "bulk processing", "experiment comparison table", or "multi-sample QC"
- User needs to screen 5+ experiments for quality before analysis
- User wants a summary report or comparison table across many experiments
- Example queries: "QC all H3K27ac experiments in liver", "compare quality across 10 ChIP-seq datasets", "batch download and summarize my experiment collection"
Help the user perform systematic batch operations across multiple ENCODE experiments. With 5 or more experiments (common in cross-tissue comparisons, multi-mark epigenomic profiling, and large-scale data collection), experiment-by-experiment workflows become impractical and error-prone. This skill covers batch discovery, quality screening, download management, pairwise comparison, and report generation using the ENCODE MCP tools.
Literature Foundation
| Reference | Journal | Key Contribution | DOI | Citations |
|---|---|---|---|---|
| ENCODE Project Consortium (2020) | Nature | Expanded encyclopedia of 926,535 candidate cis-regulatory elements across 1,698 cell types; framework for large-scale integrative analysis | 10.1038/s41586-020-2493-4 | ~2,000 |
| Hitz et al. (2023) | Nucleic Acids Research | The ENCODE Uniform Processing Pipelines: standardized processing enables large-scale batch comparisons | 10.1093/nar/gkac1067 | ~50 |
| Landt et al. (2012) | Genome Research | ChIP-seq guidelines of ENCODE/modENCODE: QC metrics (FRiP, NSC, RSC, NRF) for batch quality assessment | 10.1101/gr.136184.111 | ~4,000 |
| Leek et al. (2010) | Nature Reviews Genetics | Tackling batch effects: detection via PCA, correction via ComBat/SVA; essential for multi-lab analyses | 10.1038/nrg2825 | ~1,200 |
| Amemiya et al. (2019) | Scientific Reports | ENCODE Blacklist: artifact regions to exclude across all experiments in batch analyses | 10.1038/s41598-019-45839-z | ~1,372 |
Part 1: Batch Discovery and QC Screening
1a. Systematic Experiment Discovery
Start with encode_get_facets to understand the scope of available data before committing to a batch:
encode_get_facets(
    assay_title="Histone ChIP-seq",
    organ="pancreas"
)
This returns counts by target, biosample type, lab, and status. Use facets to estimate how many experiments match your criteria and identify potential batch variables (multiple labs, multiple biosample types).
Then search for all candidate experiments:
results = encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    biosample_type="tissue",
    organism="Homo sapiens",
    limit=100
)
1b. Building the Experiment Table
Create a structured table of all candidate experiments for review:
For each experiment in search results:
    encode_get_experiment(accession="ENCSR...")
Collect into table:
| Accession | Target | Biosample | Lab | Replicates | Audit Status | Date Released |
Key fields to extract:
- Accession
- Assay title
- Target (for ChIP-seq)
- Biosample term name
- Biosample type (tissue, cell line, primary cell)
- Lab
- Number of biological replicates
- Audit level (ERROR, NOT_COMPLIANT, WARNING)
- Assembly
- Date released
- Pipeline version
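The table-building step can be sketched in Python. This is a minimal sketch, assuming each `encode_get_experiment` result has already been converted to a plain dictionary. The record keys (`accession`, `target`, `biosample_term_name`, `lab`, `replicate_count`, `audit_level`, `date_released`) and the helpers `build_experiment_table` and `to_markdown` are hypothetical and may not match the tool's actual output.

```python
# Hypothetical mapping from table column to record key; adjust the keys
# to whatever the experiment records actually contain.
COLUMNS = {
    "Accession": "accession",
    "Target": "target",
    "Biosample": "biosample_term_name",
    "Lab": "lab",
    "Replicates": "replicate_count",
    "Audit Status": "audit_level",
    "Date Released": "date_released",
}


def build_experiment_table(records):
    """Flatten experiment records into uniform rows; missing fields become None."""
    return [{col: rec.get(key) for col, key in COLUMNS.items()} for rec in records]


def to_markdown(rows):
    """Render rows as a pipe table with the columns defined above."""
    header = "| " + " | ".join(COLUMNS) + " |"
    separator = "|" + "---|" * len(COLUMNS)
    body = [
        "| " + " | ".join("" if row[col] is None else str(row[col]) for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, separator, *body])
```

Missing fields stay visible as empty cells rather than being dropped, which makes gaps in the metadata easy to spot during review.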
1c. Quality Screening Criteria
Apply the ENCODE quality standards (Landt et al. 2012) to filter experiments:
Mandatory exclusion (remove from batch):
| Criterion | Threshold | Rationale |
|---|---|---|
| Audit level = ERROR | Exclude | Fundamental data quality failure |
| Assembly mismatch | Exclude if mixed | Cannot combine GRCh38 with hg19 |
| 0 replicates | Exclude | No biological replication |
Quality flags (include with notation):
| Criterion | Threshold | Action |
|---|---|---|
| Audit level = NOT_COMPLIANT | Flag | Include but note in report |
| Single replicate | Flag | Reduced statistical power; note |
| FRiP < 1% (ChIP-seq) | Flag | Low enrichment; may lack signal |
| NRF < 0.8 | Flag | Low library complexity |
| NSC < 1.05 | Flag | Low signal-to-noise |
| RSC < 0.8 | Flag | Low relative strand correlation |
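The metric-based flags in this table can be applied programmatically. The sketch below assumes the QC metrics for one experiment are available as a dictionary keyed by `frip`, `nrf`, `nsc`, and `rsc`, with FRiP expressed as a fraction (so 1% is 0.01). Both the key names and the helper `qc_flags` are hypothetical.

```python
# Thresholds taken from the flag table above. A value strictly below its
# threshold is flagged. FRiP is assumed to be a fraction (1% == 0.01).
QC_THRESHOLDS = {
    "frip": (0.01, "Low enrichment; may lack signal"),
    "nrf": (0.8, "Low library complexity"),
    "nsc": (1.05, "Low signal-to-noise"),
    "rsc": (0.8, "Low relative strand correlation"),
}


def qc_flags(metrics):
    """Return one human-readable flag per metric that falls below its threshold."""
    flags = []
    for name, (threshold, reason) in QC_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported; nothing to flag here
        if value < threshold:
            flags.append(f"{name.upper()} {value} < {threshold}: {reason}")
    return flags
```

Flagged experiments remain in the batch; the returned strings are meant to be carried into the report as annotations.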
Quality tiers for batch analysis:
| Tier | Criteria | Use Case |
|---|---|---|
| Tier 1 | No audits, 2+ replicates, all QC pass | Gold standard; use for primary analysis |
| Tier 2 | WARNING audits only, 2+ replicates | Acceptable; include with documentation |
| Tier 3 | NOT_COMPLIANT audits or 1 replicate | Use only if Tier 1/2 insufficient; flag heavily |
| Exclude | ERROR audits or 0 replicates | Never include |
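The tier table translates into a small decision function. This is a sketch under stated assumptions: `assign_tier` is a hypothetical helper, audit levels are passed as a set of strings, and an experiment with no audits but a failed QC metric is placed in Tier 2, a case the table does not address explicitly.

```python
def assign_tier(audit_levels, n_replicates, qc_pass=True):
    """Map audit levels, replicate count, and QC status to a quality tier.

    audit_levels: set of audit level names, e.g. {"WARNING"}; empty if none.
    n_replicates: number of biological replicates.
    qc_pass: True if no metric-based QC flag was raised.
    """
    if "ERROR" in audit_levels or n_replicates == 0:
        return "Exclude"
    if "NOT_COMPLIANT" in audit_levels or n_replicates == 1:
        return "Tier 3"
    if not audit_levels and qc_pass:
        return "Tier 1"
    # Assumption: WARNING-only audits, or clean audits with a QC flag.
    return "Tier 2"
```

The checks run from most to least severe, so an experiment with both ERROR and WARNING audits is excluded rather than placed in Tier 2.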
1d. Identifying Batch Variables
Before proceeding, identify potential confounders across the experiment collection:
Group experiments by:
- Lab (different labs = potential batch effect)
- Date released (>1 year gap = potential processing differences)
- Pipeline version (different versions = different peak calls)
- Sequencing platform (Illumina vs other)
- Library prep method
If all experiments of one condition come from one lab and all experiments of another condition come from a different lab, the design is confounded: lab and condition cannot be separated, so no computational correction can recover the biological effect (Leek et al. 2010). Document this limitation.
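A quick check for this kind of confounding can be sketched as follows. The helper `lab_confounding` is hypothetical and takes (condition, lab) pairs drawn from the experiment table. It reports the design as fully confounded when there are at least two conditions and no lab contributes to more than one of them.

```python
from collections import defaultdict


def lab_confounding(experiments):
    """Summarize how labs are distributed across conditions.

    experiments: iterable of (condition, lab) pairs.
    """
    labs_by_condition = defaultdict(set)
    conditions_by_lab = defaultdict(set)
    for condition, lab in experiments:
        labs_by_condition[condition].add(lab)
        conditions_by_lab[lab].add(condition)

    # Fully confounded: knowing the lab tells you the condition.
    fully_confounded = len(labs_by_condition) > 1 and all(
        len(conditions) == 1 for conditions in conditions_by_lab.values()
    )
    return {
        "labs_by_condition": {c: sorted(labs) for c, labs in labs_by_condition.items()},
        "fully_confounded": fully_confounded,
    }
```

A result of `False` does not rule out partial confounding; the `labs_by_condition` summary is still worth inspecting by eye.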
Part 2: Batch Download
2a. Dry Run First
Always preview downloads before committing:
encode_batch_download(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    file_format="bigWig",
    output_type="fold change over control",
    assembly="GRCh38",
    download_dir="/data/encode_batch/",
    preferred_default=True,
    dry_run=True,
    limit=100
)
The dry run returns:
- Number of files that would be downloaded
- Total estimated size
- File list with accessions and sizes
Review before proceeding: Check that the total size is manageable and that no unexpected files are included.
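The review step can be partly automated. The sketch below assumes the dry-run file list has been reduced to dictionaries with an `accession` and a `file_size` in bytes. Those key names, the helper `summarize_dry_run`, and the size budget are all hypothetical, and the real dry-run output may be shaped differently.

```python
def summarize_dry_run(files, budget_gb):
    """Summarize a dry-run file list and compare its total size to a budget.

    files: list of dicts with hypothetical keys "accession" and "file_size" (bytes).
    budget_gb: maximum acceptable total download size in gigabytes (1e9 bytes).
    """
    total_gb = sum(f["file_size"] for f in files) / 1e9
    largest = max(files, key=lambda f: f["file_size"], default=None)
    return {
        "n_files": len(files),
        "total_gb": round(total_gb, 2),
        "largest": largest["accession"] if largest else None,
        "within_budget": total_gb <= budget_gb,
    }
```

If `within_budget` is false, narrowing the search filters or lowering `limit` before the real download is usually simpler than cleaning up afterwards.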
2b. Organizing Downloads
Choose an organization strategy based on your analysis plan:
| organize_by | Directory Structure | Best For |
|---|---|---|
| flat | All files in one directory | Small batches (<20 files) |
| experiment | ENCSR.../filename | Per-experiment analysis workflows |
| format | bigWig/filename | Downstream tools that expect format-grouped input |
| experiment_format | ENCSR.../bigWig/filename | Large multi-format batches |
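To make the four layouts concrete, the sketch below computes where a file would land under each strategy. It only illustrates the table above and is not the tool's implementation; `target_path` is a hypothetical helper, and the accessions in the usage note are placeholders.

```python
from pathlib import PurePosixPath


def target_path(download_dir, organize_by, experiment, file_format, filename):
    """Return the destination path for one file under a given organize_by value."""
    base = PurePosixPath(download_dir)
    layouts = {
        "flat": base / filename,
        "experiment": base / experiment / filename,
        "format": base / file_format / filename,
        "experiment_format": base / experiment / file_format / filename,
    }
    if organize_by not in layouts:
        raise ValueError(f"unknown organize_by value: {organize_by!r}")
    return layouts[organize_by]
```

For example, `target_path("/data/encode_batch/", "experiment", "ENCSR000AAA", "bigWig", "ENCFF000AAA.bigWig")` yields `/data/encode_batch/ENCSR000AAA/ENCFF000AAA.bigWig`.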
encode_batch_download(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    file_format="bigWig",
    output_type="fold change over control",
    assembly="GRCh38",
    download_dir="/data/encode_batch/",
    organize_by="experiment",
    preferred_default=True,
    verify_md5=True,
    dry_run=False,
    limit=100
)
2c. Downloading Multiple File Types
For comprehensive analysis, download multiple file types per experiment:
# Signal tracks for visualization and correlation
encode_batch_download(
    ...,
    file_format="bigWig",
    output_type="fold change over control",
    download_dir="/data/encode_batch/signal/",
    dry_run=False
)
# Peak calls for overlap and annotation
encode_batch_download(
    ...,
    file_format="bed",
    output_type="IDR thresholded peaks",
    download_dir="/data/encode_batch/peaks/",
    dry_run=False
)
2d. Handling Download Failures
For large batches, some downloads may fail due to network issues or temporary server errors. The download results report success/failure per file.
Strategy for failures:
1. Note failed file accessions from download results
2. Wait 5 minutes (transient server issues)
3. Retry failed files individually:
encode_download_fil