Assess ENCODE Data Quality
When to Use
- User asks about data quality, QC metrics, or whether an experiment is reliable
- User wants to filter experiments by quality (FRiP, NSC, RSC, NRF, IDR, TSS enrichment)
- User asks "is this experiment good enough?" or "should I use this data?"
- User needs to interpret ENCODE audit flags (ERROR, NOT_COMPLIANT, WARNING)
- User wants to compare quality across multiple experiments
- User is selecting high-quality experiments for a meta-analysis or aggregation
Help the user evaluate whether ENCODE experiments meet quality standards for their analysis. Quality assessment is not a single-metric exercise — it requires integrating multiple orthogonal measures in the context of the specific assay, biological system, and analytical goals.
Literature Foundation
| # | Reference | Key Contribution |
|---|---|---|
| 1 | Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~3,500 cit) | ENCODE/modENCODE ChIP-seq guidelines; defined NSC, RSC, NRF, FRiP thresholds |
| 2 | ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3; expanded quality standards to new assays, defined cCRE registry |
| 3 | Buenrostro et al. 2013, Nat Methods, DOI:10.1038/nmeth.2688 (~7,000 cit) | Introduced ATAC-seq; established fragment size and TSS enrichment as key QC |
| 4 | Ou et al. 2018, BMC Genomics, DOI:10.1186/s12864-018-4559-3 | ATACseqQC R package; systematic quality metrics for ATAC-seq |
| 5 | Conesa et al. 2016, Genome Biol, DOI:10.1186/s13059-016-0881-8 (~2,363 cit) | RNA-seq best practices survey; defined mapping rate, rRNA, gene body coverage |
| 6 | Foox et al. 2021, Genome Biol, DOI:10.1186/s13059-021-02529-2 | SEQC2 EpiQC consortium; multi-platform WGBS benchmarking |
| 7 | Yardimci et al. 2019, Genome Biol, DOI:10.1186/s13059-019-1658-7 | Hi-C quality measures; cis/trans ratio, distance-dependent decay, resolution |
| 8 | Skene & Henikoff 2017, eLife, DOI:10.7554/eLife.21856 (~1,800 cit) | CUT&RUN method; established spike-in normalization and low-background QC |
| 9 | Kaya-Okur et al. 2019, Nat Commun, DOI:10.1038/s41467-019-09982-5 (~1,200 cit) | CUT&Tag method; tagmentation-based profiling with distinct QC profile |
| 10 | Li et al. 2011, Ann Appl Stat, DOI:10.1214/11-AOAS466 (~1,500 cit) | Irreproducible Discovery Rate (IDR); principled replicate concordance |
| 11 | Hitz et al. 2023, Nucleic Acids Res, DOI:10.1093/nar/gkad243 | ENCODE uniform processing pipelines; standardized QC across all assays |
| 12 | Nordin et al. 2023, Genome Biol, DOI:10.1186/s13059-023-03027-3 | CUT&RUN suspect list; identified artifact-prone regions specific to CUT&RUN/CUT&Tag |
| 13 | Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit) | ENCODE Blacklist v2; artifact regions to exclude from all analyses |
Step 1: Retrieve Experiment Details and Audit Status
Use encode_get_experiment with the accession to get full metadata including:
- Audit status (ERROR, NOT_COMPLIANT, WARNING, INTERNAL_ACTION)
- Replicate information (biological and technical replicates)
- Pipeline and analysis details (which ENCODE uniform pipeline was used)
- Quality metrics embedded in file objects
encode_get_experiment(accession="ENCSR...")
For batch assessment across multiple experiments:
encode_search_experiments(assay_title="...", organ="...", limit=50)
# Then iterate through results checking audit flags
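The iteration over search results could look like the following minimal sketch. It assumes each experiment record is a plain dictionary with an `accession` string and an `audit` mapping from severity level to a list of audit entries. That shape, and the helper names `audit_counts` and `batch_audit_table`, are illustrative assumptions and not part of the tools above.

```python
# Severity levels as listed in Step 1, ordered from most to least severe.
AUDIT_LEVELS = ("ERROR", "NOT_COMPLIANT", "WARNING", "INTERNAL_ACTION")


def audit_counts(experiment: dict) -> dict:
    """Count audit entries per severity level for one experiment record.

    Assumes (hypothetically) that the record carries an "audit" mapping of
    level -> list of entries; a missing level means no audits at that level.
    """
    audits = experiment.get("audit") or {}
    return {level: len(audits.get(level, [])) for level in AUDIT_LEVELS}


def batch_audit_table(experiments: list) -> list:
    """Build one row per experiment, sorted so the cleanest come first."""
    rows = []
    for exp in experiments:
        row = {"accession": exp.get("accession", "unknown")}
        row.update(audit_counts(exp))
        rows.append(row)
    # Sort by ERROR count, then NOT_COMPLIANT, then WARNING, and so on.
    rows.sort(key=lambda r: tuple(r[level] for level in AUDIT_LEVELS))
    return rows


if __name__ == "__main__":
    # Placeholder records standing in for search results.
    demo = [
        {"accession": "EXP_A", "audit": {"ERROR": [{}], "WARNING": [{}, {}]}},
        {"accession": "EXP_B", "audit": {"WARNING": [{}]}},
        {"accession": "EXP_C"},
    ]
    for row in batch_audit_table(demo):
        print(row)
```

Sorting by severity tuple gives a quick shortlist, but the counts are only a first pass; each flag still needs to be read and explained as described in Step 2.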
Step 2: Interpret ENCODE Audit Flags
ENCODE audits are generated by automated validators during the ENCODE uniform processing pipeline (Hitz et al. 2023). They flag experiments by severity:
| Level | Meaning | Action |
|---|---|---|
| ERROR | Critical issues — data may be unreliable | Avoid using unless no alternative exists. Document thoroughly if used. |
| NOT_COMPLIANT | Does not meet current ENCODE standards | Usable with caveats. Check which specific standard is violated. |
| WARNING | Minor issues detected | Generally safe. Document the specific warning. |
| INTERNAL_ACTION | DCC processing notes | Usually not a concern for external users. |
Common audit categories and what they mean:
| Audit Category | What It Checks |
|---|---|
| replicate concordance | IDR or correlation between biological replicates |
| library complexity | NRF, PBC1, PBC2 — whether library is saturated |
| read depth | Whether minimum depth thresholds are met |
| control quality | Whether input/IgG control is adequate |
| mapping quality | Alignment rate and uniquely mapped fraction |
| peak calling | Whether peaks were called successfully, FRiP |
| antibody validation | Whether antibody meets ENCODE standards |
Present every audit flag to the user and explain each one. A single ERROR audit does not automatically disqualify an experiment — context matters.
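One way to present every flag alongside the recommended action from the severity table is sketched below. It assumes each audit entry is a dictionary with `category` and `detail` keys; those field names and the helpers `explain_audits` and `worst_level` are assumptions made for illustration.

```python
# Recommended actions, paraphrased from the severity table in this step.
ACTIONS = {
    "ERROR": "Avoid unless no alternative exists; document thoroughly if used.",
    "NOT_COMPLIANT": "Usable with caveats; check which standard is violated.",
    "WARNING": "Generally safe; document the specific warning.",
    "INTERNAL_ACTION": "Usually not a concern for external users.",
}
SEVERITY_ORDER = ("ERROR", "NOT_COMPLIANT", "WARNING", "INTERNAL_ACTION")


def worst_level(audit: dict):
    """Return the most severe level that has at least one entry, else None."""
    for level in SEVERITY_ORDER:
        if audit.get(level):
            return level
    return None


def explain_audits(audit: dict) -> list:
    """Produce one human-readable line per audit entry, most severe first.

    Entry keys "category" and "detail" are assumed, not guaranteed.
    """
    lines = []
    for level in SEVERITY_ORDER:
        for entry in audit.get(level, []):
            category = entry.get("category", "uncategorized")
            detail = entry.get("detail", "no detail provided")
            lines.append(f"[{level}] {category}: {detail} | {ACTIONS[level]}")
    return lines


if __name__ == "__main__":
    example = {
        "WARNING": [{"category": "read depth", "detail": "below recommended"}],
        "NOT_COMPLIANT": [{"category": "library complexity", "detail": "low NRF"}],
    }
    print("Worst level:", worst_level(example))
    for line in explain_audits(example):
        print(line)
```

The worst level is a summary only. Listing every entry keeps the context visible, which matters because a single flag does not settle whether the experiment is usable.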
Step 3: Evaluate ChIP-seq Quality (Landt et al. 2012)
The ENCODE ChIP-seq guidelines (Landt et al. 2012) established the foundational metrics still used today. These were developed from analysis of hundreds of ChIP-seq experiments and reflect empirically-derived thresholds.
Core Metrics
| Metric | Threshold | Concern | What It Measures | Why It Matters |
|---|---|---|---|---|
| FRiP | ≥1% (TF), ≥5% (histone) | Below threshold | Fraction of reads in peaks | Signal enrichment. Very low FRiP means most reads are background. TF ChIP typically has lower FRiP than broad histone marks. |
| NSC | >1.05 | ≤1.05 | Normalized strand cross-correlation | Signal-to-noise ratio. Computed from strand shift analysis. Values near 1.0 indicate no enrichment. |
| RSC | >0.8 | ≤0.8 | Relative strand cross-correlation | Signal relative to phantom peak. More robust than NSC for shallow libraries. |
| NRF | ≥0.8 | <0.8 | Non-redundant fraction (unique/total) | Library complexity. Low NRF = excessive PCR duplication = wasted sequencing. |
| PBC1 | ≥0.8 | <0.5 | PCR bottleneck coefficient 1 | N1/Nd: fraction of distinct genomic locations (Nd) covered by exactly 1 read (N1). More sensitive than NRF at high depth. Values between 0.5 and 0.8 indicate moderate bottlenecking. |
| PBC2 | ≥3 | <1 | PCR bottleneck coefficient 2 | N1/N2: ratio of 1-read to 2-read locations. <1 indicates severe bottleneck. |
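To make the library-complexity definitions concrete, the sketch below computes NRF, PBC1, and PBC2 from a list of read positions and then compares a set of metrics against the thresholds in the table. The function names and the position representation (any hashable value, such as a `(chrom, start, strand)` tuple) are illustrative assumptions. The ENCODE pipeline computes these metrics from the alignments itself, so this is only a way to see what the numbers mean.

```python
from collections import Counter


def library_complexity(positions) -> dict:
    """Compute NRF, PBC1, PBC2 from an iterable of read positions."""
    per_location = Counter(positions)
    total_reads = sum(per_location.values())
    distinct = len(per_location)                            # Nd
    n1 = sum(1 for c in per_location.values() if c == 1)    # N1
    n2 = sum(1 for c in per_location.values() if c == 2)    # N2
    return {
        "NRF": distinct / total_reads if total_reads else float("nan"),
        "PBC1": n1 / distinct if distinct else float("nan"),
        # With no 2-read locations the ratio is unbounded.
        "PBC2": n1 / n2 if n2 else float("inf"),
    }


def flag_chipseq_metrics(metrics: dict, target: str = "TF") -> dict:
    """Label each supplied metric against the thresholds in the table above."""
    frip_min = 0.01 if target == "TF" else 0.05
    checks = {
        "FRiP": lambda v: v >= frip_min,
        "NSC": lambda v: v > 1.05,
        "RSC": lambda v: v > 0.8,
        "NRF": lambda v: v >= 0.8,
        "PBC1": lambda v: v >= 0.8,
        "PBC2": lambda v: v >= 3,
    }
    return {
        name: "pass" if passes(metrics[name]) else "below threshold"
        for name, passes in checks.items()
        if name in metrics
    }


if __name__ == "__main__":
    # Toy positions: "a" and "d" are seen twice, "b", "c", "e" once.
    toy = ["a", "a", "b", "c", "d", "d", "e"]
    complexity = library_complexity(toy)
    print(complexity)
    print(flag_chipseq_metrics(complexity))
```

A "below threshold" label is a prompt to look closer, not a verdict; the metrics should be read together and in the context of the assay and target.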
Read Depth Requirements
| Target Type | Minimum per Replicate | Recommended | Notes |
|---|---|---|---|
| Transcription factor | 10M uniquely mapped | 20M | Narrow peaks, need depth for detection |
| Broad histone mark (H3K27me3, H3K9me3, H3K36me3) | 20M uniquely mapped | 45M | Broad domains require more reads |
| Narrow histone mark (H3K4me3, H3K27ac) | 20M uniquely mapped | 20M | Sharp peaks, similar to TF |
| Input/IgG control | 10M uniquely mapped | Match IP depth | Should match or exceed IP library depth |
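The depth table can be applied mechanically, as in this minimal sketch. The target-type keys (`tf`, `broad_histone`, `narrow_histone`) and function names are hypothetical labels chosen for the example; the numbers are taken from the table above.

```python
# (minimum, recommended) uniquely mapped reads per replicate, from the table.
DEPTH_REQUIREMENTS = {
    "tf": (10_000_000, 20_000_000),
    "broad_histone": (20_000_000, 45_000_000),
    "narrow_histone": (20_000_000, 20_000_000),
}
CONTROL_MINIMUM = 10_000_000


def depth_status(target_type: str, uniquely_mapped_reads: int) -> str:
    """Classify one replicate's depth for the given target type."""
    minimum, recommended = DEPTH_REQUIREMENTS[target_type]
    if uniquely_mapped_reads < minimum:
        return "below minimum"
    if uniquely_mapped_reads < recommended:
        return "meets minimum, below recommended"
    return "meets recommended"


def control_status(control_reads: int, ip_reads: int) -> str:
    """Classify an input/IgG control relative to its IP library."""
    if control_reads < CONTROL_MINIMUM:
        return "below minimum"
    if control_reads < ip_reads:
        return "meets minimum, shallower than IP"
    return "matches or exceeds IP depth"


if __name__ == "__main__":
    print(depth_status("tf", 12_000_000))
    print(depth_status("broad_histone", 50_000_000))
    print(control_status(15_000_000, 25_000_000))
```

Depth should be checked per replicate rather than on pooled counts, since the table states per-replicate minimums.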
IDR Analysis (Li et al. 2011)
The Irreproducible Discovery Rate provides principled assessment of replicate concordance:
| IDR Comparison | Expected | Concern | Interpretation |
|---|---|---|---|
| Nt (true replicates) | ≥50% of Np | <50% Np | Low concordance between biological replicates |
| Np (pooled pseudoreplicates) | Reference set | — | Represents total discoverable peaks |
| Self-consistency (Ns) | ≥50% of Np | <50% Np | Individual replicate quality |
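| Rescue ratio (max(Np,Nt)/min(Np,Nt)) | <2 | >2 | High ratio = pooled pseudoreplicates and true replicates yield very different peak counts, so the replicates are poorly concordant |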
Key insight: IDR thresholded peaks represent peaks passing replicate concordance analysis. Pseudoreplicated peaks = single-replicate fallback (lower confidence). Optimal IDR peaks from pooled data = most complete peak set.
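The replicate-concordance check can be sketched as follows. This example uses the ratio definitions as I understand them from the ENCODE ChIP-seq data standards: rescue ratio = max(Np, Nt) / min(Np, Nt) and self-consistency ratio = max(N1, N2) / min(N1, N2), where N1 and N2 are the self-pseudoreplicate peak counts for each replicate. Treat those definitions, the pass/borderline/fail labels, and the function names as assumptions to verify against the pipeline output for your experiment.

```python
def _max_over_min(a: int, b: int) -> float:
    """Return max/min of two peak counts; infinite if the smaller is zero."""
    high, low = max(a, b), min(a, b)
    if low == 0:
        return float("inf")
    return high / low


def idr_reproducibility(n_t: int, n_p: int, n_1: int, n_2: int,
                        cutoff: float = 2.0) -> dict:
    """Summarize replicate concordance from IDR peak counts.

    n_t: peaks from true replicates
    n_p: peaks from pooled pseudoreplicates
    n_1, n_2: peaks from self-pseudoreplicates of replicate 1 and 2
    """
    rescue = _max_over_min(n_p, n_t)
    self_consistency = _max_over_min(n_1, n_2)
    # Count how many of the two ratios exceed the cutoff.
    n_exceeding = int(rescue > cutoff) + int(self_consistency > cutoff)
    verdict = ("pass", "borderline", "fail")[n_exceeding]
    return {
        "rescue_ratio": rescue,
        "self_consistency_ratio": self_consistency,
        "verdict": verdict,
    }


if __name__ == "__main__":
    print(idr_reproducibility(n_t=30_000, n_p=40_000, n_1=25_000, n_2=20_000))
    print(idr_reproducibility(n_t=10_000, n_p=40_000, n_1=25_000, n_2=20_000))
```

A borderline or failing verdict usually points to one weak replicate or to poor concordance overall, which is worth reporting to the user together with the underlying peak counts.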
Antibody Validation
ENCODE requires characterization for every antibody:
- Primary: IP followed by mass spectrometry or immunoprecipitation-western
- Secondary: At least one of: knockdown/knockout, motif enrichment, genomic annotation enrichment
- Check the antibody_lot_reviews field in