Assess ENCODE Data Quality

When to Use

User asks about data quality, QC metrics, or whether an experiment is reliable
User wants to filter experiments by quality (FRiP, NSC, RSC, NRF, IDR, TSS enrichment)
User asks "is this experiment good enough?" or "should I use this data?"
User needs to interpret ENCODE audit flags (ERROR, NOT_COMPLIANT, WARNING)
User wants to compare quality across multiple experiments
User is selecting high-quality experiments for a meta-analysis or aggregation

Help the user evaluate whether ENCODE experiments meet quality standards for their analysis. Quality assessment is not a single-metric exercise — it requires integrating multiple orthogonal measures in the context of the specific assay, biological system, and analytical goals.

Literature Foundation

#	Reference	Key Contribution
1	Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~3,500 cit)	ENCODE/modENCODE ChIP-seq guidelines; defined NSC, RSC, NRF, FRiP thresholds
2	ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit)	ENCODE Phase 3; expanded quality standards to new assays, defined cCRE registry
3	Buenrostro et al. 2013, Nat Methods, DOI:10.1038/nmeth.2688 (~7,000 cit)	Introduced ATAC-seq; established fragment size and TSS enrichment as key QC
4	Ou et al. 2018, BMC Genomics, DOI:10.1186/s12864-018-4559-3	ATACseqQC R package; systematic quality metrics for ATAC-seq
5	Conesa et al. 2016, Genome Biol, DOI:10.1186/s13059-016-0881-8 (~2,363 cit)	RNA-seq best practices survey; defined mapping rate, rRNA, gene body coverage
6	Foox et al. 2021, Genome Biol, DOI:10.1186/s13059-021-02529-2	SEQC2 EpiQC consortium; multi-platform WGBS benchmarking
7	Yardimci et al. 2019, Genome Biol, DOI:10.1186/s13059-019-1658-7	Hi-C quality measures; cis/trans ratio, distance-dependent decay, resolution
8	Skene & Henikoff 2017, eLife, DOI:10.7554/eLife.21856 (~1,800 cit)	CUT&RUN method; established spike-in normalization and low-background QC
9	Kaya-Okur et al. 2019, Nat Commun, DOI:10.1038/s41467-019-09982-5 (~1,200 cit)	CUT&Tag method; tagmentation-based profiling with distinct QC profile
10	Li et al. 2011, Ann Appl Stat, DOI:10.1214/11-AOAS466 (~1,500 cit)	Irreproducible Discovery Rate (IDR); principled replicate concordance
11	Hitz et al. 2023, Nucleic Acids Res, DOI:10.1093/nar/gkad243	ENCODE uniform processing pipelines; standardized QC across all assays
12	Nordin et al. 2023, Genome Biol, DOI:10.1186/s13059-023-03027-3	CUT&RUN suspect list; identified artifact-prone regions specific to CUT&RUN/CUT&Tag
13	Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit)	ENCODE Blacklist v2; artifact regions to exclude from all analyses

Step 1: Retrieve Experiment Details and Audit Status

Use encode_get_experiment with the accession to get full metadata including:

Audit status (ERROR, NOT_COMPLIANT, WARNING, INTERNAL_ACTION)
Replicate information (biological and technical replicates)
Pipeline and analysis details (which ENCODE uniform pipeline was used)
Quality metrics embedded in file objects

encode_get_experiment(accession="ENCSR...")

For batch assessment across multiple experiments:

encode_search_experiments(assay_title="...", organ="...", limit=50)
# Then iterate through results checking audit flags

Step 2: Interpret ENCODE Audit Flags

ENCODE audits are generated by automated validators during the ENCODE uniform processing pipeline (Hitz et al. 2023). They flag experiments by severity:

Level	Meaning	Action
ERROR	Critical issues — data may be unreliable	Avoid using unless no alternative exists. Document thoroughly if used.
NOT_COMPLIANT	Does not meet current ENCODE standards	Usable with caveats. Check which specific standard is violated.
WARNING	Minor issues detected	Generally safe. Document the specific warning.
INTERNAL_ACTION	DCC processing notes	Usually not a concern for external users.

Common audit categories and what they mean:

Audit Category	What It Checks
`replicate concordance`	IDR or correlation between biological replicates
`library complexity`	NRF, PBC1, PBC2 — whether library is saturated
`read depth`	Whether minimum depth thresholds are met
`control quality`	Whether input/IgG control is adequate
`mapping quality`	Alignment rate and uniquely mapped fraction
`peak calling`	Whether peaks were called successfully, FRiP
`antibody validation`	Whether antibody meets ENCODE standards

Present every audit flag to the user and explain each one. A single ERROR audit does not automatically disqualify an experiment: check which category triggered it and whether that issue affects the user's planned analysis.
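The triage described above can be sketched in Python. This is a minimal illustration, assuming the experiment metadata is available as a dict whose `audit` key maps each severity level to a list of entries carrying a `category` field. That shape, the helper name `summarize_audits`, and the example accession and categories are assumptions for illustration, not output from a real experiment.

```python
from collections import Counter

# Severity order follows the audit-level table above, most to least serious.
SEVERITY_ORDER = ["ERROR", "NOT_COMPLIANT", "WARNING", "INTERNAL_ACTION"]


def summarize_audits(experiment: dict) -> list[tuple[str, str, int]]:
    """Return (level, category, count) rows sorted by severity.

    Assumes experiment["audit"] maps each level to a list of dicts with a
    "category" key; missing levels are treated as having no audits.
    """
    audits = experiment.get("audit", {})
    rows = []
    for level in SEVERITY_ORDER:
        counts = Counter(a.get("category", "unknown") for a in audits.get(level, []))
        for category, n in sorted(counts.items()):
            rows.append((level, category, n))
    return rows


# Hypothetical metadata; the accession and categories are placeholders.
example = {
    "accession": "ENCSR000XXX",
    "audit": {
        "WARNING": [{"category": "low read depth"}, {"category": "low read depth"}],
        "ERROR": [{"category": "extremely low read depth"}],
    },
}

for level, category, n in summarize_audits(example):
    print(f"{level:15s} {category} (x{n})")
```

Grouping by level and category keeps the report short while still surfacing every flag, so none is silently dropped before it is explained to the user.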

Step 3: Evaluate ChIP-seq Quality (Landt et al. 2012)

The ENCODE ChIP-seq guidelines (Landt et al. 2012) established the foundational metrics still used today. The thresholds were derived empirically from the analysis of hundreds of ChIP-seq experiments.

Core Metrics

Metric	Threshold	Concern	What It Measures	Why It Matters
FRiP	≥1% (TF), ≥5% (histone)	Below threshold	Fraction of reads in peaks	Signal enrichment. Very low FRiP means most reads are background. TF ChIP typically has lower FRiP than broad histone marks.
NSC	>1.05	≤1.05	Normalized strand cross-correlation	Signal-to-noise ratio. Computed from strand shift analysis. Values near 1.0 indicate no enrichment.
RSC	>0.8	≤0.8	Relative strand cross-correlation	Signal relative to phantom peak. More robust than NSC for shallow libraries.
NRF	≥0.8	<0.8	Non-redundant fraction (unique/total)	Library complexity. Low NRF = excessive PCR duplication = wasted sequencing.
PBC1	≥0.8	<0.5	PCR bottleneck coefficient 1	N1/Nd: fraction of distinct locations with exactly 1 read. More sensitive than NRF at high depth.
PBC2	≥3	<1	PCR bottleneck coefficient 2	N1/N2: ratio of 1-read to 2-read locations. <1 indicates severe bottleneck.
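The library complexity definitions in the table (NRF as unique/total, PBC1 as N1/Nd, PBC2 as N1/N2) can be worked through on a toy example. The sketch below assumes one hashable location per uniquely mapped read; the function names and the toy read positions are hypothetical, and the concern cutoffs are copied from the "Concern" column above.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF, PBC1 and PBC2 from one location per uniquely mapped read.

    positions: iterable of hashable read locations, e.g. (chrom, start, strand).
    """
    per_location = Counter(positions)
    total_reads = sum(per_location.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    distinct = len(per_location)                            # Nd: distinct locations
    n1 = sum(1 for c in per_location.values() if c == 1)   # N1: locations with 1 read
    n2 = sum(1 for c in per_location.values() if c == 2)   # N2: locations with 2 reads
    return {
        "NRF": distinct / total_reads,
        "PBC1": n1 / distinct,
        "PBC2": n1 / n2 if n2 else float("inf"),
    }


def complexity_concerns(metrics):
    """Return the metrics that fall in the 'Concern' column of the table."""
    concerns = []
    if metrics["NRF"] < 0.8:
        concerns.append("NRF")
    if metrics["PBC1"] < 0.5:
        concerns.append("PBC1")
    if metrics["PBC2"] < 1:
        concerns.append("PBC2")
    return concerns


# Toy data: 8 reads over 5 distinct locations (3 singletons, 1 doubleton, 1 triple).
toy_reads = (
    [("chr1", 100, "+")] * 3
    + [("chr1", 200, "+")] * 2
    + [("chr1", 300, "-"), ("chr2", 50, "+"), ("chr2", 90, "-")]
)
metrics = library_complexity(toy_reads)
print(metrics, complexity_concerns(metrics))
```

On real data these values come from the pipeline's QC output rather than being recomputed; the sketch is only meant to make the definitions concrete.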

Read Depth Requirements

Target Type	Minimum per Replicate	Recommended	Notes
Transcription factor	10M uniquely mapped	20M	Narrow peaks, need depth for detection
Broad histone mark (H3K27me3, H3K9me3, H3K36me3)	20M uniquely mapped	45M	Broad domains require more reads
Narrow histone mark (H3K4me3, H3K27ac)	20M uniquely mapped	20M	Sharp peaks, similar to TF
Input/IgG control	10M uniquely mapped	Match IP depth	Should match or exceed IP library depth
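The depth table above can be encoded as a small lookup. This is a sketch only: the dictionary keys (`tf`, `broad_histone`, `narrow_histone`), the function names, and the three status labels are assumptions chosen for illustration, while the read counts are taken from the table.

```python
# (minimum, recommended) uniquely mapped reads per replicate, from the table above.
DEPTH_REQUIREMENTS = {
    "tf": (10_000_000, 20_000_000),
    "broad_histone": (20_000_000, 45_000_000),
    "narrow_histone": (20_000_000, 20_000_000),
}

CONTROL_MINIMUM = 10_000_000  # minimum for input/IgG controls


def depth_status(target_type: str, uniquely_mapped_reads: int) -> str:
    """Classify a replicate's depth against the minimum and recommended values."""
    minimum, recommended = DEPTH_REQUIREMENTS[target_type]
    if uniquely_mapped_reads < minimum:
        return "below minimum"
    if uniquely_mapped_reads < recommended:
        return "meets minimum"
    return "meets recommended"


def control_status(control_reads: int, ip_reads: int) -> str:
    """Controls need at least 10M reads and should match or exceed IP depth."""
    if control_reads < CONTROL_MINIMUM:
        return "below minimum"
    if control_reads < ip_reads:
        return "meets minimum"
    return "meets recommended"


print(depth_status("tf", 15_000_000))             # between minimum and recommended
print(depth_status("broad_histone", 18_000_000))  # under the 20M minimum
print(control_status(12_000_000, 25_000_000))     # adequate but shallower than the IP
```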

IDR Analysis (Li et al. 2011)

The Irreproducible Discovery Rate provides principled assessment of replicate concordance:

IDR Comparison	Expected	Concern	Interpretation
Nt (true replicates)	≥50% of Np	<50% Np	Low concordance between biological replicates
Np (pooled pseudoreplicates)	Reference set	—	Represents total discoverable peaks
Self-consistency (Ns)	≥50% of Np	<50% Np	Individual replicate quality
Rescue ratio (max(Np,Nt)/min(Np,Nt))	<2	>2	High ratio = true-replicate and pooled-pseudoreplicate peak sets differ widely in size, indicating poor replicate consistency

Key insight: IDR thresholded peaks are those that pass replicate concordance analysis. Pseudoreplicated peaks are a single-replicate fallback and carry lower confidence. Optimal IDR peaks, which draw on pooled data, give the most complete peak set.
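The replicate-consistency check can be sketched as two ratios computed from IDR peak counts. This sketch assumes the max/min form of the rescue and self-consistency ratios described in the ENCODE ChIP-seq pipeline documentation, with 2 as the cutoff. The function name, argument names, peak counts, and the pass/borderline/fail labels are illustrative assumptions.

```python
def idr_consistency(n_true: int, n_pooled: int, n_self_1: int, n_self_2: int) -> dict:
    """Summarize replicate consistency from IDR-thresholded peak counts.

    n_true:   peaks from comparing the true biological replicates (Nt)
    n_pooled: peaks from pooled pseudoreplicates (Np)
    n_self_1, n_self_2: self-pseudoreplicate peaks for each replicate
    """
    if min(n_true, n_pooled, n_self_1, n_self_2) <= 0:
        raise ValueError("all peak counts must be positive")
    rescue = max(n_pooled, n_true) / min(n_pooled, n_true)
    self_consistency = max(n_self_1, n_self_2) / min(n_self_1, n_self_2)
    # Assumed convention: both ratios over 2 fails, exactly one is borderline.
    n_over = sum(ratio > 2 for ratio in (rescue, self_consistency))
    verdict = ("pass", "borderline", "fail")[n_over]
    return {
        "rescue_ratio": rescue,
        "self_consistency_ratio": self_consistency,
        "verdict": verdict,
    }


# Hypothetical peak counts for illustration only.
result = idr_consistency(n_true=30_000, n_pooled=45_000, n_self_1=40_000, n_self_2=15_000)
print(result)
```

In this toy case the rescue ratio is acceptable but the two replicates yield very different numbers of self-consistent peaks, which points at one replicate being much weaker than the other.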

Antibody Validation

ENCODE requires characterization for every antibody:

Primary: immunoblot (western) or immunofluorescence showing that the antibody recognizes the intended target
Secondary: At least one of: knockdown/knockout, IP followed by mass spectrometry, motif enrichment, genomic annotation enrichment
Check the antibody_lot_reviews field in
