Search ENCODE Data
When to Use
- User wants to find ENCODE experiments matching specific criteria (assay, organ, cell type, target)
- User asks "what ENCODE data exists for [tissue/target/assay]?"
- User wants to explore available data before downloading
- User needs to find specific file types (BED, BAM, bigWig) across experiments
- User wants to know how many experiments exist for a condition
- User asks about available assays, organisms, or biosamples in ENCODE
Help the user find ENCODE experiments and files. Use the appropriate tools based on what they need.
Search Strategy
- Finding experiments: Use encode_search_experiments with filters:
  - assay_title: "Histone ChIP-seq", "ATAC-seq", "RNA-seq", "TF ChIP-seq", "Hi-C", "CUT&RUN", "WGBS", etc.
  - organ: "pancreas", "brain", "liver", "heart", "kidney", "lung", etc.
  - biosample_type: "tissue", "cell line", "primary cell", "organoid"
  - biosample_term_name: specific name like "GM12878", "HepG2", "K562"
  - target: ChIP/CUT&RUN target like "H3K27me3", "H3K4me3", "CTCF", "EP300"
  - organism: "Homo sapiens" (default) or "Mus musculus"
- Finding files across experiments: Use encode_search_files when the user wants specific file types from multiple experiments.
- Exploring available data: Use encode_get_facets to see counts of what exists before searching. Use encode_get_metadata to list valid filter values.
- Getting experiment details: Use encode_get_experiment for full metadata on a single experiment. Use encode_list_files to see all files for one experiment.
Search Strategy Guide
Effective ENCODE searching follows a three-phase pattern: explore, validate, then search and refine. Jumping straight to a filtered search often produces empty results or misses relevant data.
Phase 1: Explore with Facets
Always start with encode_get_facets to understand what data exists. Facets return counts per filter value, so you can see immediately whether your target organ, assay, or biosample has data.
encode_get_facets(organ="pancreas")
-> Shows: Histone ChIP-seq (42), ATAC-seq (8), RNA-seq (15), TF ChIP-seq (6), ...
-> Also shows: biosample types, life stages, labs, replication types
This avoids the frustrating pattern of searching for data that does not exist. Facets may also reveal data you did not expect -- for example, CUT&RUN data where you only anticipated ChIP-seq, or organoid samples alongside tissue.
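As a minimal sketch of how facet counts can guide the next step, the helper below ranks the values of one facet field by count. The dictionary shape (`{field: {value: count}}`) and the name `rank_facet` are assumptions made for illustration; the structure actually returned by encode_get_facets may differ.

```python
def rank_facet(facets, field, min_count=1):
    """Return (value, count) pairs for one facet field, most frequent first.

    `facets` is assumed to look like {"assay_title": {"ATAC-seq": 8, ...}}.
    This shape is hypothetical, not the documented tool output.
    """
    counts = facets.get(field, {})
    kept = [(value, n) for value, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically for stable output.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Counts copied from the pancreas example above.
pancreas_facets = {
    "assay_title": {
        "Histone ChIP-seq": 42,
        "ATAC-seq": 8,
        "RNA-seq": 15,
        "TF ChIP-seq": 6,
    }
}

for value, count in rank_facet(pancreas_facets, "assay_title"):
    print(f"{value}: {count}")
```

A field that is missing from the facets yields an empty list, which is the signal to drop or change that filter before searching.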
Phase 2: Validate Filter Values
Before searching, confirm that your filter values match ENCODE's controlled vocabulary. A mistyped assay name returns zero results with no error.
encode_get_metadata(metadata_type="assays")
-> Returns all valid assay_title values: "Histone ChIP-seq", "TF ChIP-seq", "ATAC-seq", ...
Available metadata types: assays, organisms, organs, biosample_types, file_formats, output_types, output_categories, assemblies, life_stages, replication_types, statuses, file_statuses.
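Because a mistyped value fails silently, it can help to check a value against the vocabulary before searching. The sketch below is one possible approach, not part of the toolkit: `check_filter_value` is a hypothetical helper, and the list of valid values is passed in by hand where a real workflow would take it from encode_get_metadata.

```python
import difflib


def check_filter_value(value, valid_values):
    """Return (is_valid, suggestions) for a candidate filter value.

    `valid_values` stands in for the list returned by encode_get_metadata;
    it is passed in explicitly so the sketch runs offline.
    """
    if value in valid_values:
        return True, []
    # Try a case-insensitive exact match first, then fuzzy matches.
    lowered = {v.lower(): v for v in valid_values}
    if value.lower() in lowered:
        return False, [lowered[value.lower()]]
    close = difflib.get_close_matches(value.lower(), list(lowered), n=3, cutoff=0.6)
    return False, [lowered[c] for c in close]


assays = ["Histone ChIP-seq", "TF ChIP-seq", "ATAC-seq"]
print(check_filter_value("ATAC-seq", assays))
print(check_filter_value("atac-seq", assays))
print(check_filter_value("ATAC", assays))
```

Fuzzy matching only catches near-misses in spelling. It will not map a synonym such as "Bisulfite-seq" to "WGBS", so the metadata listing remains the authoritative check.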
Phase 3: Search and Refine
Start with broad filters and add constraints one at a time. If a search returns too many results (>100), add a filter. If it returns zero, remove the most restrictive filter first.
# Too broad: 2,400 results
encode_search_experiments(assay_title="Histone ChIP-seq")
# Add organ: 42 results
encode_search_experiments(assay_title="Histone ChIP-seq", organ="pancreas")
# Add target: 6 results
encode_search_experiments(assay_title="Histone ChIP-seq", organ="pancreas", target="H3K27ac")
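
The narrowing loop above can be sketched in Python as follows. This is an illustration under stated assumptions: `refine` and `fake_search` are hypothetical names, `search` is any callable that accepts keyword filters and returns a list, and the in-memory records are toy data rather than real ENCODE results. A filter that would produce zero results is skipped, which is one simple reading of "remove the most restrictive filter first".

```python
def refine(search, base_filters, extra_filters, max_results=100):
    """Add filters one at a time until the result count is manageable.

    `search` would wrap encode_search_experiments in practice (an assumption).
    `extra_filters` is an ordered list of (name, value) pairs to try.
    """
    filters = dict(base_filters)
    results = search(**filters)
    for name, value in extra_filters:
        if len(results) <= max_results:
            break  # already narrow enough
        trial = dict(filters, **{name: value})
        trial_results = search(**trial)
        if not trial_results:
            continue  # too restrictive: skip this filter and keep going
        filters, results = trial, trial_results
    return filters, results


# Toy in-memory stand-in for the real search tool.
FAKE_EXPERIMENTS = (
    [{"assay_title": "Histone ChIP-seq", "organ": "liver", "target": "H3K4me3"}] * 150
    + [{"assay_title": "Histone ChIP-seq", "organ": "pancreas", "target": "H3K27ac"}] * 6
    + [{"assay_title": "Histone ChIP-seq", "organ": "pancreas", "target": "H3K4me3"}] * 36
)


def fake_search(**filters):
    return [e for e in FAKE_EXPERIMENTS
            if all(e.get(k) == v for k, v in filters.items())]


filters, results = refine(
    fake_search,
    {"assay_title": "Histone ChIP-seq"},
    [("organ", "pancreas"), ("target", "H3K27ac")],
)
print(filters, len(results))
```

Ordering `extra_filters` from broadest to most specific keeps the loop from discarding useful data too early.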
Pitfalls & Edge Cases
- Wrong assay_title values: Assay names must match ENCODE's controlled vocabulary exactly. Run encode_get_metadata(metadata_type="assays") first to discover valid values. For example, use "Histone ChIP-seq" not "ChIP-seq" or "H3K27ac ChIP".
- Confusing biosample_term_name vs organ: organ is a broad anatomical system (e.g., "pancreas", "brain"). biosample_term_name is a specific cell or tissue name (e.g., "GM12878", "islet of Langerhans"). Use organ for tissue-level exploration, biosample_term_name when you know the exact biosample.
- Not exploring first: Always call encode_get_facets before searching to see what data exists. This avoids empty results and reveals unexpected data availability. For example, facets may show CUT&RUN data exists for your organ when you only expected ChIP-seq.
- Mixing organisms: Human and mouse experiments use different assemblies (GRCh38 vs mm10) and cannot be directly compared. Always filter by organism to avoid mixing species in results.
- Expecting file-level results from experiment search: encode_search_experiments returns experiments, not individual files. If the user wants specific BED or bigWig files, use encode_search_files instead with file_format and output_type filters.
- Searching for deprecated data: The default status="released" is correct for most use cases. Archived or revoked experiments may have known quality issues. Only change status if the user explicitly needs historical data.
Gotchas
organ vs biosample_term_name vs biosample_type
These three filters address different levels of the biosample hierarchy. Using the wrong one produces unexpected results.
| Filter | What it means | Example values | When to use |
|---|---|---|---|
| organ | Broad anatomical system | "pancreas", "brain", "heart", "liver" | Exploring all data for an organ system |
| biosample_term_name | Exact biosample name | "GM12878", "K562", "islet of Langerhans", "HepG2" | You know the exact cell type or tissue name |
| biosample_type | Category of biosample | "tissue", "cell line", "primary cell", "organoid", "in vitro differentiated cells" | Filtering by how the sample was obtained |
Common mistake: using biosample_term_name="pancreas" when you mean organ="pancreas". The term name "pancreas" matches whole-pancreas tissue samples only, missing islets, acinar cells, and other pancreatic substructures that are classified under the pancreas organ.
assay_title Must Match Exactly
ENCODE uses a controlled vocabulary for assay names. Common mistakes:
| Wrong | Correct |
|---|---|
| "ChIP-seq" | "Histone ChIP-seq" or "TF ChIP-seq" |
| "H3K27ac ChIP" | "Histone ChIP-seq" (with target="H3K27ac") |
| "ATAC" | "ATAC-seq" |
| "DNase" | "DNase-seq" |
| "Bisulfite-seq" | "WGBS" |
| "single-cell RNA-seq" | "scRNA-seq" |
| "scATAC-seq" | "snATAC-seq" |
Always run encode_get_metadata(metadata_type="assays") to see valid values.
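For synonyms that spelling-based matching cannot catch, a small alias table is one option. The sketch below transcribes several rows of the table above into a lookup. `ASSAY_ALIASES` and `suggest_assay_titles` are hypothetical names, and the table is deliberately incomplete, so an empty result means "unknown alias" rather than "invalid assay".

```python
# Alias table transcribed from the "Wrong | Correct" rows above.
# Keys are lowercased so the lookup is case-insensitive.
ASSAY_ALIASES = {
    "chip-seq": ["Histone ChIP-seq", "TF ChIP-seq"],  # ambiguous: depends on target
    "atac": ["ATAC-seq"],
    "dnase": ["DNase-seq"],
    "bisulfite-seq": ["WGBS"],
    "scatac-seq": ["snATAC-seq"],
}


def suggest_assay_titles(name):
    """Return candidate assay_title values for a commonly mistyped name."""
    return ASSAY_ALIASES.get(name.strip().lower(), [])


print(suggest_assay_titles("ChIP-seq"))
print(suggest_assay_titles("Bisulfite-seq"))
```

When more than one candidate comes back, as with "ChIP-seq", the target decides: histone marks go with "Histone ChIP-seq", transcription factors with "TF ChIP-seq".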
target Names Are Case-Sensitive
Histone mark targets use a specific capitalization pattern. Common mistakes:
| Wrong | Correct |
|---|---|
| "h3k27ac" | "H3K27ac" |
| "H3K27AC" | "H3K27ac" |
| "H3K4Me3" | "H3K4me3" |
| "ctcf" | "CTCF" |
Pattern: H3K{number}{modification} where the modification is lowercase ("me3", "ac", "me1"). Transcription factor targets use all-uppercase names ("CTCF", "POLR2A", "EP300").
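The capitalization rule can be applied mechanically. The following is a minimal sketch, assuming that histone marks are lysine modifications written as H{histone}K{position}{modification} and that every other target is uppercased, as described above. `normalize_target` is a hypothetical helper, and the accepted modification list (me1 to me3, ac, ub) is an assumption that may not cover every ENCODE target.

```python
import re

# Lysine marks such as H3K27ac, H3K4me3, H4K20me1, H2AK119ub.
_HISTONE_MARK = re.compile(r"^h(\d[ab]?)k(\d+)(me[123]|ac|ub)$", re.IGNORECASE)


def normalize_target(name):
    """Return the target name with the capitalization pattern described above."""
    name = name.strip()
    match = _HISTONE_MARK.match(name)
    if match:
        histone, position, modification = match.groups()
        # Histone and "K" are uppercase; the modification is lowercase.
        return f"H{histone.upper()}K{position}{modification.lower()}"
    # Fallback: treat as a transcription factor and uppercase it.
    return name.upper()


for raw in ["h3k27ac", "H3K27AC", "H3K4Me3", "ctcf"]:
    print(raw, "->", normalize_target(raw))
```

The normalized value is only a best guess, so it is still worth confirming it appears in the facets for the search at hand.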
Experiment Status Meanings
| Status | Meaning | When to use |
|---|---|---|
| released | Passed ENCODE quality standards. Default and recommended. | Nearly all searches |
| archived | Superseded by newer experiment or has known limitations. Data still accessible but not recommended. | Historical analysis, reproducing old studies |
| revoked | Serious quality problems identified post-release. Should not be used for new analysis. | Only if investigating specific quality issues |
Common Filter Combinations
Ready-to-use filter combinations for common research questions:
| Research Question | Tool + Filters |
|---|---|
| All human heart data | encode_search_experiments(organ="heart") |
| Active enhancers in a tissue | encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="liver") |
| Active promoters in a tissue | encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K4me3", organ="liver") |
| Repressed chr |