Track ENCODE Experiments

When to Use

User wants to save/bookmark ENCODE experiments for later reference
User needs to build a collection of experiments for a project
User asks to "track", "save", or "bookmark" an experiment
User wants to manage citations and publications for ENCODE data
User needs to compare experiments for compatibility
User wants to export their experiment collection as CSV/TSV/JSON
User asks about data provenance (linking derived files to ENCODE sources)

Help the user manage their local collection of ENCODE experiments. This skill covers the full lifecycle of experiment management: discovery, tracking, annotation, citation, comparison, provenance, and export.

Tracking Capabilities

Track an experiment: Use encode_track_experiment to save experiment metadata, publications, and pipeline info locally.
- Automatically extracts GEO accessions and PMIDs from experiment metadata
- Fetches associated publications with authors, journal, DOI
- Stores 18 metadata fields per experiment (see Stored Metadata below)
- Idempotent: re-tracking the same accession updates metadata without creating duplicates
View tracked collection: Use encode_list_tracked to see all tracked experiments. Filter by assay, organism, or organ.
Get citations: Use encode_get_citations to export publication data in one of three formats:
- "json": Structured data
- "bibtex": For LaTeX/reference managers
- "ris": For Endnote, Zotero, Mendeley
Compare experiments: Use encode_compare_experiments to check if two experiments are compatible for combined analysis (same organism, assembly, assay, biosample, etc.).
Collection overview: Use encode_summarize_collection for grouped statistics across your tracked experiments.
Export data: Use encode_export_data to export tracked experiments as CSV, TSV, or JSON for use in R, pandas, Excel.
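
The compatibility check behind encode_compare_experiments can be pictured as a field-by-field comparison of two tracked records. The sketch below is only an illustration of that idea, not the tool's actual implementation: the function name compare_records, the report layout, and the choice of compared fields (borrowed from the stored metadata field names) are all assumptions.

```python
# Hypothetical sketch: the real encode_compare_experiments logic may differ.
# Field names are taken from the stored metadata; the selection is an assumption.
COMPARED_FIELDS = ("organism", "assembly", "assay_title", "biosample_summary")


def compare_records(a: dict, b: dict, fields=COMPARED_FIELDS) -> dict:
    """Report the fields on which two experiment records agree or differ."""
    matches = [f for f in fields if a.get(f) == b.get(f)]
    mismatches = {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}
    return {
        "compatible": not mismatches,  # compatible only if nothing differs
        "matches": matches,
        "mismatches": mismatches,
    }


# Two illustrative records that differ only in genome assembly.
exp_a = {
    "organism": "Homo sapiens",
    "assembly": "GRCh38",
    "assay_title": "Histone ChIP-seq",
    "biosample_summary": "pancreas tissue male adult (54 years)",
}
exp_b = dict(exp_a, assembly="hg19")

report = compare_records(exp_a, exp_b)
print(report["compatible"], report["mismatches"])
```

A mismatch on assembly, as in this example, is the kind of difference that would need resolving before the two datasets are analyzed together.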

Stored Metadata

When you track an experiment, the following fields are stored locally. Most are captured from the ENCODE Portal API; notes is supplied by you.

Field	Description	Example
`accession`	ENCODE accession (primary key)	ENCSR123ABC
`assay_title`	Assay type	Histone ChIP-seq
`target`	Antibody target (ChIP/eCLIP)	H3K27ac-human
`biosample_summary`	Full biosample description	pancreas tissue male adult (54 years)
`organism`	Species	Homo sapiens
`organ`	Organ or tissue of origin	pancreas
`biosample_type`	Biosample classification	tissue, primary cell, cell line
`status`	ENCODE release status	released
`date_released`	Portal release date	2020-07-15
`description`	Experiment description (from PI)	H3K27ac ChIP-seq on human pancreatic islets
`lab`	Submitting laboratory	/labs/bradley-bernstein/
`award`	Funding award	/awards/U01HG007610/
`assembly`	Genome assembly	GRCh38
`replication_type`	Replicate strategy	isogenic, anisogenic
`life_stage`	Developmental stage	adult, embryonic, child
`url`	ENCODE Portal URL	https://www.encodeproject.org/experiments/ENCSR123ABC/
`notes`	User-provided notes	H3K27ac reference for islet enhancer study
`raw_metadata`	Full JSON from API (up to 512KB)	(stored for future queries)

Additionally, the tracker stores timestamps (tracked_at, updated_at) for audit trail purposes.

SQLite Schema Overview

The tracker uses a local SQLite database with WAL journal mode and foreign keys enabled. The schema consists of six tables:

tracked_experiments -- One row per ENCODE experiment. The accession column is the primary key. Indexes on assay_title, organism, and organ for fast filtered queries.

publications -- Publications linked to experiments. Stores PMID, DOI, title, authors (first 10), journal, year, abstract. Unique constraint on (experiment_accession, pmid) prevents duplicates.

pipeline_info -- ENCODE uniform processing pipeline details. Stores pipeline title, version, software list (as JSON array), and analysis status.

quality_metrics -- Per-file quality metrics from ENCODE audits. Stores file accession, metric type, and metric data (as JSON).

derived_files -- User-created files derived from ENCODE data. Stores file path, source accessions (as JSON array), tool used, parameters, and description. This is the backbone of provenance tracking.

external_references -- Cross-database links. Stores reference type (pmid, doi, geo_accession, nct_id, biorxiv_doi, dbgap), reference ID, and description. Unique constraint on (experiment_accession, reference_type, reference_id).
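
To make the table descriptions concrete, here is a minimal sketch of how two of the six tables, the WAL and foreign-key settings, and idempotent tracking could be expressed with Python's built-in sqlite3 module. This is not the tracker's actual DDL: the column subset, the index name, and the helper names open_tracker and track are assumptions, and the PMID is a made-up placeholder. The upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

# Illustrative DDL only: a reduced column set, not the tracker's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_experiments (
    accession   TEXT PRIMARY KEY,
    assay_title TEXT,
    organism    TEXT,
    organ       TEXT,
    notes       TEXT
);
CREATE INDEX IF NOT EXISTS idx_tracked_assay
    ON tracked_experiments(assay_title);
CREATE TABLE IF NOT EXISTS publications (
    id                   INTEGER PRIMARY KEY,
    experiment_accession TEXT NOT NULL
        REFERENCES tracked_experiments(accession) ON DELETE CASCADE,
    pmid                 TEXT,
    title                TEXT,
    UNIQUE (experiment_accession, pmid)
);
"""


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with the settings described in the text."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # has no effect on in-memory databases
    conn.execute("PRAGMA foreign_keys=ON")
    conn.executescript(SCHEMA)
    return conn


def track(conn: sqlite3.Connection, accession: str, assay_title: str, notes: str) -> None:
    """Upsert: re-tracking an accession updates the row instead of duplicating it."""
    conn.execute(
        """INSERT INTO tracked_experiments (accession, assay_title, notes)
           VALUES (?, ?, ?)
           ON CONFLICT(accession) DO UPDATE SET
               assay_title = excluded.assay_title,
               notes = excluded.notes""",
        (accession, assay_title, notes),
    )


conn = open_tracker()
track(conn, "ENCSR123ABC", "Histone ChIP-seq", "first note")
track(conn, "ENCSR123ABC", "Histone ChIP-seq", "updated note")

# The unique constraint makes a repeated publication insert a no-op.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO publications (experiment_accession, pmid) VALUES (?, ?)",
        ("ENCSR123ABC", "12345678"),  # placeholder PMID
    )

rows = conn.execute("SELECT accession, notes FROM tracked_experiments").fetchall()
n_pubs = conn.execute("SELECT COUNT(*) FROM publications").fetchone()[0]
print(rows, n_pubs)
```

With foreign keys enabled, inserting a publication for an accession that is not tracked raises sqlite3.IntegrityError, which keeps orphaned rows out of the child tables.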

The database location is ~/.encode_connector/tracker.db (macOS/Linux) or %USERPROFILE%\.encode_connector\tracker.db (Windows). The directory is created automatically on first use.
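
The path rule above can be sketched with pathlib, whose Path.home() resolves to the home directory on macOS/Linux and to the user profile directory on Windows. The function name tracker_db_path and its optional home parameter are hypothetical, added here so the behavior can be tried without touching a real home directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def tracker_db_path(home: Optional[Path] = None) -> Path:
    """Return the tracker database path, creating its directory if needed."""
    base = (home or Path.home()) / ".encode_connector"
    base.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return base / "tracker.db"


# Try it against a throwaway directory instead of the real home directory.
with tempfile.TemporaryDirectory() as tmp:
    db = tracker_db_path(Path(tmp))
    created = db.parent.is_dir()
    relative = db.relative_to(tmp).as_posix()

print(created, relative)
```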

Data Provenance

Log derived files: Use encode_log_derived_file when the user creates files from ENCODE data (filtered peaks, merged signals, etc.).
View provenance: Use encode_get_provenance to trace derived files back to source ENCODE data.
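
Provenance tracking rests on the derived_files table, which stores source accessions as a JSON array. The following sketch shows one way such records could be written and traced back using only the standard library. It is an assumption-laden illustration rather than the behavior of encode_log_derived_file or encode_get_provenance: the column names, helper functions, file names, and tool strings are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE derived_files (
           id                INTEGER PRIMARY KEY,
           file_path         TEXT NOT NULL,
           source_accessions TEXT NOT NULL,  -- JSON array of ENCODE accessions
           tool_used         TEXT,
           parameters        TEXT,
           description       TEXT
       )"""
)


def log_derived_file(conn, file_path, source_accessions, tool_used,
                     parameters="", description=""):
    """Record a user-created file together with the accessions it came from."""
    conn.execute(
        "INSERT INTO derived_files "
        "(file_path, source_accessions, tool_used, parameters, description) "
        "VALUES (?, ?, ?, ?, ?)",
        (file_path, json.dumps(sorted(source_accessions)), tool_used,
         parameters, description),
    )


def files_derived_from(conn, accession):
    """List derived files whose source list contains the given accession."""
    rows = conn.execute(
        "SELECT file_path, source_accessions FROM derived_files ORDER BY id"
    )
    return [path for path, sources in rows if accession in json.loads(sources)]


# Placeholder file names and accessions, for illustration only.
log_derived_file(conn, "islet_enhancers.bed",
                 ["ENCSR123ABC", "ENCSR456DEF"], "bedtools intersect")
log_derived_file(conn, "h3k27ac_filtered.bed", ["ENCSR123ABC"], "awk")

print(files_derived_from(conn, "ENCSR456DEF"))
print(files_derived_from(conn, "ENCSR123ABC"))
```

Parsing the JSON in Python keeps the sketch independent of SQLite's optional JSON functions; a real implementation might query the array in SQL instead.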

Cross-References

Link external references: Use encode_link_reference to attach PubMed IDs, DOIs, GEO accessions, ClinicalTrials.gov NCT IDs, bioRxiv DOIs, or dbGaP identifiers to tracked experiments.
Get references: Use encode_get_references to retrieve linked external identifiers. These IDs can be passed to PubMed, bioRxiv, or ClinicalTrials MCP servers for further analysis.

Walkthrough 1: Building a Pancreatic Islet Epigenome Reference Collection

Goal: Curate a comprehensive set of histone ChIP-seq, ATAC-seq, and RNA-seq experiments from human pancreatic islets for enhancer analysis. The same workflow is the starting point for any tissue-specific integrative analysis.

Step 1: Discover what data exists

Before tracking anything, survey the landscape. Use facets to understand the breadth of available data for your tissue of interest.

encode_get_facets(facet_field="assay_title", organ="pancreas", organism="Homo sapiens")

Expected output (example):

Histone ChIP-seq: 15 experiments
ATAC-seq: 3 experiments
RNA-seq: 8 experiments
TF ChIP-seq: 4 experiments
WGBS: 2 experiments
DNase-seq: 1 experiment

This tells you that pancreatic tissue has strong histone ChIP-seq coverage (15 experiments across multiple marks), adequate ATAC-seq (3), and solid RNA-seq (8). The 2 WGBS experiments are a bonus for methylation analysis.

Step 2: Search for histone ChIP-seq experiments

Now retrieve the actual experiments. Focus on one assay type at a time to keep notes organized.

encode_search_experiments(assay_title="Histone ChIP-seq", organ="pancreas", organism="Homo sapiens")

Expected return: 15 experiments with targets including H3K27ac, H3K4me1, H3K4me3, H3K27me3, H3K36me3. Review the biosample summaries -- some may be whole pancreas tissue, others isolated islets, and others acinar or ductal cells. This distinction matters for enhancer analysis.

Step 3: Track each histone experiment with descriptive notes

Notes are your lab notebook. Record the histone mark, the specific biosample, and the intended analytical role. This context is invaluable weeks later when you revisit the collection.

encode_track_experiment(accession="ENCSR123ABC", notes="H3K27ac pancreatic islets - active enhancers and super-enhancers")
encode_track_experiment(accession="ENCSR456DEF", notes="H3K4me1 pancreatic islets - primed/poised enhancers")
encode_track_experiment(accession="ENCSR789GHI", notes="H3K4me3 pancreatic islets - active promoters, CpG islands")
encode_track_experiment(accession="ENCSR012JKL", notes="H3K27me3 pancreatic islets - Polycomb repression, bivalent domains")
encode_track_experiment(accession="ENCSR345MNO", notes="H3K36me3 pancreatic islets - gene body transcription elongation")

Why these five marks? Together they define the core chromatin states:

H3K27ac marks active enhancers and promoters (the
