Exact Provenance Tracking and Methods Writing
When to Use
- User wants to track the full analysis chain from ENCODE download through processing to publication figure
- User asks about "provenance", "reproducibility", "methods section", or "analysis log"
- User needs to log derived files with their processing parameters for audit trail
- User wants to auto-generate publication-ready methods text from logged analysis steps
- Example queries: "log my peak calling step", "generate a methods section from my analysis", "show the provenance chain for this figure"
Track every operation on ENCODE data with exact tool versions, reference files, scripts, parameters, and timestamps to enable publication-ready methods sections.
Scientific Rationale
The question: "What exactly was done to this data, and can someone else reproduce it identically?"
Reproducibility is the foundation of science. Yet the "Methods" sections of most genomics papers are vague — "reads were aligned with STAR" tells you nothing about which STAR version, which genome index, which parameters, or which annotation version was used. The difference between GENCODE v38 and v39 gene annotations can change thousands of gene assignments.
Comprehensive Provenance Standard
This skill implements a documentation standard where every operation records:
- Tool: Exact name and version (e.g.,
bedtools v2.31.0, not just "bedtools") - Reference files: Exact source, version, and download URL (e.g., "GRCh38.p14 chromosome sizes from UCSC, downloaded 2024-01-15")
- Parameters: Complete command or function call, not just key parameters
- Input: Exact file paths, accessions, and checksums
- Output: File path, description, and checksum
- Script: If a custom script was used, store the script alongside the output
- Timestamp: When the operation was performed
- Environment: R version, Python version, package versions, OS
This creates a complete audit trail such that a methods section can be auto-generated with zero ambiguity.
Why This Level of Detail Matters
Consider a simple liftover operation. A vague log says "coordinates were lifted from hg19 to hg38." A comprehensive provenance log says:
"Genomic coordinates were lifted from GRCh37/hg19 to GRCh38/hg38 using UCSC liftOver (v377, Kent et al. 2002, PMID: 12045153). The chain file hg19ToHg38.over.chain.gz was obtained from UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/, accessed 2024-01-15, MD5: abc123...). Of 45,231 input regions, 44,892 (99.25%) were successfully converted; 339 regions (0.75%) failed to map and were excluded. Unmapped regions were logged to unmapped.bed."
The second version can be reproduced exactly. The first cannot.
Step 1: Initialize Experiment Log
At the start of any analysis session, create an experiment log:
Log Structure
project_dir/
├── experiment_log.json # Machine-readable provenance log
├── scripts/ # All scripts used in this analysis
│ ├── 001_download.sh
│ ├── 002_filter_peaks.sh
│ └── 003_merge_samples.R
├── reference_files/ # Reference files used (or symlinks)
│ ├── GRCh38.chrom.sizes
│ └── gencode.v44.annotation.gtf
├── data/ # ENCODE downloads
│ └── (organized by experiment)
├── derived/ # All derived files
│ ├── filtered_peaks/
│ └── merged_results/
└── methods/ # Auto-generated methods text
└── methods_draft.md
Experiment Log Format (experiment_log.json)
{
"project": "H3K27ac analysis in human pancreas",
"created": "2024-01-15T10:30:00Z",
"analyst": "Dr. A. Mawla",
"organism": "Homo sapiens",
"assembly": "GRCh38",
"gene_annotation": "GENCODE v44",
"operations": [],
"encode_experiments": [],
"software_environment": {},
"reference_files": []
}
Track Source ENCODE Experiments
encode_track_experiment(accession="ENCSR...", notes="Experiment log entry")
For each experiment, record in the log:
| Field | Example | Source |
|---|---|---|
| Accession | ENCSR133RZO | ENCODE portal |
| Assay | Histone ChIP-seq | encode_get_experiment |
| Target | H3K27ac | encode_get_experiment |
| Biosample | pancreas tissue | encode_get_experiment |
| Lab | Bing Ren, UCSD | encode_get_experiment |
| Replicates | 2 biological | encode_get_experiment |
| Sequencer | Illumina HiSeq 4000 | encode_get_experiment |
| Read length | 76bp PE | encode_get_experiment |
| Read count | 42.3M per rep | File metadata |
| Library | TruSeq ChIP | encode_get_experiment |
| Batch/date | 2019-06-15 | encode_get_experiment |
Step 2: Log Every Operation
Operation Log Entry Format
Every processing step creates a log entry with these fields:
{
"operation_id": "op_003",
"timestamp": "2024-01-15T14:22:00Z",
"description": "Filter H3K27ac peaks by signalValue",
"category": "filtering",
"tool": {
"name": "bedtools",
"version": "2.31.0",
"citation": "Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033"
},
"command": "awk '$7 >= 4.5' ENCFF123ABC.bed | bedtools intersect -a stdin -b blacklist.bed -v > filtered_peaks.bed",
"script_path": "scripts/002_filter_peaks.sh",
"inputs": [
{
"file": "ENCFF123ABC.bed",
"accession": "ENCFF123ABC",
"type": "IDR thresholded peaks",
"md5": "abc123..."
}
],
"reference_files": [
{
"file": "hg38-blacklist.v2.bed.gz",
"source": "ENCODE Blacklist v2 (Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z)",
"url": "https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz",
"md5": "def456..."
}
],
"parameters": {
"signalValue_threshold": 4.5,
"blacklist_filter": "exclude overlapping regions"
},
"outputs": [
{
"file": "derived/filtered_peaks/H3K27ac_pancreas_filtered.bed",
"description": "H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
"regions_count": 34521,
"md5": "ghi789..."
}
],
"statistics": {
"input_regions": 45231,
"output_regions": 34521,
"filtered_out": 10710,
"filter_rate": "23.7%"
}
}
Common Operations to Log
Downloading ENCODE Files
encode_download_files(
file_accessions=["ENCFF..."],
download_dir="/path/to/data/",
organize_by="experiment",
verify_md5=True
)
Log: file accession, download URL, MD5 verification result, file size, download timestamp.
Genome Coordinate Liftover
Log: liftOver version, chain file (source URL, date accessed), input count, output count, unmapped count, unmapped file location.
Peak Filtering
Log: filter criteria (signalValue threshold, p-value cutoff), blacklist used and version, input/output region counts, what was removed.
Merging/Union Operations
Log: merge tool + version, merge distance parameter, input files (all accessions), sample tagging method, output count, overlap statistics.
R/Bioconductor Analysis
{
"tool": {
"name": "DESeq2",
"version": "1.42.0",
"r_version": "4.3.2",
"bioconductor_version": "3.18",
"citation": "Love et al. 2014, Genome Biology, DOI:10.1186/s13059-014-0550-8"
},
"command": "DESeq2::results(dds, contrast=c('condition','treated','control'), alpha=0.05)",
"parameters": {
"design_formula": "~ batch + condition",
"contrast": ["condition", "treated", "control"],
"alpha": 0.05,
"lfcThreshold": 0
}
}
Python Analysis
{
"tool": {
"name": "scanpy",
"version": "1.9.6",
"python_version": "3.11.5",
"anndata_version": "0.10.3",
"citation": "Wolf et al. 2018, Genome Biology, DOI:10.1186/s13059-017-1382-0"
}
}
Step 3: Record Software Environment
At the start of each analysis, capture the full environment:
R Environment
sessionInfo()
# Or more detailed:
devtools::session_in