Exact Provenance Tracking and Methods Writing

When to Use

User wants to track the full analysis chain from ENCODE download through processing to publication figure
User asks about "provenance", "reproducibility", "methods section", or "analysis log"
User needs to log derived files with their processing parameters for audit trail
User wants to auto-generate publication-ready methods text from logged analysis steps
Example queries: "log my peak calling step", "generate a methods section from my analysis", "show the provenance chain for this figure"

Track every operation on ENCODE data with exact tool versions, reference files, scripts, parameters, and timestamps to enable publication-ready methods sections.

Scientific Rationale

The question: "What exactly was done to this data, and can someone else reproduce it identically?"

Reproducibility is the foundation of science. Yet the "Methods" sections of most genomics papers are vague — "reads were aligned with STAR" tells you nothing about which STAR version, which genome index, which parameters, or which annotation version was used. The difference between GENCODE v38 and v39 gene annotations can change thousands of gene assignments.

Comprehensive Provenance Standard

This skill implements a documentation standard where every operation records:

Tool: Exact name and version (e.g., bedtools v2.31.0, not just "bedtools")
Reference files: Exact source, version, and download URL (e.g., "GRCh38.p14 chromosome sizes from UCSC, downloaded 2024-01-15")
Parameters: Complete command or function call, not just key parameters
Input: Exact file paths, accessions, and checksums
Output: File path, description, and checksum
Script: If a custom script was used, store the script alongside the output
Timestamp: When the operation was performed
Environment: R version, Python version, package versions, OS

This creates a complete audit trail such that a methods section can be auto-generated with zero ambiguity.

Why This Level of Detail Matters

Consider a simple liftover operation. A vague log says "coordinates were lifted from hg19 to hg38." A comprehensive provenance log says:

"Genomic coordinates were lifted from GRCh37/hg19 to GRCh38/hg38 using UCSC liftOver (v377, Kent et al. 2002, PMID: 12045153). The chain file hg19ToHg38.over.chain.gz was obtained from UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/, accessed 2024-01-15, MD5: abc123...). Of 45,231 input regions, 44,892 (99.25%) were successfully converted; 339 regions (0.75%) failed to map and were excluded. Unmapped regions were logged to unmapped.bed."

The second version can be reproduced exactly. The first cannot.

Step 1: Initialize Experiment Log

At the start of any analysis session, create an experiment log:

Log Structure

project_dir/
├── experiment_log.json          # Machine-readable provenance log
├── scripts/                     # All scripts used in this analysis
│   ├── 001_download.sh
│   ├── 002_filter_peaks.sh
│   └── 003_merge_samples.R
├── reference_files/             # Reference files used (or symlinks)
│   ├── GRCh38.chrom.sizes
│   └── gencode.v44.annotation.gtf
├── data/                        # ENCODE downloads
│   └── (organized by experiment)
├── derived/                     # All derived files
│   ├── filtered_peaks/
│   └── merged_results/
└── methods/                     # Auto-generated methods text
    └── methods_draft.md

Experiment Log Format (experiment_log.json)

{
  "project": "H3K27ac analysis in human pancreas",
  "created": "2024-01-15T10:30:00Z",
  "analyst": "Dr. A. Mawla",
  "organism": "Homo sapiens",
  "assembly": "GRCh38",
  "gene_annotation": "GENCODE v44",
  "operations": [],
  "encode_experiments": [],
  "software_environment": {},
  "reference_files": []
}

Track Source ENCODE Experiments

encode_track_experiment(accession="ENCSR...", notes="Experiment log entry")

For each experiment, record in the log:

Field	Example	Source
Accession	ENCSR133RZO	ENCODE portal
Assay	Histone ChIP-seq	encode_get_experiment
Target	H3K27ac	encode_get_experiment
Biosample	pancreas tissue	encode_get_experiment
Lab	Bing Ren, UCSD	encode_get_experiment
Replicates	2 biological	encode_get_experiment
Sequencer	Illumina HiSeq 4000	encode_get_experiment
Read length	76bp PE	encode_get_experiment
Read count	42.3M per rep	File metadata
Library	TruSeq ChIP	encode_get_experiment
Batch/date	2019-06-15	encode_get_experiment

Step 2: Log Every Operation

Operation Log Entry Format

Every processing step creates a log entry with these fields:

{
  "operation_id": "op_003",
  "timestamp": "2024-01-15T14:22:00Z",
  "description": "Filter H3K27ac peaks by signalValue",
  "category": "filtering",
  "tool": {
    "name": "bedtools",
    "version": "2.31.0",
    "citation": "Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033"
  },
  "command": "awk '$7 >= 4.5' ENCFF123ABC.bed | bedtools intersect -a stdin -b blacklist.bed -v > filtered_peaks.bed",
  "script_path": "scripts/002_filter_peaks.sh",
  "inputs": [
    {
      "file": "ENCFF123ABC.bed",
      "accession": "ENCFF123ABC",
      "type": "IDR thresholded peaks",
      "md5": "abc123..."
    }
  ],
  "reference_files": [
    {
      "file": "hg38-blacklist.v2.bed.gz",
      "source": "ENCODE Blacklist v2 (Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z)",
      "url": "https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz",
      "md5": "def456..."
    }
  ],
  "parameters": {
    "signalValue_threshold": 4.5,
    "blacklist_filter": "exclude overlapping regions"
  },
  "outputs": [
    {
      "file": "derived/filtered_peaks/H3K27ac_pancreas_filtered.bed",
      "description": "H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
      "regions_count": 34521,
      "md5": "ghi789..."
    }
  ],
  "statistics": {
    "input_regions": 45231,
    "output_regions": 34521,
    "filtered_out": 10710,
    "filter_rate": "23.7%"
  }
}

Common Operations to Log

Downloading ENCODE Files

encode_download_files(
    file_accessions=["ENCFF..."],
    download_dir="/path/to/data/",
    organize_by="experiment",
    verify_md5=True
)

Log: file accession, download URL, MD5 verification result, file size, download timestamp.

Genome Coordinate Liftover

Log: liftOver version, chain file (source URL, date accessed), input count, output count, unmapped count, unmapped file location.

Peak Filtering

Log: filter criteria (signalValue threshold, p-value cutoff), blacklist used and version, input/output region counts, what was removed.

Merging/Union Operations

Log: merge tool + version, merge distance parameter, input files (all accessions), sample tagging method, output count, overlap statistics.

R/Bioconductor Analysis

{
  "tool": {
    "name": "DESeq2",
    "version": "1.42.0",
    "r_version": "4.3.2",
    "bioconductor_version": "3.18",
    "citation": "Love et al. 2014, Genome Biology, DOI:10.1186/s13059-014-0550-8"
  },
  "command": "DESeq2::results(dds, contrast=c('condition','treated','control'), alpha=0.05)",
  "parameters": {
    "design_formula": "~ batch + condition",
    "contrast": ["condition", "treated", "control"],
    "alpha": 0.05,
    "lfcThreshold": 0
  }
}

Python Analysis

{
  "tool": {
    "name": "scanpy",
    "version": "1.9.6",
    "python_version": "3.11.5",
    "anndata_version": "0.10.3",
    "citation": "Wolf et al. 2018, Genome Biology, DOI:10.1186/s13059-017-1382-0"
  }
}

Step 3: Record Software Environment

At the start of each analysis, capture the full environment:

R Environment

sessionInfo()
# Or more detailed:
devtools::session_in

data-provenance

How to add

Drop this on your repo README

Related skills

xlsx

mem-search

weekly-digests

how-it-works

Get new Dados e Análise skills every Monday