Scientific Writing from ENCODE Provenance
Generate publication-quality scientific writing from ENCODE analysis records. This skill integrates with data-provenance and cite-encode to auto-generate methods from logged pipeline runs. Every generated section follows rigorous scientific documentation standards -- complete reporting of all experimental and computational parameters with zero ambiguity.
When to Use
- User wants to write publication-ready methods sections, figure legends, or data availability statements
- User asks about "methods section", "figure legend", "scientific writing", or "manuscript preparation"
- User needs to auto-generate methods text from logged provenance/analysis steps
- User wants templates for supplementary tables, Key Resources Tables, or tool citation formatting
- Example queries: "write a methods section for my ChIP-seq analysis", "generate a figure legend for my heatmap", "format my data availability statement"
Overview
Most methods sections in genomics papers are incomplete. They omit software versions, skip reference file details, conflate technical and biological replicates, and use phrases like "default parameters" without stating what those defaults are. Reviewers catch these omissions, and readers cannot reproduce the analysis.
This skill solves the problem by generating methods text directly from the provenance chain. When every processing step has been logged (via data-provenance), the methods section writes itself. When metadata has been captured from ENCODE (via track-experiments), the experimental details are already recorded. This skill assembles these records into publication-ready prose, figure legends, supplementary tables, and data availability statements.
Complete, unambiguous reporting is not aspirational -- it is the minimum bar for reproducible science.
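As a rough illustration of how logged steps could become prose, the sketch below renders one sentence per provenance record. The `ProvenanceStep` layout and its field names are hypothetical, not the actual data-provenance schema, and the version string in the usage note is only an example value.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStep:
    """One logged processing step (hypothetical record layout)."""
    action: str                    # what was done, e.g. "aligned to GRCh38"
    tool: str                      # tool name as it should appear in the text
    version: str                   # exact version string from the log
    parameters: dict = field(default_factory=dict)  # flag -> value ("" for bare flags)


def step_to_sentence(step: ProvenanceStep) -> str:
    """Render one step as a methods sentence, refusing to omit the version."""
    if not step.version:
        raise ValueError(f"{step.tool}: version missing from provenance record")
    sentence = f"Reads were {step.action} using {step.tool} (v{step.version})"
    if step.parameters:
        # Spell out every flag instead of writing "default parameters".
        flags = " ".join(f"{k} {v}".strip() for k, v in step.parameters.items())
        sentence += f" with parameters {flags}"
    return sentence + "."


def steps_to_paragraph(steps) -> str:
    """Join the per-step sentences in logged order."""
    return " ".join(step_to_sentence(s) for s in steps)
```

For example, `step_to_sentence(ProvenanceStep("aligned to GRCh38", "Bowtie2", "2.5.1", {"--very-sensitive": "", "-X": 2000}))` yields `Reads were aligned to GRCh38 using Bowtie2 (v2.5.1) with parameters --very-sensitive -X 2000.` Raising on a missing version is a deliberate choice: a gap in the log should stop generation rather than produce an incomplete sentence.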
Scientific Documentation Standards -- Required Metadata
Every methods section MUST report the following fields. Omitting any of them leaves the section incomplete: reviewers will flag it, and readers will be unable to reproduce the analysis.
| Field | Example | Why Required |
|---|---|---|
| Library preparation | TruSeq ChIP | Affects fragment size distribution and GC bias |
| Biological replicates | n=2 per condition | Statistical power and reproducibility |
| Cells/nuclei per replicate | 50,000 cells | Input sufficiency for the assay |
| Sequencing reads | 30M paired-end | Coverage depth determines sensitivity |
| Read length | 2x150 bp | Alignment accuracy and mappability |
| Paired/single-end | Paired-end | Fragment size estimation, structural variants |
| Sequencer | NovaSeq 6000 | Quality profile, error model, quality-score binning |
| Lab/batch | Snyder Lab, Stanford | Batch effect awareness |
| Reference genome | GRCh38/hg38 | Coordinate system for all downstream analysis |
| Gene annotation | GENCODE v44 | Gene definitions change between versions |
| ENCODE accessions | ENCSR133RZO | Exact data provenance for reproducibility |
| Blacklist version | ENCODE Blacklist v2 | Artifact exclusion affects all peak-based analyses |
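One way to enforce the table above is a completeness check that runs before any text is generated. This is a minimal sketch: the key names in `REQUIRED_FIELDS` are hypothetical labels chosen to mirror the table rows, not field names returned by any ENCODE tool.

```python
# Hypothetical keys, one per row of the required-metadata table.
REQUIRED_FIELDS = [
    "library_preparation",
    "biological_replicates",
    "cells_per_replicate",
    "sequencing_reads",
    "read_length",
    "read_layout",          # paired-end or single-end
    "sequencer",
    "lab",
    "reference_genome",
    "gene_annotation",
    "encode_accessions",
    "blacklist_version",
]


def missing_fields(metadata: dict) -> list:
    """Return required keys that are absent or empty, in table order."""
    empty_values = (None, "", [], {})
    return [key for key in REQUIRED_FIELDS if metadata.get(key) in empty_values]
```

A caller might refuse to emit a methods section while `missing_fields(record)` is non-empty and report the missing keys to the user instead.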
How to Populate These Fields
# Track the experiment to capture metadata
encode_track_experiment(accession="ENCSR...", fetch_publications=True)
# Get full experiment details
encode_get_experiment(accession="ENCSR...")
# Get file-level metadata
encode_get_file_info(accession="ENCFF...")
# Get provenance for derived files
encode_get_provenance(file_path="/path/to/derived/file.bed")
Methods Section Templates
Each template below is a fill-in-the-blank paragraph that reads like a real methods section. Bracketed fields [like this] are populated from ENCODE metadata and provenance records. Every template follows these documentation standards.
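The bracketed-field convention can be filled mechanically. The sketch below is one possible approach, assuming a flat dictionary keyed by the exact text inside the brackets; `fill_template` is a hypothetical helper, not part of this skill's tooling.

```python
import re

# Matches one bracketed field such as [target] or [sequencer model].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: dict):
    """Substitute known fields; leave unknown ones in place and report them."""
    unfilled = []

    def substitute(match):
        name = match.group(1)
        if name in values:
            return str(values[name])
        unfilled.append(name)      # keep the gap visible instead of guessing
        return match.group(0)

    return PLACEHOLDER.sub(substitute, template), unfilled
```

Leaving unresolved fields bracketed keeps gaps visible in the draft. Note that a flat dictionary cannot distinguish the repeated `v[version]` fields, which belong to different tools; a real implementation would need per-tool keys or positional filling.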
ChIP-seq Methods
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for
[target] in [biosample] were obtained from the ENCODE Project (ENCODE
Project Consortium 2020) under accession [ENCSR accession]. [Library
preparation method] libraries were prepared from [number] biological
replicates ([cells/nuclei] per replicate) and sequenced on an Illumina
[sequencer model] to generate [read count]M [paired-end/single-end]
reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
trimmed with Trim Galore (v[version]; Krueger 2015) to remove adapter
sequences and low-quality bases (Phred < 20). Trimmed reads were aligned
to the [organism] reference genome ([assembly]) using BWA-MEM (v[version];
Li 2013) with parameters [parameters; list explicitly, even if defaults
were used]. Duplicate reads were marked and removed
using Picard MarkDuplicates (v[version]; Broad Institute). Reads with
mapping quality < 30 were excluded using samtools (v[version]; Danecek
et al. 2021). Reads mapping to ENCODE Blacklist v2 regions (Amemiya et al.
2019) were removed using bedtools intersect (v[version]; Quinlan & Hall
2010).
Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters [--broad for broad marks / -q 0.05 for narrow marks]. For
narrow-peak targets, IDR analysis (Li et al. 2011) was performed on
replicate peak sets with a threshold of [0.05]. Signal tracks (fold
change over control) were generated using MACS2 bdgcmp and converted
to bigWig format using bedGraphToBigWig (Kent et al. 2010). Of [N]
called peaks, [N] ([%]) passed IDR filtering and [N] ([%]) remained
after blacklist removal.
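The closing sentence of the ChIP-seq template reports counts and percentages, which are easy to get wrong by hand. A minimal sketch of that arithmetic follows; it assumes both percentages are expressed relative to the total number of called peaks, which the template does not specify, and `peak_summary` is a hypothetical helper name.

```python
def peak_summary(called: int, passed_idr: int, after_blacklist: int) -> str:
    """Format the peak-retention sentence; percentages are relative to `called`."""
    if called <= 0:
        raise ValueError("called must be a positive peak count")
    if not 0 <= after_blacklist <= passed_idr <= called:
        raise ValueError("expected after_blacklist <= passed_idr <= called")

    def pct(n: int) -> str:
        return f"{100 * n / called:.1f}%"

    return (
        f"Of {called:,} called peaks, {passed_idr:,} ({pct(passed_idr)}) "
        f"passed IDR filtering and {after_blacklist:,} ({pct(after_blacklist)}) "
        f"remained after blacklist removal."
    )
```

With illustrative counts, `peak_summary(50000, 40000, 39500)` returns `Of 50,000 called peaks, 40,000 (80.0%) passed IDR filtering and 39,500 (79.0%) remained after blacklist removal.` The ordering check catches swapped arguments before they reach the manuscript.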
ATAC-seq Methods
Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq) data
for [biosample] were obtained from the ENCODE Project (ENCODE Project
Consortium 2020) under accession [ENCSR accession]. [Number] biological
replicates of [cell count] [cells/nuclei] each were transposed with
Tn5 transposase ([library kit]) and sequenced on an Illumina [sequencer]
to generate [read count]M [paired-end/single-end] reads of [read length]
bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
adapter-trimmed with Trim Galore (v[version]; Krueger 2015). Trimmed
reads were aligned to [assembly] using Bowtie2 (v[version]; Langmead &
Salzberg 2012) with parameters --very-sensitive -X 2000 --no-mixed
--no-discordant. Mitochondrial reads were removed. Duplicate reads were
removed using Picard MarkDuplicates (v[version]; Broad Institute). Reads
with mapping quality < 30 were excluded. Tn5 transposase offset
correction was applied (+4 bp on the positive strand, -5 bp on the
negative strand; Buenrostro et al. 2013). ENCODE Blacklist v2 regions
(Amemiya et al. 2019) were excluded.
Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters --nomodel --shift -75 --extsize 150 --keep-dup all -q 0.05.
Nucleosome-free fragments (< 150 bp) were used for peak calling. Signal
tracks were generated as fold change over background. TSS enrichment
score was [value] (threshold >= 6; ENCODE data standards; Yan et al. 2020). Of [N]
called peaks, [N] ([%]) passed quality filtering.
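The Tn5 offset correction in the ATAC-seq template can be illustrated on a single read. This sketch assumes 0-based, half-open BED-style coordinates and moves only the 5' end of each read (the transposase cut site); some tools shift both ends of the fragment instead, so treat this as an illustration of the +4/-5 rule rather than a description of any particular pipeline.

```python
def tn5_shift(start: int, end: int, strand: str):
    """Shift the 5' end of a read by +4 bp (+ strand) or -5 bp (- strand).

    Coordinates are assumed to be 0-based, half-open [start, end).
    """
    if strand == "+":
        # The 5' end of a forward-strand read is its start coordinate.
        return start + 4, end
    if strand == "-":
        # The 5' end of a reverse-strand read is its end coordinate.
        return start, end - 5
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")
```

For a read spanning 100 to 150, the forward-strand result is `(104, 150)` and the reverse-strand result is `(100, 145)`.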
RNA-seq Methods
RNA sequencing (RNA-seq) data for [biosample] were obtained from the
ENCODE Project (ENCODE Project Consortium 2020) under accession [ENCSR
accession]. Total RNA was extracted from [number] biological replicates
and [library preparation method] libraries were prepared. Libraries were
sequenced on an Illumina [sequencer] to generate [read count]M
[paired-end/single-end] reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
MultiQC (v[version]; Ewels et al. 2016). Adapter sequences were trimmed
with Trim Galore (v[version]; Krueger 2015). Reads were aligned to
[assembly] with [GENCODE annotation version] gene annotations using STAR
(v[version]; Dobin et al. 2013) in two-pass mode. Gene-level
quantification was performed using RSEM (v[version]; Li & Dewey 2011)
for expected counts and TPM values. Transcript-level quantification was
obtained with kallisto (v[version]; Bray et al. 2016). Mapping rate was
[%] and rRNA contamination was [%] (thresholds: ma