Scientific Writing from ENCODE Provenance

Generate publication-quality scientific writing from ENCODE analysis records. This skill integrates with data-provenance and cite-encode to auto-generate methods from logged pipeline runs. Every generated section follows rigorous scientific documentation standards -- complete reporting of all experimental and computational parameters with zero ambiguity.

When to Use

User wants to write publication-ready methods sections, figure legends, or data availability statements
User asks about "methods section", "figure legend", "scientific writing", or "manuscript preparation"
User needs to auto-generate methods text from logged provenance/analysis steps
User wants templates for supplementary tables, Key Resources Tables, or tool citation formatting
Example queries: "write a methods section for my ChIP-seq analysis", "generate a figure legend for my heatmap", "format my data availability statement"

Overview

Most methods sections in genomics papers are incomplete. They omit software versions, skip reference file details, conflate technical and biological replicates, and use phrases like "default parameters" without stating what those defaults are. Reviewers catch these omissions, and readers cannot reproduce the analysis.

This skill solves the problem by generating methods text directly from the provenance chain. When every processing step has been logged (via data-provenance), the methods section writes itself. When metadata has been captured from ENCODE (via track-experiments), the experimental details are already recorded. This skill assembles these records into publication-ready prose, figure legends, supplementary tables, and data availability statements.

Complete reporting is not aspirational -- it is the minimum bar for reproducible science.

Scientific Documentation Standards -- Required Metadata

Every methods section MUST report the following fields. Omitting any of these fields produces an incomplete methods section that reviewers will flag and readers cannot reproduce.

Field	Example	Why Required
Library preparation	TruSeq ChIP	Affects fragment size distribution and GC bias
Biological replicates	n=2 per condition	Statistical power and reproducibility
Cells/nuclei per replicate	50,000 cells	Input sufficiency for the assay
Sequencing reads	30M paired-end	Coverage depth determines sensitivity
Read length	2x150 bp	Alignment accuracy and mappability
Paired/single-end	Paired-end	Fragment size estimation, structural variants
Sequencer	NovaSeq 6000	Quality profile, error model, binning
Lab/batch	Snyder Lab, Stanford	Batch effect awareness
Reference genome	GRCh38/hg38	Coordinate system for all downstream analysis
Gene annotation	GENCODE v44	Gene definitions change between versions
ENCODE accessions	ENCSR133RZO	Exact data provenance for reproducibility
Blacklist version	ENCODE Blacklist v2	Artifact exclusion affects all peak-based analyses
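The required-field table above lends itself to an automated completeness check before any prose is generated. The following is a minimal sketch, assuming the metadata has already been collected into a plain dictionary. The key names are hypothetical labels derived from the table rows, not an ENCODE schema or the output format of any tool named in this document.

```python
# Hypothetical field names mirroring the table rows; not an ENCODE schema.
REQUIRED_FIELDS = [
    "library_preparation",
    "biological_replicates",
    "cells_per_replicate",
    "sequencing_reads",
    "read_length",
    "paired_or_single_end",
    "sequencer",
    "lab_batch",
    "reference_genome",
    "gene_annotation",
    "encode_accessions",
    "blacklist_version",
]


def missing_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty in `metadata`."""
    return [
        name
        for name in REQUIRED_FIELDS
        if metadata.get(name) in (None, "", [], {})
    ]


# Example: a partially populated record (values taken from the table above).
record = {
    "reference_genome": "GRCh38/hg38",
    "sequencer": "NovaSeq 6000",
    "gene_annotation": "",  # present but empty, so still reported as missing
}
print(missing_fields(record))
```

Running a check like this first means an incomplete record is reported as a list of gaps rather than silently producing a methods paragraph with holes in it.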

How to Populate These Fields

# Track the experiment to capture metadata
encode_track_experiment(accession="ENCSR...", fetch_publications=True)

# Get full experiment details
encode_get_experiment(accession="ENCSR...")

# Get file-level metadata
encode_get_file_info(accession="ENCFF...")

# Get provenance for derived files
encode_get_provenance(file_path="/path/to/derived/file.bed")

Methods Section Templates

Each template below is a fill-in-the-blank paragraph that reads like a real methods section. Bracketed fields [like this] are populated from ENCODE metadata and provenance records. Every template follows these documentation standards.
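Filling the bracketed fields can be sketched as a simple substitution pass. This is an illustrative sketch only: the function name and the flat dictionary of values are assumptions, and the skill's actual mechanism is not described in this document. Fields without a value are left visible in the text and reported, so a gap is never silently dropped.

```python
import re

# Matches a bracketed field such as [target] or [ENCSR accession].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: dict) -> tuple[str, list[str]]:
    """Replace [field] placeholders; return the text and any unfilled fields."""
    unfilled = []

    def substitute(match):
        field = match.group(1)
        if field in values:
            return str(values[field])
        unfilled.append(field)
        return match.group(0)  # leave the placeholder visible in the output

    return PLACEHOLDER.sub(substitute, template), unfilled


text, todo = fill_template(
    "data for [target] in [biosample] were aligned to [assembly]",
    {"target": "CTCF", "biosample": "K562"},
)
print(text)  # data for CTCF in K562 were aligned to [assembly]
print(todo)  # ['assembly']
```

A flat dictionary cannot distinguish the repeated v[version] placeholders that belong to different tools, so a fuller implementation would need per-tool keys; that refinement is left out of this sketch.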

ChIP-seq Methods

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for
[target] in [biosample] were obtained from the ENCODE Project (ENCODE
Project Consortium 2020) under accession [ENCSR accession]. [Library
preparation method] libraries were prepared from [number] biological
replicates ([cells/nuclei] per replicate) and sequenced on an Illumina
[sequencer model] to generate [read count]M [paired-end/single-end]
reads of [read length] bp per replicate.

Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
trimmed with Trim Galore (v[version]; Krueger 2015) to remove adapter
sequences and low-quality bases (Phred < 20). Trimmed reads were aligned
to the [organism] reference genome ([assembly]) using BWA-MEM (v[version];
Li 2013) with default parameters ([state the default values]). Duplicate
reads were marked and removed using Picard MarkDuplicates (v[version];
Broad Institute). Reads with
mapping quality < 30 were excluded using samtools (v[version]; Danecek
et al. 2021). Reads mapping to ENCODE Blacklist v2 regions (Amemiya et al.
2019) were removed using bedtools intersect (v[version]; Quinlan & Hall
2010).

Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters [--broad for broad marks / -q 0.05 for narrow marks]. For
narrow-peak targets, IDR analysis (Li et al. 2011) was performed on
replicate peak sets with a threshold of [0.05]. Signal tracks (fold
change over control) were generated using MACS2 bdgcmp and converted
to bigWig format using bedGraphToBigWig (Kent et al. 2010). Of [N]
called peaks, [N] ([%]) passed IDR filtering and [N] ([%]) remained
after blacklist removal.
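The closing sentence of the ChIP-seq template reports peak counts with percentages. A minimal sketch of that arithmetic follows, assuming both percentages are expressed relative to the total number of called peaks. The template does not state the denominator, so that choice and the function name are assumptions.

```python
def retention_summary(called: int, passed_idr: int, after_blacklist: int) -> str:
    """Format the peak-retention sentence; percentages are relative to `called`."""
    if called <= 0:
        raise ValueError("called must be a positive peak count")
    if not (0 <= after_blacklist <= passed_idr <= called):
        raise ValueError("expected after_blacklist <= passed_idr <= called")
    idr_pct = 100 * passed_idr / called
    blacklist_pct = 100 * after_blacklist / called
    return (
        f"Of {called:,} called peaks, {passed_idr:,} ({idr_pct:.1f}%) passed "
        f"IDR filtering and {after_blacklist:,} ({blacklist_pct:.1f}%) remained "
        f"after blacklist removal."
    )


# Illustrative numbers only, not real results.
print(retention_summary(50000, 32000, 31500))
```

The ordering check rejects counts that grow between filtering steps, which usually indicates that the wrong files were counted.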

ATAC-seq Methods

Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq) data
for [biosample] were obtained from the ENCODE Project (ENCODE Project
Consortium 2020) under accession [ENCSR accession]. [Number] biological
replicates of [cells/nuclei] each were transposed with
Tn5 transposase ([library kit]) and sequenced on an Illumina [sequencer]
to generate [read count]M [paired-end/single-end] reads of [read length]
bp per replicate.

Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
adapter-trimmed with Trim Galore (v[version]; Krueger 2015). Trimmed
reads were aligned to [assembly] using Bowtie2 (v[version]; Langmead &
Salzberg 2012) with parameters --very-sensitive -X 2000 --no-mixed
--no-discordant. Mitochondrial reads were removed. Duplicate reads were
removed using Picard MarkDuplicates (v[version]; Broad Institute). Reads
with mapping quality < 30 were excluded. Tn5 transposase offset
correction was applied (+4 bp on the positive strand, -5 bp on the
negative strand; Buenrostro et al. 2013). ENCODE Blacklist v2 regions
(Amemiya et al. 2019) were excluded.

Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters --nomodel --shift -75 --extsize 150 --keep-dup all -q 0.05.
Nucleosome-free fragments (< 150 bp) were used for peak calling. Signal
tracks were generated as fold change over background. TSS enrichment
score was [value] (threshold >= 6; ENCODE data standards; Yan et al.
2020). Of [N] called peaks, [N] ([%]) passed quality filtering.
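The Tn5 offset correction described in the ATAC-seq template (+4 bp on the positive strand, -5 bp on the negative strand) can be illustrated on a single read interval. This sketch assumes BED-style coordinates, where the 5' end of a plus-strand read is its start and the 5' end of a minus-strand read is its end. The function name is hypothetical, and dedicated tools may apply the shift differently, for example to both ends of a fragment.

```python
def tn5_shift(start: int, end: int, strand: str) -> tuple[int, int]:
    """Shift the 5' end of a read: +4 bp on '+', -5 bp on '-'."""
    if strand == "+":
        return start + 4, end
    if strand == "-":
        return start, end - 5
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


print(tn5_shift(100, 150, "+"))  # (104, 150)
print(tn5_shift(100, 150, "-"))  # (100, 145)
```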

RNA-seq Methods

RNA sequencing (RNA-seq) data for [biosample] were obtained from the
ENCODE Project (ENCODE Project Consortium 2020) under accession [ENCSR
accession]. Total RNA was extracted from [number] biological replicates
and [library preparation method] libraries were prepared. Libraries were
sequenced on an Illumina [sequencer] to generate [read count]M
[paired-end/single-end] reads of [read length] bp per replicate.

Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
MultiQC (v[version]; Ewels et al. 2016). Adapter sequences were trimmed
with Trim Galore (v[version]; Krueger 2015). Reads were aligned to
[assembly] with [GENCODE annotation version] gene annotations using STAR
(v[version]; Dobin et al. 2013) in two-pass mode. Gene-level
quantification was performed using RSEM (v[version]; Li & Dewey 2011)
for expected counts and TPM values. Transcript-level quantification was
obtained with Kallisto (v[version]; Bray et al. 2016). Mapping rate was
[%] and rRNA contamination was [%] (thresholds: ma
