Paper-vs-code consistency audit. After research:scientist implements method from paper, verify implementation matches paper claims. Audits five dimensions — formula matching, hyperparameter parity, eval protocol, notation consistency, citation chain. Emits verification table with match status and severity.
NOT for: running experiments (use /research:run); judging experimental methodology (use /research:judge); literature search (use /research:topic); general code review (use /develop:review (requires develop plugin)). Verify audits implementation-vs-paper fidelity only — does not evaluate whether paper's claims are valid.
HARD_CUTOFF: 900 # 15 min — ADVISORY ONLY, not enforced.
Synchronous Agent() has no escape hatch: the parent cannot interrupt a call mid-flight.
If the agent exceeds the expected ~15 min budget, the parent must handle the timeout in the NEXT turn
(e.g., user re-invokes, or orchestrator surfaces partial results from $RUN_DIR/audit-raw.md after Agent returns).
Same limitation as research:topic — documented, not bypassable from within the skill.
</constants>

<workflow>

Agent Resolution
research:scientist in same plugin as this skill — no fallback needed if research plugin installed. Scientist handles all five audit dimensions in single spawn to preserve cross-dimension context (e.g., notation inconsistency explaining formula mismatch requires holistic paper understanding).
Verify Mode (Steps V1–V6)
Triggered by verify <paper> where <paper> is PDF path, arXiv URL, or multi-line quoted text.
Task tracking: create tasks for V1, V2, V3, V4, V5, V6 at start — before any tool calls.
Step V1: Parse paper input
Input resolution (priority order):
- Path ending `.pdf`: read via Read tool (use `pages: "1-20"` for large PDFs; iterate with subsequent page ranges if needed, max 20 pages per Read call)
- URL matching `arxiv.org`: convert `abs/<id>` to `https://arxiv.org/pdf/<id>` for actual content fetching (e.g., `ARXIV_URL="${ARXIV_URL//arxiv.org\/abs\//arxiv.org\/pdf\/}"`); also fetch abstract page for metadata. Use WebFetch (timeout: 30000).
- URL matching `*.pdf` or `doi.org`: WebFetch (timeout: 30000)
- Multi-line quoted text block: treat as literal paper content
- No paper argument — stop:
"No paper provided. Usage: /research:verify <paper.pdf|arxiv-url|'pasted text'> [--scope <glob>]"
From paper content, extract:
- Header: title, authors, year (for report)
- Claims table: each claim = `{id, section, claim_text, type}` where type is one of: `formula`, `hyperparameter`, `eval`, `architecture`, `result`
- Focus on: equations with concrete terms, specific hyperparameter values, evaluation protocols (metric names, split names, preprocessing steps), architectural specifics, reported numeric results
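The input resolution order at the start of Step V1 can be sketched as a small classifier. This is only an illustration under stated assumptions: `classify_paper_input` and `arxiv_pdf_url` are hypothetical helper names, URLs are recognized by an `http(s)://` prefix so that a remote `.pdf` is not mistaken for a local path, and the `unsupported` result for other URLs is an assumption, since the skill does not say how to treat them. The `sed` call is meant to mirror the `${ARXIV_URL//…}` substitution shown in the list, in a form that also works outside bash.

```shell
# Hypothetical helper: map a paper argument to the resolution branch that handles it.
classify_paper_input() {
  case "$1" in
    "") echo "none" ;;                        # no argument: caller prints usage and stops
    http://*|https://*)
      case "$1" in
        *arxiv.org/*)     echo "arxiv" ;;     # fetch PDF and abstract page via WebFetch
        *.pdf|*doi.org/*) echo "url" ;;       # plain WebFetch
        *)                echo "unsupported" ;;  # assumption: not covered by the skill
      esac ;;
    *.pdf) echo "pdf-path" ;;                 # local file: Read tool, 20 pages per call
    *)     echo "text" ;;                     # anything else: literal paper content
  esac
}

# Hypothetical helper: rewrite an arXiv abstract URL to the matching PDF URL.
arxiv_pdf_url() {
  printf '%s\n' "$1" | sed 's#arxiv\.org/abs/#arxiv.org/pdf/#g'
}
```

For example, `classify_paper_input "$PAPER"` (where `PAPER` is a placeholder variable) returns a branch name that the caller can switch on before choosing Read or WebFetch.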
Unsupported flag check: after all supported flags are extracted, scan `$ARGUMENTS` for remaining `--<token>` tokens. If found: print `` ! Unknown flag(s): `--<token>`. Supported: `--scope`, `--program`, `--strict`, `--dim`. `` then invoke `AskUserQuestion`: (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.
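A minimal sketch of the scan described above, assuming `$ARGUMENTS` is a whitespace-separated string and that flags are always written as separate tokens (`--dim F,H`, not `--dim=F,H`). `find_unknown_flags` is a hypothetical name; the supported set is the one listed in the message. Pasted paper text that itself contains `--` tokens would be misreported, so treat this as an illustration rather than the skill's actual parser.

```shell
# Hypothetical helper: print each --flag that is not in the supported set, one per line.
find_unknown_flags() {
  for tok in "$@"; do
    case "$tok" in
      --scope|--program|--strict|--dim) ;;   # supported: nothing to report
      --*) printf '%s\n' "$tok" ;;           # any other --token is unknown
    esac
  done
}

# Word-split $ARGUMENTS on purpose; set -f (inside the subshell only) stops
# glob characters in a --scope value from being expanded against the filesystem.
UNKNOWN=$(set -f; find_unknown_flags ${ARGUMENTS:-})
if [ -n "$UNKNOWN" ]; then
  echo "! Unknown flag(s): $(printf '%s' "$UNKNOWN" | tr '\n' ' ')"
  echo "Supported: --scope, --program, --strict, --dim."
fi
```

The Abort/Continue choice itself stays with `AskUserQuestion`; the shell part only detects the tokens.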
Pre-compute run directory — persist RUN_DIR and OUT to temp files so V3/V4/V5 (separate Bash shells) can reload them (ADV-H20):
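BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-'); BRANCH="${BRANCH:-main}" # timeout: 3000; fallback applied after the pipeline because tr always exits 0, so '|| echo' never fired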
DATE=$(date -u +%Y-%m-%d) # timeout: 3000
RUN_DIR=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/make_run_dir.py" "verify" ".experiments" 2>/dev/null) # timeout: 5000
mkdir -p .reports/research
BASE="verify-$BRANCH-$DATE"; OUT=".reports/research/$BASE.md"; COUNT=2; while [ -f "$OUT" ]; do OUT=".reports/research/${BASE}-${COUNT}.md"; COUNT=$((COUNT+1)); done
# Persist for V3/V4/V5 — each Bash call is a fresh shell, variables do not survive
# Use BRANCH-DATE discriminator to prevent collisions between concurrent verify runs
_VTAG="${BRANCH}-${DATE}"
echo "$RUN_DIR" > "${TMPDIR:-/tmp}/verify-${_VTAG}-run-dir"
echo "$OUT" > "${TMPDIR:-/tmp}/verify-${_VTAG}-out"
# Write latest-tag pointer so rehydration blocks can find the right files
echo "$_VTAG" > "${TMPDIR:-/tmp}/verify-latest-tag"
State-rehydration block (paste at the top of every separate Bash invocation in V3, V4, V5):
_VTAG=$(cat "${TMPDIR:-/tmp}/verify-latest-tag" 2>/dev/null)
RUN_DIR=$(cat "${TMPDIR:-/tmp}/verify-${_VTAG}-run-dir" 2>/dev/null)
OUT=$(cat "${TMPDIR:-/tmp}/verify-${_VTAG}-out" 2>/dev/null)
[ -z "$RUN_DIR" ] || [ -z "$OUT" ] && { echo "verify: state files missing — V1 must run first" >&2; exit 1; }
Step V2: Resolve codebase scope
Scope resolution (priority order):
- `--scope <glob>` flag: use directly
- `--program <program.md>` flag: Read file, extract `scope_files` from `## Config` fenced block
- Auto-detect: `Glob(pattern="**/*.py")` up to 100 files; prefer files with ML-relevant imports (`torch`, `tensorflow`, `sklearn`, `numpy`, `jax`). If Glob returns 100 files and additional `.py` files exist (i.e., total may exceed 100): print `⚠ Scope truncated at 100 files — large codebase. Fidelity score reflects verified subset only. Use --scope or --program to narrow to relevant modules.`
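The auto-detect preference can be approximated in shell when the Glob tool is not the one doing the ranking. The sketch below is an assumption-laden stand-in, not the skill's own logic: `rank_scope_files` is a hypothetical name, `find` replaces `Glob`, and a file counts as ML-relevant only if a line starts with `import` or `from` followed by one of the five listed packages. Paths containing tabs or newlines are not handled.

```shell
# Hypothetical helper: list .py files under a root, ML-importing files first,
# capped at a limit (default 100, matching the cap described above).
rank_scope_files() {
  root="${1:-.}"
  limit="${2:-100}"
  pattern='^[[:space:]]*(import|from)[[:space:]]+(torch|tensorflow|sklearn|numpy|jax)([.[:space:]]|$)'
  find "$root" -type f -name '*.py' | while IFS= read -r f; do
    if grep -Eq "$pattern" "$f"; then
      printf '0\t%s\n' "$f"    # rank 0: has an ML-relevant import
    else
      printf '1\t%s\n' "$f"    # rank 1: everything else
    fi
  done | LC_ALL=C sort | cut -f2- | head -n "$limit"
}
```

Comparing the number of lines returned with the total number of `.py` files under the root is one way to decide whether the truncation warning applies.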
Apply --dim filter: if --dim F,H specified, only audit those dimensions. Default: all five (F,H,E,N,C).
--dim validation: derive $DIM from the --dim flag (default F,H,E,N,C when flag absent), then validate each specified dimension token against the known set before proceeding. Persist status to temp file so V3 (separate Bash shell) can short-circuit when V2 failed (ADV-M25 — bash exit 2 only terminates V2's shell, not the V3 invocation):
DIM="${DIM:-F,H,E,N,C}" # default when --dim flag not supplied
V2_STATUS="ok"
for _DIM_VAL in $(echo "$DIM" | tr ',' ' '); do
case "$_DIM_VAL" in
F|H|E|N|C) ;;
*)
echo "verify: unknown dimension: '$_DIM_VAL' — valid: F,H,E,N,C" >&2
V2_STATUS="failed"
;;
esac
done
_VTAG=$(cat "${TMPDIR:-/tmp}/verify-latest-tag" 2>/dev/null)
echo "$V2_STATUS" > "${TMPDIR:-/tmp}/verify-${_VTAG}-v2-status"
if [ "$V2_STATUS" = "failed" ]; then exit 2; fi # 'if' form so the block exits 0 when validation passed
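The validation block above reads `$DIM` but does not show how it is derived from the `--dim` flag. One possible derivation is sketched here, assuming `$ARGUMENTS` is whitespace-separated and that the value follows `--dim` as its own token; `parse_dim` is a hypothetical helper, not something the skill defines.

```shell
# Hypothetical helper: print the value that follows --dim, or the default
# F,H,E,N,C when the flag is absent or has no value after it.
parse_dim() {
  dim="F,H,E,N,C"
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--dim" ] && [ "$#" -ge 2 ]; then
      dim="$2"
    fi
    shift
  done
  printf '%s\n' "$dim"
}

# set -f applies only inside the command substitution and keeps glob
# characters elsewhere in $ARGUMENTS from expanding during word splitting.
DIM=$(set -f; parse_dim ${ARGUMENTS:-})
```

The result still goes through the per-token `case` check, so a value such as `F,X` is rejected there rather than in the parser.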
V3 entry guard (run before any V3 work — paste immediately after the state-rehydration block):
_VTAG=$(cat "${TMPDIR:-/tmp}/verify-latest-tag" 2>/dev/null)
V2_STATUS=$(cat "${TMPDIR:-/tmp}/verify-${_VTAG}-v2-status" 2>/dev/null || echo "ok")
if [ "$V2_STATUS" = "failed" ]; then
echo "verify V3: dimension validation failed in V2 — skipping V3."
exit 1
fi
Step V3: Five-dimension audit via scientist
Spawn research:scientist via Agent(subagent_type="research:scientist", prompt="..."). Single agent handles all five dimensions — cross-dimension context requires holistic paper understanding.
Scientist prompt:
Act as an ML reproducibility auditor verifying implementation fidelity against a published paper.
Paper: <title> (<year>) by <authors>
Paper content: <inline content or path to read>
Claims to verify (from V1 extraction):
<JSON claims table>
Codebase scope files:
<list of files from V2>
Read each file listed in `Codebase scope files` using the Read tool before beginning your audit.
Active dimensions: <F,H,E,N,C or subset from --dim>
Audit the implementation against the paper across the active dimensions:
[F] Formula matching: every equation in the paper with concrete terms — does code implement the same math? Check loss functions, forward passes, norma