Post-run retrospective analysis. After /research:run completes, reads .experiments/state/<run-id>/experiments.jsonl, computes statistical significance, detects dead iterations, flags suspicious metric jumps, generates learning summary with next-hypothesis queue.
NOT for: running experiments (use /research:run); designing experiments (use /research:plan); validating methodology (use /research:judge); verifying paper implementation (use /research:verify). Read-only — never modifies code, commits, or experiment state.
Agent Resolution
_RESEARCH_SHARED=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/resolve_shared.py" 2>/dev/null) # timeout: 5000
Read $_RESEARCH_SHARED/agent-resolution.md (use the path printed by the bash block above — substitute the resolved value, do not pass the literal $_RESEARCH_SHARED string to the Read tool). Contains: foundry check + fallback table. research:scientist in same plugin — no fallback needed if research plugin installed.
Retro Mode (Steps T1–T7)
Triggered by retro, retro <run-id>, or retro <run-id> --compare <run-id-2>.
Defaults: --threshold 0.001, --alpha 0.05.
Unsupported flag check — after all supported flags extracted, scan $ARGUMENTS for remaining --<token> tokens. If found: print ! Unknown flag(s): \--<token>`. Supported: `--compare`, `--threshold`, `--alpha`.then invokeAskUserQuestion` — (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.
Task tracking: create tasks for T1–T7 at start — before any tool calls.
Step T1: Locate and load run data
Input resolution (priority order):
- Explicit
<run-id>argument → read.experiments/state/<run-id>/ - No argument → scan
.experiments/state/, pick latest dir wherestate.jsonhasstatus: completedorstatus: goal-achieved - None found → stop with error:
No completed run found. Run /research:run first, or provide: /research:retro <run-id>
Load files from .experiments/state/<run-id>/:
state.json: extractgoal,best_metric,config(includingmetric.direction),iterationcount,best_commit. Computebaseline_metricfrom iteration 0 inexperiments.jsonl.experiments.jsonl: full iteration history — validate each line parses as JSON. If last line truncated, warn and skip.diary.md: if present, read for qualitative context in T5.
If --compare <run-id-2> present: load second run identically from .experiments/state/<run-id-2>/. If not found, stop: "Compare target not found: .experiments/state/<run-id-2>/. Check run ID and retry."
Assign RUN_ID_ARG from $ARGUMENTS — first positional non-flag token, empty if absent (ADV-H17):
# Strip known flags before extracting positional; only positional is treated as run-id
_REMAINDER=$(echo "$ARGUMENTS" | sed -E 's/--compare[= ]+[^ ]+//g; s/--threshold[= ]+[^ ]+//g; s/--alpha[= ]+[^ ]+//g')
RUN_ID_ARG=$(echo "$_REMAINDER" | awk '{for (i=1; i<=NF; i++) if ($i !~ /^--/) { print $i; exit }}')
RUN_ID_ARG="${RUN_ID_ARG:-}"
# Persist for T3 (separate Bash shell — variables lost between calls)
echo "$RUN_ID_ARG" > "${TMPDIR:-/tmp}/retro-run-id"
Pre-compute run directory — also fix $RUN_ID (resolved from input resolution above) and persist $RUN_DIR for T3 (ADV-H18 + ADV-L16):
# RUN_ID = run-id argument if provided, else dir name of latest completed run <!-- loads: find_run_id.py -->
RUN_ID="${RUN_ID_ARG:-$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/find_run_id.py" .experiments/state 2>/dev/null)}"
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-' || echo 'main') # timeout: 3000
echo "$RUN_ID" > "${TMPDIR:-/tmp}/retro-run-id-resolved" # persist resolved id for T3 / fallback paths
RUN_DIR=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/make_run_dir.py" "retro" ".experiments" 2>/dev/null) # timeout: 5000
mkdir -p "$RUN_DIR/scripts" # timeout: 3000
echo "$RUN_DIR" > "${TMPDIR:-/tmp}/retro-run-dir" # T3 + fallback path reload from temp file
Step T2: Statistical significance analysis
Run the Wilcoxon signed-rank test via the bundled bin/ script — pure Python with scipy.stats:
ALPHA="${ALPHA:-0.05}"
METRIC_DIRECTION=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/read_state_field.py" ".experiments/state/$RUN_ID/state.json" "config.metric.direction" --default "higher" 2>/dev/null || echo "higher") # loads: read_state_field.py
RETRO_RESULT=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/retro_analyze.py" --jsonl ".experiments/state/$RUN_ID/experiments.jsonl" --baseline "baseline" --alpha "$ALPHA" --direction "$METRIC_DIRECTION") # timeout: 30000
Contract — script reads JSONL, extracts metric values for ALL iterations with status == "kept", pairs each against the baseline record (status == "baseline"), runs a one-sided Wilcoxon signed-rank test, and prints a single line of JSON to stdout:
{"significant": bool, "p_value": float, "statistic": float, "n": int}on success{"significant": false, "p_value": null, "statistic": null, "n": <N>, "reason": "<msg>"}whenN < 6or scipy missing{"error": "<msg>"}on input error (exit 2 — missing file, malformed JSON, no baseline record)
Exit codes: 0 = significant · 1 = not significant (or insufficient data) · 2 = input error.
Direction handling — script branches on --direction:
higher→alternative = "greater"(improvement = candidate > baseline)lower→alternative = "less"(improvement = candidate < baseline — for loss, latency, error)
Read direction from state.json config (or infer from goal text); pass via $METRIC_DIRECTION.
Effect size — script does not return rank-biserial r directly. Compute via the bundled bin/ script:
EFFECT_R=$(echo "$RETRO_RESULT" | python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/compute_effect_size.py") # timeout: 5000
If --compare: invoke the script a second time on the second run's experiments.jsonl; downstream report renders a second row.
Write the combined results (parsed JSON plus computed r) to $RUN_DIR/stats-results.json via Write tool.
Step T3: Dead iteration detection
Definition: dead iteration window = 3+ consecutive iterations (any status) where abs(metric_delta) < threshold (default --threshold 0.001).
Scale check (after loading baseline_metric in T1): if baseline_metric > 100 * threshold, print:
! Threshold advisory: baseline_metric=[value] is >100x the default threshold (0.001).
For this metric scale, consider: --threshold [baseline_metric * 0.0001:.4f]
Proceeding with --threshold [threshold] — override with: /research:retro <run-id> --threshold <value>
Apply advisory threshold automatically only when --threshold not explicitly provided by user.
Timeout detection: when scanning reverted iterations, check status field. If status == "timeout": classify as timeout-as-revert (see Notes). Otherwise: flag any reverted iteration where delta is in the correct improvement direction (i.e., metric moved toward goal) as "possible timeout — verify commit [sha]"; do not count delta as valid.
Scan experiments.jsonl sequentially, skipping iteration 0 (baseline). For each window of 3+ consecutive iterations where abs(delta) < threshold:
- Record:
start_iter,end_iter,count - Classify type:
dead-plateauif all iterations in window havestatus: kept;dead-churnif mixedkept/reverted/other - Compute
wasted_iters= total iterations in all dead windows
Re-hydrate cross-Bash state at the start of every separate Bash invocation in T3 (each Bash call is a fresh shell — $RUN_DIR / $RUN_ID_ARG lost across calls; ADV-H18 / ADV-L16):
RUN_DIR=$(cat "${TMPDIR:-/tmp}/retro-run-dir" 2>/dev/null)
RUN_ID_ARG=$(cat "${TMPDIR: