Post-run retrospective analysis. After /research:run completes, reads .experiments/state/<run-id>/experiments.jsonl, computes statistical significance, detects dead iterations, flags suspicious metric jumps, generates learning summary with next-hypothesis queue.

NOT for: running experiments (use /research:run); designing experiments (use /research:plan); validating methodology (use /research:judge); verifying paper implementation (use /research:verify). Read-only — never modifies code, commits, or experiment state.

</objective> <workflow>

Agent Resolution

_RESEARCH_SHARED=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/resolve_shared.py" 2>/dev/null)  # timeout: 5000

Read $_RESEARCH_SHARED/agent-resolution.md (use the path printed by the bash block above — substitute the resolved value, do not pass the literal $_RESEARCH_SHARED string to the Read tool). Contains: foundry check + fallback table. research:scientist in same plugin — no fallback needed if research plugin installed.

Retro Mode (Steps T1–T7)

Triggered by retro, retro <run-id>, or retro <run-id> --compare <run-id-2>.

Defaults: --threshold 0.001, --alpha 0.05.

Unsupported flag check — after all supported flags extracted, scan $ARGUMENTS for remaining --<token> tokens. If found: print ! Unknown flag(s): \--<token>`. Supported: `--compare`, `--threshold`, `--alpha`.then invokeAskUserQuestion` — (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.

Task tracking: create tasks for T1–T7 at start — before any tool calls.

Step T1: Locate and load run data

Input resolution (priority order):

Explicit <run-id> argument → read .experiments/state/<run-id>/
No argument → scan .experiments/state/, pick latest dir where state.json has status: completed or status: goal-achieved

None found → stop with error:

No completed run found. Run /research:run first, or provide: /research:retro <run-id>

Load files from .experiments/state/<run-id>/:

state.json: extract goal, best_metric, config (including metric.direction), iteration count, best_commit. Compute baseline_metric from iteration 0 in experiments.jsonl.
experiments.jsonl: full iteration history — validate each line parses as JSON. If last line truncated, warn and skip.
diary.md: if present, read for qualitative context in T5.

If --compare <run-id-2> present: load second run identically from .experiments/state/<run-id-2>/. If not found, stop: "Compare target not found: .experiments/state/<run-id-2>/. Check run ID and retry."

Assign RUN_ID_ARG from $ARGUMENTS — first positional non-flag token, empty if absent (ADV-H17):

# Strip known flags before extracting positional; only positional is treated as run-id
_REMAINDER=$(echo "$ARGUMENTS" | sed -E 's/--compare[= ]+[^ ]+//g; s/--threshold[= ]+[^ ]+//g; s/--alpha[= ]+[^ ]+//g')
RUN_ID_ARG=$(echo "$_REMAINDER" | awk '{for (i=1; i<=NF; i++) if ($i !~ /^--/) { print $i; exit }}')
RUN_ID_ARG="${RUN_ID_ARG:-}"
# Persist for T3 (separate Bash shell — variables lost between calls)
echo "$RUN_ID_ARG" > "${TMPDIR:-/tmp}/retro-run-id"

Pre-compute run directory — also fix $RUN_ID (resolved from input resolution above) and persist $RUN_DIR for T3 (ADV-H18 + ADV-L16):

# RUN_ID = run-id argument if provided, else dir name of latest completed run  <!-- loads: find_run_id.py -->
RUN_ID="${RUN_ID_ARG:-$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/find_run_id.py" .experiments/state 2>/dev/null)}"
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-' || echo 'main')  # timeout: 3000
echo "$RUN_ID" > "${TMPDIR:-/tmp}/retro-run-id-resolved"  # persist resolved id for T3 / fallback paths

RUN_DIR=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/make_run_dir.py" "retro" ".experiments" 2>/dev/null)  # timeout: 5000
mkdir -p "$RUN_DIR/scripts"  # timeout: 3000
echo "$RUN_DIR" > "${TMPDIR:-/tmp}/retro-run-dir"  # T3 + fallback path reload from temp file

Step T2: Statistical significance analysis

Run the Wilcoxon signed-rank test via the bundled bin/ script — pure Python with scipy.stats:

ALPHA="${ALPHA:-0.05}"
METRIC_DIRECTION=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/read_state_field.py" ".experiments/state/$RUN_ID/state.json" "config.metric.direction" --default "higher" 2>/dev/null || echo "higher")  # loads: read_state_field.py
RETRO_RESULT=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/retro_analyze.py" --jsonl ".experiments/state/$RUN_ID/experiments.jsonl" --baseline "baseline" --alpha "$ALPHA" --direction "$METRIC_DIRECTION")  # timeout: 30000

Contract — script reads JSONL, extracts metric values for ALL iterations with status == "kept", pairs each against the baseline record (status == "baseline"), runs a one-sided Wilcoxon signed-rank test, and prints a single line of JSON to stdout:

{"significant": bool, "p_value": float, "statistic": float, "n": int} on success
{"significant": false, "p_value": null, "statistic": null, "n": <N>, "reason": "<msg>"} when N < 6 or scipy missing
{"error": "<msg>"} on input error (exit 2 — missing file, malformed JSON, no baseline record)

Exit codes: 0 = significant · 1 = not significant (or insufficient data) · 2 = input error.

Direction handling — script branches on --direction:

higher → alternative = "greater" (improvement = candidate > baseline)
lower → alternative = "less" (improvement = candidate < baseline — for loss, latency, error)

Read direction from state.json config (or infer from goal text); pass via $METRIC_DIRECTION.

Effect size — script does not return rank-biserial r directly. Compute via the bundled bin/ script:

EFFECT_R=$(echo "$RETRO_RESULT" | python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/compute_effect_size.py")  # timeout: 5000

If --compare: invoke the script a second time on the second run's experiments.jsonl; downstream report renders a second row.

Write the combined results (parsed JSON plus computed r) to $RUN_DIR/stats-results.json via Write tool.

Step T3: Dead iteration detection

Definition: dead iteration window = 3+ consecutive iterations (any status) where abs(metric_delta) < threshold (default --threshold 0.001).

Scale check (after loading baseline_metric in T1): if baseline_metric > 100 * threshold, print:

! Threshold advisory: baseline_metric=[value] is >100x the default threshold (0.001).
  For this metric scale, consider: --threshold [baseline_metric * 0.0001:.4f]
  Proceeding with --threshold [threshold] — override with: /research:retro <run-id> --threshold <value>

Apply advisory threshold automatically only when --threshold not explicitly provided by user.

Timeout detection: when scanning reverted iterations, check status field. If status == "timeout": classify as timeout-as-revert (see Notes). Otherwise: flag any reverted iteration where delta is in the correct improvement direction (i.e., metric moved toward goal) as "possible timeout — verify commit [sha]"; do not count delta as valid.

Scan experiments.jsonl sequentially, skipping iteration 0 (baseline). For each window of 3+ consecutive iterations where abs(delta) < threshold:

Record: start_iter, end_iter, count
Classify type: dead-plateau if all iterations in window have status: kept; dead-churn if mixed kept/reverted/other
Compute wasted_iters = total iterations in all dead windows

Re-hydrate cross-Bash state at the start of every separate Bash invocation in T3 (each Bash call is a fresh shell — $RUN_DIR / $RUN_ID_ARG lost across calls; ADV-H18 / ADV-L16):

RUN_DIR=$(cat "${TMPDIR:-/tmp}/retro-run-dir" 2>/dev/null)
RUN_ID_ARG=$(cat "${TMPDIR:

retro

Como adicionar

Cole no README do seu repo

Skills relacionadas

learn-codebase

remove-deadcode

apify-competitor-intelligence

ad-creative

Receba novas skills de Marketing toda segunda

Agent Resolution

Retro Mode (Steps T1–T7)

Step T1: Locate and load run data

Step T2: Statistical significance analysis

Step T3: Dead iteration detection

Comentários · Nenhum comentário