Ablation study runner — after /research:run finds improvements, fortify identifies which components contributed, generates ablation variants (remove one component at a time), runs each in isolated git worktrees (main repo never modified), ranks component importance, optionally generates reviewer Q&A calibrated to venue.
NOT for: initial optimization loop (use /research:run); methodology validation (use /research:judge); paper-vs-code consistency (use /research:verify); hypothesis generation (use research:scientist directly). Fortify runs ablation studies on completed runs only.
MAX_ABLATION_CANDIDATES: 8 (ceiling — scientist produces 3–8; --max-ablations caps further)
METRIC_TIMEOUT_MS: 360000 (6 min — same as run SKILL.md)
GUARD_TIMEOUT_MS: 360000
GIT_OP_TIMEOUT_MS: 15000
SANITY_DIVERGENCE_PCT: 2.0 (full-variant vs best_metric mismatch threshold)
IMPORTANCE_CLASS_CRITICAL: 50.0 (% of full metric lost)
IMPORTANCE_CLASS_SIGNIFICANT: 10.0
FORTIFY_DIR_BASE: .experiments
STATE_DIR_BASE: .experiments/state
METRIC_CMD_DEFAULT: "python -m pytest -x --tb=no -q"
GUARD_CMD_DEFAULT: "git diff --stat HEAD"
</constants>
Environment overrides — set these before invoking the skill to override per-variant defaults:
METRIC_CMD— command run inside each variant worktree to measure the ablation metric (default:python -m pytest -x --tb=no -q)GUARD_CMD— command run inside each variant worktree to detect regressions (default:git diff --stat HEAD)STATE_DIR_BASE— base directory for source-run state lookups (default:.developments/fortify-state)FORTIFY_DIR_BASE— base directory for per-run fortify artifacts (default:.developments/fortify)
The constants block defaults above are YAML-only — bash blocks read environment variables (with ${VAR:-default} fallback) and never source values directly from YAML.
CRITICAL: Worktree-based isolation
Do NOT use git checkout -b <branch> for ablations — dirties main working tree, corrupts concurrent tool calls. Each ablation gets own git worktree under $FORTIFY_DIR/worktrees/<variant>, created from best_commit. Main working tree NEVER modified. Cleanup: git worktree remove --force per variant; git worktree prune on interrupt.
Fortify Mode (Steps F1–F8)
Triggered by fortify or fortify <run-id|program.md>.
Task tracking: create tasks for F1, F2, F3, F4, F5, F6, F7, F8 at start — before any tool calls.
Step F1: Locate source run, parse flags, and validate judge approval
Extract flags: --venue <VENUE>, --max-ablations <N>, --skip-run.
Unsupported flag check — after all supported flags extracted, scan $ARGUMENTS for remaining --<token> tokens. If found: print ! Unknown flag(s): \--<token>`. Supported: `--venue`, `--max-ablations`, `--skip-run`.then invokeAskUserQuestion` — (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.
Input resolution (priority order):
- Explicit
<run-id>argument → read$STATE_DIR_BASE/<run-id>/state.json - Explicit
<program.md>argument → scan$STATE_DIR_BASE/*/state.jsonfor matchingprogram_file, pick latest withstatus: completedorstatus: goal-achieved - No argument → scan
$STATE_DIR_BASE/, pick latest withstatus: completedorstatus: goal-achieved - None found → stop:
fortify: No completed run found. Run /research:run first.
Initialize base directory bash variables (constants YAML is not auto-exported; environment overrides honored):
STATE_DIR_BASE="${STATE_DIR_BASE:-.developments/fortify-state}"
FORTIFY_DIR_BASE="${FORTIFY_DIR_BASE:-.developments/fortify}"
METRIC_CMD="${METRIC_CMD:-python -m pytest -x --tb=no -q}"
GUARD_CMD="${GUARD_CMD:-git diff --stat HEAD}"
Assign $RUN_ID from input resolution above (must be set before guard block uses it):
# Resolve RUN_ID from $ARGUMENTS (first non-flag token) or auto-detect latest completed run
_ARG1=$(echo "$ARGUMENTS" | awk '{print $1}')
if [ -n "$_ARG1" ] && [ "${_ARG1#-}" = "$_ARG1" ] && [ ! -f "$_ARG1" ] && [ -d "$STATE_DIR_BASE/$_ARG1" ]; then
RUN_ID="$_ARG1"
elif [ -n "$_ARG1" ] && [ -f "$_ARG1" ]; then
# program.md path — find latest completed run matching program_file <!-- loads: find_run_id.py -->
RUN_ID=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/find_run_id.py" "$STATE_DIR_BASE" --match-program "$_ARG1" 2>/dev/null)
else
RUN_ID=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/find_run_id.py" "$STATE_DIR_BASE" 2>/dev/null)
fi
[ -z "$RUN_ID" ] && { echo "fortify: No completed run found. Run /research:run first."; exit 1; }
Guard: judge approval required. Judge skill writes verdict to .reports/research/judge-<branch>-<date>.md — scan for APPROVED verdict line:
JUDGE_VERDICT_FILE=$(ls -t .reports/research/judge-*.md 2>/dev/null | head -1) # timeout: 5000
if [ -z "$JUDGE_VERDICT_FILE" ]; then
echo "fortify: BLOCKED — no judge verdict found in .reports/research/."
echo "Ablation studies require an approved baseline. Run: /research:judge <program.md>"
exit 1
fi
# Preserve multi-word verdicts (e.g. "NEEDS REVISION") — strip trailing whitespace only, not internal spaces
JUDGE_VERDICT=$(grep -i '^[*]*[Vv]erdict[*]*:' "$JUDGE_VERDICT_FILE" | head -1 | sed 's/\*\*//g' | sed -E 's/.*[Vv]erdict[: ]+//' | sed 's/[[:space:]]*$//')
# Program cross-match: confirm verdict was issued for the current experiment's program, not a different one
PROGRAM_FILE=$(grep -iE '^[*]*(Program(_file)?|Program file)[*]*:' "$JUDGE_VERDICT_FILE" | head -1 | sed 's/\*\*//g' | sed -E 's/.*:[[:space:]]*//' | sed 's/[[:space:]]*$//')
# Use explicit run-specific state.json path (resolved during F1 input resolution) — never CWD-relative
STATE_PROGRAM=$(jq -r '.program_file // ""' "$STATE_DIR_BASE/$RUN_ID/state.json" 2>/dev/null)
if [ -n "$STATE_PROGRAM" ] && [ -n "$PROGRAM_FILE" ] && [ "$PROGRAM_FILE" != "$STATE_PROGRAM" ]; then
printf "! BLOCKED — judge verdict references program '%s' but current experiment is for '%s'\n" "$PROGRAM_FILE" "$STATE_PROGRAM"
printf "Run: /research:judge %s\n" "$STATE_PROGRAM"
exit 1
fi
# Confirm program file still exists on disk
if [ -n "$PROGRAM_FILE" ] && [ ! -f "$PROGRAM_FILE" ]; then
printf "! BLOCKED — program file %s referenced by judge verdict not found on disk\n" "$PROGRAM_FILE"
exit 1
fi
Verify JUDGE_VERDICT == "APPROVED". The program cross-match above guarantees the verdict was issued for the current experiment — fortify cannot ablate against a different program's verdict. Apply explicit bash gate — prose alone never halts execution:
JUDGE_VERDICT="${JUDGE_VERDICT:-REJECTED}"
if [ "$JUDGE_VERDICT" != "APPROVED" ]; then
echo "fortify: BLOCKED — no APPROVED judge verdict found for this program."
echo "Judge verdict: $JUDGE_VERDICT — stopping before implementation (F1–F7)."
echo "Ablation studies require an approved baseline. Run: /research:judge <program.md>"
exit 1
fi
Note: do NOT infer from
methodology.mdalone —methodology_rating: soundis one input to verdict, not verdict itself. Only## Verdictline in judge output file is authoritative.
Read from state.json: goal, best_metric, best_commit, config (including metric_cmd, guard_cmd, compute), program_file.
Also read baseline_commit — iteration 0 commit from experiments.jsonl (first line, status: "baseline", field "commit").
Pre-compute run directory (each in separate Bash call):
BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-' || echo 'main') # timeout: 3000
TS=$(date -u +%Y-%m-%dT%H-%M-%SZ) # timeout: 3000
FORTIFY_DIR="$FORTIFY_DIR_BASE/fortify-$TS" # timeout: 5000
STATE_DI