Research-supervisor review of program.md — validates experimental methodology, emits APPROVED / NEEDS-REVISION / BLOCKED verdict before expensive run loop. Read-only; never modifies code or state.
NOT for: running experiments (use /research:run); designing hypotheses (use research:scientist agent); config quality (use /foundry:audit; requires foundry plugin).
Agent Resolution
_RESEARCH_SHARED=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/resolve_shared.py" 2>/dev/null) # timeout: 5000
Read $_RESEARCH_SHARED/agent-resolution.md. Contains foundry check + fallback table. If foundry not installed: use table to substitute each foundry:X with general-purpose. Agents: foundry:solution-architect, research:scientist.
| Agent | Fallback if absent |
|---|---|
| foundry:solution-architect | general-purpose (methodology review quality reduced; ⚠ general-purpose agent may not emit methodology_rating in required format, so verdict defaults to NEEDS-REVISION) |
| research:scientist | general-purpose (scientific rigor review quality reduced; ⚠ general-purpose agent may not emit scientific_rating, so verdict defaults to NEEDS-REVISION) |
Judge Mode (Steps J1–J6)
Triggered by judge or judge <file.md>.
Task tracking: create tasks for J1, J2, J3, J4, J5 (includes J5a + J5b sub-steps), J6 at start — before any tool calls.
Step J1: Locate and parse program.md
Flag parsing (first action):
SKIP_VALIDATION=false
[[ "$ARGUMENTS" == *"--skip-validation"* ]] && SKIP_VALIDATION=true
ARGUMENTS="${ARGUMENTS/--skip-validation/}" # strip flag from args
ARGUMENTS="${ARGUMENTS#"${ARGUMENTS%%[![:space:]]*}"}" # trim leading whitespace
Unsupported flag check: after extracting supported flags, scan $ARGUMENTS for remaining --<token> tokens. If found: print `! Unknown flag(s): --<token>. Supported: --skip-validation.` then invoke `AskUserQuestion`: (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.
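The flag handling above can be sketched in Python for readers who want to test the logic outside the shell. This is a minimal illustration, not the skill's implementation: `parse_judge_args` and `SUPPORTED_FLAGS` are hypothetical names, and the sketch matches whole whitespace-separated tokens, whereas the shell snippet uses a substring match.

```python
# Hypothetical helper: mirrors the flag-parsing and unknown-flag scan described above.
SUPPORTED_FLAGS = {"--skip-validation"}


def parse_judge_args(arguments: str):
    """Return (skip_validation, remaining_args, unknown_flags)."""
    tokens = arguments.split()  # split() also trims leading/trailing whitespace
    skip_validation = "--skip-validation" in tokens
    # Any remaining --<token> that is not supported is reported to the user.
    unknown = [t for t in tokens if t.startswith("--") and t not in SUPPORTED_FLAGS]
    # Everything that is not a flag is kept as the positional argument string.
    remaining = " ".join(t for t in tokens if not t.startswith("--"))
    return skip_validation, remaining, unknown
```

A non-empty `unknown` list is the point at which the Abort / Continue question would be asked.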
Input resolution (priority order):
- Explicit argument: `/research:judge path/to/plan.md`
- Auto-detect: `program.md` at project root
- Latest state: scan `.experiments/state/*/state.json` for most recent with `status: running` and non-null `program_file` field
- If nothing found: stop with error: `No program.md found. Run /research:plan <goal> first, or provide a path: /research:judge <path.md>`
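The priority order above can be sketched as a single lookup function. This is a sketch under stated assumptions: `resolve_program_file` is a hypothetical name, and "most recent" is interpreted as newest file modification time, which the source does not specify.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_program_file(argument: str, root: Path) -> Optional[Path]:
    """Return the program file to judge, or None if nothing is found."""
    # 1. An explicit path argument always wins.
    if argument.strip():
        return Path(argument.strip())
    # 2. Auto-detect program.md at the project root.
    candidate = root / "program.md"
    if candidate.is_file():
        return candidate
    # 3. Newest state.json (by mtime, an assumption) that is running
    #    and has a non-null program_file.
    states = sorted(
        root.glob(".experiments/state/*/state.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for state_path in states:
        try:
            state = json.loads(state_path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable state files are skipped, not fatal
        if (
            isinstance(state, dict)
            and state.get("status") == "running"
            and state.get("program_file")
        ):
            return Path(state["program_file"])
    return None
```

A `None` result corresponds to the "No program.md found" error path.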
Parsing: find ## <Section> headings in program.md, extract the first fenced code block per section, parse it as key: value lines, and warn on unrecognized keys. --skip-validation and colab_hw are judge-specific and are extracted independently.
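A minimal sketch of the parsing rule, assuming level-2 headings start at column 0 and that the caller supplies the set of recognized keys. `parse_program` and `known_keys` are hypothetical names; the fence marker is built indirectly only so the example nests safely inside markdown.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to avoid closing this example's fence


def parse_program(text: str, known_keys: set):
    """Return ({section: {key: value}}, warnings) for a program.md string."""
    sections, warnings = {}, []
    # re.split with a capture group yields [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"^## +(.+?)[ \t]*$", text, flags=re.MULTILINE)
    block_re = re.compile(
        re.escape(FENCE) + r"[^\n]*\n(.*?)^" + re.escape(FENCE),
        flags=re.DOTALL | re.MULTILINE,
    )
    for name, body in zip(parts[1::2], parts[2::2]):
        fields = {}
        block = block_re.search(body)  # first fenced block only
        if block:
            for line in block.group(1).splitlines():
                if ":" not in line:
                    continue
                key, _, value = line.partition(":")  # split on the first colon
                key = key.strip()
                fields[key] = value.strip()
                if key not in known_keys:
                    warnings.append(f"{name}: unrecognized key '{key}'")
        sections[name] = fields
    return sections, warnings
```

Sections without a fenced block, such as free-text notes, come back with an empty field map rather than being dropped.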
Placeholder substitution — after parsing, apply same substitution as R1: resolve all {field_name} tokens in metric_cmd and guard_cmd using ## Config fields, fallback to declared default. No clarification_prompt in judge — skip clarification-override step.
Extract <program_title> from # Program: <title> line for reports (fallback # Campaign: <title> for legacy files).
Step J2: Completeness audit
Check 12 items. Produce findings list with severity. Each finding has: id, check, status (pass/fail/warn), severity, detail.
| ID | Check | Severity if failing | Description |
|---|---|---|---|
| C1 | ## Goal present and non-empty | critical | Campaign cannot run without a goal |
| C2 | ## Metric has command field | critical | No metric = no feedback loop |
| C3 | ## Metric has direction field (higher/lower) | critical | Cannot decide keep/revert without direction |
| C4 | ## Guard has command field | critical | Without guard, regressions go undetected. Note: a command field containing only echo 0, true, or exit 0 is equivalent to no guard (always exits 0 regardless of test state) — flag as critical with detail "guard command is a no-op; add real regression detection". |
| C5 | scope_files present in ## Config | high | Without scope, ideation agent modifies arbitrary files |
| C6 | Each scope_files path exists on disk (glob match) | high | Non-matching patterns = ideation agent has nothing to work with. If filesystem unavailable, flag warn unless path name signals non-existence (e.g., nonexistent, placeholder, todo, legacy_v1, deprecated, old, removed). |
| C7 | target set in ## Metric | medium | Without target, campaign runs to max_iterations — may waste compute |
| C8 | max_iterations in bounds (1–50) | medium | Missing defaults to 20 (acceptable); >50 violates SKILL.md constants. Additionally: if value is within bounds but >20 AND combined with risk factors (C4 fails / guard empty, OR C6 fails / scope non-existent), add a separate low finding: "max_iterations=N is elevated; with no functioning guard/scope, runaway iterations amplify risk — consider reducing to ≤15 until guard/scope is fixed" |
| C9 | agent_strategy is valid (auto/perf/code/ml/arch) | medium | Invalid value silently falls back to auto |
| C10 | compute is valid (local/colab/docker) | low | Invalid defaults to local |
| C11 | colab_hw valid (if present) | low | colab_hw absent OR is one of H100, L4, T4, A100, V100, A10G, TPUv2, TPUv3, TPUv4 — fail detail: "colab_hw '<value>' is not in known set {H100, L4, T4, A100, V100, A10G, TPUv2, TPUv3, TPUv4} — may cause GPU identity check failure in run mode". Note: this check is a minimum-capability floor — new Colab hardware tiers may exist beyond this list; unknown values are flagged for user verification, not blocked. |
| C12 | ## Notes section present | low | Notes optional but improve ideation quality |
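Two rows of the table carry conditional logic that is easy to get wrong: the no-op guard rule in C4 and the elevated-iterations rule in C8. The sketch below is illustrative only; `make_finding`, `check_c4`, and `check_c8` are hypothetical names, and treating a non-integer max_iterations as a medium failure is an assumption the table does not state.

```python
NOOP_GUARDS = {"echo 0", "true", "exit 0"}  # commands the table treats as "no guard"


def make_finding(id_, check, status, severity, detail):
    return {"id": id_, "check": check, "status": status,
            "severity": severity, "detail": detail}


def check_c4(guard: dict) -> dict:
    # Collapse whitespace so "exit  0" and "exit 0" compare equal.
    command = " ".join((guard.get("command") or "").split())
    if not command:
        return make_finding("C4", "guard command", "fail", "critical",
                            "## Guard has no command field")
    if command in NOOP_GUARDS:
        return make_finding("C4", "guard command", "fail", "critical",
                            "guard command is a no-op; add real regression detection")
    return make_finding("C4", "guard command", "pass", "critical", "")


def check_c8(config: dict, guard_ok: bool, scope_ok: bool) -> list:
    """Return non-passing C8 findings (empty list means pass)."""
    raw = config.get("max_iterations")
    try:
        value = 20 if raw is None else int(raw)  # missing defaults to 20
    except (TypeError, ValueError):
        # Assumption: the source does not say how non-integers are handled.
        return [make_finding("C8", "max_iterations", "fail", "medium",
                             f"max_iterations '{raw}' is not an integer")]
    if not 1 <= value <= 50:
        return [make_finding("C8", "max_iterations", "fail", "medium",
                             f"max_iterations={value} is outside 1-50")]
    if value > 20 and not (guard_ok and scope_ok):
        return [make_finding("C8", "max_iterations", "warn", "low",
                             f"max_iterations={value} is elevated; consider "
                             "reducing to <=15 until guard/scope is fixed")]
    return []
```

Here `guard_ok` and `scope_ok` stand for the outcomes of C4 and C6, so C8 would be evaluated after both.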
Scope adequacy sub-rule (C6b) — after C6 passes, assess whether scope_files is sufficient for the stated goal. If the goal type implies known bottleneck locations outside the declared scope, add a medium finding:
- Test-speed goal + scope limited to `tests/` only → flag: "conftest.py, fixtures, and test infrastructure outside tests/ are common levers for test runtime; scope may be too narrow"
- Throughput/latency goal + scope limited to single-layer path (e.g., `src/serving/`) → flag: "serving bottlenecks often span middleware, connection pooling, or database layers outside declared scope"
- Any goal where the stated scope excludes a widely-known dependency class → emit medium finding with location `## Config / scope_files`, suggested broader pattern as fix
This is distinct from C6 (path existence) — C6b fires even when the path exists but is likely insufficient.
Severity summary: count findings per severity. Any critical finding = verdict cannot be APPROVED. Enumeration rule: check ALL 12 items before stopping — do not short-circuit after finding the first critical issue. A program.md can have multiple independent flaws across different severity levels; the Required Changes section must list all of them, not just the verdict-determining one.
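The severity summary and the approval ceiling can be sketched as follows. The `Finding` fields follow the list given at the start of Step J2; `summarize` is a hypothetical name, and the sketch decides only whether APPROVED is still possible, not the final verdict.

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("critical", "high", "medium", "low")


@dataclass
class Finding:
    id: str
    check: str
    status: str    # pass / fail / warn
    severity: str
    detail: str


def summarize(findings):
    """Return (counts per severity, whether APPROVED is still possible)."""
    # Passing checks carry a severity too, so they are excluded from the counts.
    counts = Counter(f.severity for f in findings if f.status != "pass")
    summary = {s: counts.get(s, 0) for s in SEVERITIES}
    return summary, summary["critical"] == 0
```

Because the function takes the full findings list, it fits the enumeration rule: all 12 checks run first, and the summary is computed once at the end.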
Placeholder token check (C2, C4 sub-rule) — after confirming command present in ## Metric (C2) and ## Guard (C4), scan each command for {...} tokens. Verify each token's field name exists in ## Config. Token with no matching field = unresolvable — add high finding. Don't flag {field_name} tokens as malformed; valid when resolvable.
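A minimal sketch of the token scan, assuming field names are identifier-like and that shell expansions such as `${VAR}` should not count as placeholders. Both the token grammar and the name `unresolved_tokens` are assumptions made for illustration.

```python
import re

# {field_name} tokens; the lookbehind skips shell-style ${VAR} (an assumption).
TOKEN = re.compile(r"(?<!\$)\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unresolved_tokens(command: str, config: dict) -> list:
    """Return token names in command that have no matching ## Config field."""
    return [name for name in TOKEN.findall(command) if name not in config]
```

Each name returned would become one high finding; tokens that resolve produce no finding at all.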
Goodhart's Law check (C2b) — after confirming metric command present (C2 passes), assess whether the command operationalizes the stated ## Goal or measures a proxy. If the metric could improve while the actual goal is NOT achieved, add a critical finding:
- metric measures test pass rate but goal is latency reduction → critical: "metric is a correctness proxy, not a latency measure"
- metric measures lint error count but goal is bug density reduction → critical: "pylint score is a gameable proxy;