Research-supervisor review of program.md — validates experimental methodology, emits APPROVED / NEEDS-REVISION / BLOCKED verdict before expensive run loop. Read-only; never modifies code or state.

NOT for: running experiments (use /research:run); designing hypotheses (use research:scientist agent); config quality (use /foundry:audit; requires foundry plugin).

</objective> <workflow>

Agent Resolution

_RESEARCH_SHARED=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/resolve_shared.py" 2>/dev/null)  # timeout: 5000

Read $_RESEARCH_SHARED/agent-resolution.md. Contains foundry check + fallback table. If foundry not installed: use table to substitute each foundry:X with general-purpose. Agents: foundry:solution-architect, research:scientist.

Agent	Fallback if absent
`foundry:solution-architect`	`general-purpose` (methodology review quality reduced — ⚠ general-purpose agent may not emit `methodology_rating` in required format; verdict defaults to NEEDS-REVISION)
`research:scientist`	`general-purpose` (scientific rigor review quality reduced — ⚠ general-purpose agent may not emit `scientific_rating`; verdict defaults to NEEDS-REVISION)
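
The fallback table above can be read as a simple lookup. The sketch below is illustrative only: `resolve_agents` and the `available` set are hypothetical names, and how the plugin actually detects installed agents is defined in agent-resolution.md, not here.

```python
# Hypothetical helper; not part of the plugin.
FALLBACK_AGENT = "general-purpose"

def resolve_agents(requested, available):
    """Map each requested agent to itself, or to the fallback if absent."""
    return {
        agent: agent if agent in available else FALLBACK_AGENT
        for agent in requested
    }

# Example: foundry is not installed, research:scientist is.
agents = resolve_agents(
    ["foundry:solution-architect", "research:scientist"],
    available={"research:scientist"},
)
```

Any agent that resolves to `general-purpose` should be treated as degraded: per the table, a missing rating in its output defaults the verdict to NEEDS-REVISION.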

Judge Mode (Steps J1–J6)

Triggered by judge or judge <file.md>.

Task tracking: create tasks for J1, J2, J3, J4, J5 (includes J5a + J5b sub-steps), J6 at start — before any tool calls.

Step J1: Locate and parse program.md

Flag parsing (first action):

SKIP_VALIDATION=false
[[ "$ARGUMENTS" == *"--skip-validation"* ]] && SKIP_VALIDATION=true
ARGUMENTS="${ARGUMENTS/--skip-validation/}"  # strip flag from args
ARGUMENTS="${ARGUMENTS#"${ARGUMENTS%%[![:space:]]*}"}"  # trim leading whitespace

Unsupported flag check: after extracting supported flags, scan $ARGUMENTS for remaining --<token> tokens. If found: print `! Unknown flag(s): --<token>. Supported: --skip-validation.` then invoke `AskUserQuestion`: (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.
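
For readers who prefer to see the flag handling in one place, here is a minimal Python sketch of the same logic. It is an assumption-laden illustration: `parse_flags` is a hypothetical name, and unlike the shell snippet (which matches `--skip-validation` as a substring) it compares whole whitespace-separated tokens.

```python
SUPPORTED_FLAGS = {"--skip-validation"}

def parse_flags(arguments):
    """Return (skip_validation, remaining_args, unknown_flags)."""
    tokens = arguments.split()
    skip_validation = "--skip-validation" in tokens
    # Strip supported flags; what is left is the path plus anything unknown.
    remaining = [t for t in tokens if t not in SUPPORTED_FLAGS]
    # Any leftover token shaped like --<token> is an unknown flag.
    unknown = [t for t in remaining if t.startswith("--") and len(t) > 2]
    return skip_validation, " ".join(remaining), unknown

skip, rest, unknown = parse_flags("--skip-validation plans/a.md --fast")
```

A non-empty `unknown` list is the trigger for the Abort / Continue prompt described above.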

Input resolution (priority order):

1. Explicit argument: /research:judge path/to/plan.md
2. Auto-detect: program.md at project root
3. Latest state: scan .experiments/state/*/state.json for most recent with status: running and non-null program_file field

If nothing found: stop with error:

No program.md found. Run /research:plan <goal> first, or provide a path: /research:judge <path.md>
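
The priority order above might be sketched as follows. This is not the plugin's implementation: `resolve_program` is a hypothetical name, and "most recent" is assumed here to mean the newest modification time of `state.json`, which the source does not specify.

```python
import json
from pathlib import Path

def resolve_program(explicit, root):
    """Return the program file path, or None if nothing is found."""
    root = Path(root)
    if explicit:                                   # 1. explicit argument
        return Path(explicit)
    default = root / "program.md"
    if default.is_file():                          # 2. auto-detect at project root
        return default
    candidates = []                                # 3. latest running state
    for state_path in root.glob(".experiments/state/*/state.json"):
        try:
            state = json.loads(state_path.read_text())
        except (OSError, json.JSONDecodeError):
            continue                               # unreadable state: skip it
        if not isinstance(state, dict):
            continue
        if state.get("status") == "running" and state.get("program_file"):
            # Assumption: recency is judged by state.json mtime.
            candidates.append((state_path.stat().st_mtime, str(state["program_file"])))
    if candidates:
        return Path(max(candidates)[1])
    return None
```

A `None` result corresponds to the "No program.md found" error above.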

Parsing: find ## <Section> headings in program.md, extract the first fenced code block per section, parse it as key: value lines, and warn on unrecognized keys. --skip-validation and colab_hw are judge-specific and are extracted independently.
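
A minimal sketch of that parsing pass, assuming a line-oriented scan is good enough. `parse_program` and `known_keys` are hypothetical names; the real set of recognized keys lives in the skill's own constants.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

def parse_program(text, known_keys=()):
    """Return ({section: {key: value}}, warnings) from program.md text."""
    sections, warnings = {}, []
    current = None       # section currently being scanned
    state = "outside"    # "outside", "capture" (first block), or "skip"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if state != "outside":
                state = "outside"                 # closing fence
            elif current is not None and current not in sections:
                sections[current] = {}
                state = "capture"                 # first block of this section
            else:
                state = "skip"                    # later blocks are ignored
            continue
        if state == "outside":
            heading = re.match(r"##\s+(.+)", line)
            if heading:
                current = heading.group(1).strip()
            continue
        if state == "capture" and ":" in stripped:
            key, value = stripped.split(":", 1)   # split once: values may contain ':'
            key = key.strip()
            sections[current][key] = value.strip()
            if known_keys and key not in known_keys:
                warnings.append(f"unrecognized key '{key}' in ## {current}")
    return sections, warnings
```

Sections that have a heading but no fenced block (for example a prose-only section) do not appear in the returned mapping, so presence checks for those need the raw text.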

Placeholder substitution — after parsing, apply same substitution as R1: resolve all {field_name} tokens in metric_cmd and guard_cmd using ## Config fields, fallback to declared default. No clarification_prompt in judge — skip clarification-override step.
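
As an illustration of the substitution rule, the sketch below resolves `{field_name}` tokens from a config mapping and then from a defaults mapping. `substitute` is a hypothetical helper, and where the "declared default" comes from is an assumption: it is simply passed in here.

```python
import re

TOKEN = re.compile(r"\{(\w+)\}")

def substitute(command, config, defaults=None):
    """Resolve {field_name} tokens: config first, then declared defaults."""
    defaults = defaults or {}

    def replace(match):
        name = match.group(1)
        if name in config:
            return str(config[name])
        if name in defaults:
            return str(defaults[name])
        return match.group(0)  # leave unresolved tokens visible for the audit

    return TOKEN.sub(replace, command)

resolved = substitute(
    "pytest {test_dir} -n {workers}",
    config={"test_dir": "tests"},
    defaults={"workers": "4"},
)
```

Leaving unresolved tokens in place, rather than raising, keeps them available for the completeness audit in Step J2.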

Extract <program_title> from # Program: <title> line for reports (fallback # Campaign: <title> for legacy files).

Step J2: Completeness audit

Check 12 items. Produce findings list with severity. Each finding has: id, check, status (pass/fail/warn), severity, detail.

ID	Check	Severity if failing	Description
C1	`## Goal` present and non-empty	critical	Campaign cannot run without a goal
C2	`## Metric` has `command` field	critical	No metric = no feedback loop
C3	`## Metric` has `direction` field (higher/lower)	critical	Cannot decide keep/revert without direction
C4	`## Guard` has `command` field	critical	Without guard, regressions go undetected. Note: a command field containing only `echo 0`, `true`, or `exit 0` is equivalent to no guard (always exits 0 regardless of test state) — flag as critical with detail "guard command is a no-op; add real regression detection".
C5	`scope_files` present in `## Config`	high	Without scope, ideation agent modifies arbitrary files
C6	Each `scope_files` path exists on disk (glob match)	high	Non-matching patterns = ideation agent has nothing to work with. If filesystem unavailable, flag `warn` unless path name signals non-existence (e.g., `nonexistent`, `placeholder`, `todo`, `legacy_v1`, `deprecated`, `old`, `removed`).
C7	`target` set in `## Metric`	medium	Without target, campaign runs to max_iterations — may waste compute
C8	`max_iterations` in bounds (1–50)	medium	Missing defaults to 20 (acceptable); >50 violates SKILL.md constants. Additionally: if value is within bounds but >20 AND combined with risk factors (C4 fails / guard empty, OR C6 fails / scope non-existent), add a separate `low` finding: "max_iterations=N is elevated; with no functioning guard/scope, runaway iterations amplify risk — consider reducing to ≤15 until guard/scope is fixed"
C9	`agent_strategy` is valid (`auto`/`perf`/`code`/`ml`/`arch`)	medium	Invalid value silently falls back to `auto`
C10	`compute` is valid (`local`/`colab`/`docker`)	low	Invalid defaults to `local`
C11	`colab_hw` valid (if present)	low	`colab_hw` absent OR is one of `H100, L4, T4, A100, V100, A10G, TPUv2, TPUv3, TPUv4` — fail detail: `"colab_hw '<value>' is not in known set {H100, L4, T4, A100, V100, A10G, TPUv2, TPUv3, TPUv4} — may cause GPU identity check failure in run mode"`. Note: this check is a minimum-capability floor — new Colab hardware tiers may exist beyond this list; unknown values are flagged for user verification, not blocked.
C12	`## Notes` section present	low	Notes optional but improve ideation quality
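
To make the finding shape concrete, here is a sketch of two of the checks (C4 and C8) returning records with the fields listed above. The `Finding` class and function names are hypothetical, and the `severity` field is assumed to hold the severity-if-failing from the table even when the check passes.

```python
from dataclasses import dataclass

NOOP_GUARDS = {"echo 0", "true", "exit 0"}

@dataclass
class Finding:
    id: str
    check: str
    status: str      # "pass", "fail", or "warn"
    severity: str    # severity that applies if the check fails
    detail: str = ""

def check_guard(guard):
    """C4: guard command must exist and must not be a no-op."""
    name = "## Guard has command field"
    command = str(guard.get("command") or "").strip()
    if not command:
        return Finding("C4", name, "fail", "critical", "guard command missing")
    if command in NOOP_GUARDS:
        return Finding("C4", name, "fail", "critical",
                       "guard command is a no-op; add real regression detection")
    return Finding("C4", name, "pass", "critical")

def check_max_iterations(config):
    """C8: max_iterations must be within 1..50; missing defaults to 20."""
    name = "max_iterations in bounds (1-50)"
    raw = config.get("max_iterations")
    if raw is None:
        return Finding("C8", name, "pass", "medium", "missing; defaults to 20")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return Finding("C8", name, "fail", "medium", f"not an integer: {raw!r}")
    status = "pass" if 1 <= value <= 50 else "fail"
    return Finding("C8", name, status, "medium", f"max_iterations={value}")
```

The additional `low` finding for elevated `max_iterations` combined with a failing guard or scope is omitted from this sketch.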

Scope adequacy sub-rule (C6b) — after C6 passes, assess whether scope_files is sufficient for the stated goal. If the goal type implies known bottleneck locations outside the declared scope, add a medium finding:

Test-speed goal + scope limited to tests/ only → flag: "conftest.py, fixtures, and test infrastructure outside tests/ are common levers for test runtime; scope may be too narrow"
Throughput/latency goal + scope limited to single-layer path (e.g., src/serving/) → flag: "serving bottlenecks often span middleware, connection pooling, or database layers outside declared scope"
Any goal where the stated scope excludes a widely-known dependency class → emit medium finding with location ## Config / scope_files, suggested broader pattern as fix

This is distinct from C6 (path existence) — C6b fires even when the path exists but is likely insufficient.

Severity summary: count findings per severity. Any critical finding = verdict cannot be APPROVED. Enumeration rule: check ALL 12 items before stopping — do not short-circuit after finding the first critical issue. A program.md can have multiple independent flaws across different severity levels; the Required Changes section must list all of them, not just the verdict-determining one.
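
The severity summary and the one gating rule stated above can be sketched like this. `summarize` is a hypothetical name; counting every non-pass finding (both fail and warn) is an assumption, and the full verdict mapping is deliberately not modeled because it is defined in later steps.

```python
from collections import Counter

def summarize(findings):
    """Count non-pass findings per severity and apply the critical gate."""
    counts = Counter(
        f["severity"] for f in findings if f["status"] != "pass"
    )
    # Any critical finding means the verdict cannot be APPROVED.
    can_approve = counts["critical"] == 0
    return dict(counts), can_approve

counts, can_approve = summarize([
    {"id": "C4", "status": "fail", "severity": "critical"},
    {"id": "C7", "status": "fail", "severity": "medium"},
    {"id": "C12", "status": "pass", "severity": "low"},
])
```

Because the function takes the complete findings list, it naturally enforces the enumeration rule: all 12 checks run first, and the gate is applied only afterwards.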

Placeholder token check (C2, C4 sub-rule) — after confirming command present in ## Metric (C2) and ## Guard (C4), scan each command for {...} tokens. Verify each token's field name exists in ## Config. Token with no matching field = unresolvable — add high finding. Don't flag {field_name} tokens as malformed; valid when resolvable.
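
A possible shape for the token check, assuming the parsed sections are available as nested mappings. `token_findings` and the finding dictionary layout are hypothetical; only the rule itself (unmatched token means a high finding) comes from the text above.

```python
import re

def token_findings(sections):
    """Emit a high finding for each {token} with no matching ## Config field."""
    config = sections.get("Config", {})
    findings = []
    for section, check_id in (("Metric", "C2"), ("Guard", "C4")):
        command = sections.get(section, {}).get("command", "")
        for name in re.findall(r"\{(\w+)\}", command):
            if name not in config:
                findings.append({
                    "id": check_id,
                    "status": "fail",
                    "severity": "high",
                    "detail": f"unresolvable token {{{name}}} in ## {section} command",
                })
    return findings

found = token_findings({
    "Config": {"test_dir": "tests"},
    "Metric": {"command": "pytest {test_dir} -k {pattern}"},
    "Guard": {"command": "pytest {test_dir}"},
})
```

Tokens that do resolve produce no finding at all, matching the rule that resolvable `{field_name}` tokens are valid rather than malformed.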

Goodhart's Law check (C2b) — after confirming metric command present (C2 passes), assess whether the command operationalizes the stated ## Goal or measures a proxy. If the metric could improve while the actual goal is NOT achieved, add a critical finding:

metric measures test pass rate but goal is latency reduction → critical: "metric is a correctness proxy, not a latency measure"
metric measures lint error count but goal is bug density reduction → critical: "pylint score is a gameable proxy;
