Validate agents and skills by measuring outputs against synthetic problems with defined ground truth. Primary signal: calibration bias — gap between self-reported confidence and actual recall. Well-calibrated agent reports 0.9 when it finds ~90% of issues. Miscalibrated: reports 0.9, finds 60%.
Calibration data drives improvement loop: systematic gaps → instruction updates; persistent overconfidence → adjusted re-run thresholds in MEMORY.md.
NOT for: static routing overlap analysis (use /foundry:audit); manually reviewing skill output quality (use /develop:review (requires develop plugin)).
-
$ARGUMENTS: parse
--flagsfirst, then resolve remaining tokens as scope targetsFlags (order independent):
--fast— 3 problems per target (default when neither pace flag passed)--full— 10 problems per target; mutually exclusive with--fast--ab-test— also rungeneral-purposebaseline and report delta metrics; requires benchmark (default--fastif no pace flag); mutually exclusive with--apply--apply— apply proposals: with--fast/--full: run benchmark then immediately apply; without pace flag: skip benchmark, apply proposals from most recent past run; mutually exclusive with--ab-test--skip-gate— suppress follow-up gate; for programmatic callers--local— resolve target agent/skill files from source tree (plugins/*/) instead of installed plugin cache; for plugin-dev workflows where local edits aren't yet installed; setsLOCAL_MODE=truein all pipeline spawns
Mutual exclusion validation (check before any work):
--ab-test+--applytogether → hard error: "--ab-testand--applyare mutually exclusive. Pass one or neither."--fast+--fulltogether → hard error: "Pass--fastor--full, not both."--ab-testwithout pace flag → default--fastsilently (no error)
Unsupported flag check — after all supported flags extracted (
--fast,--full,--ab-test,--apply,--skip-gate,--local), scan$ARGUMENTSfor remaining--<token>tokens. If found: print! Unknown flag(s): \--<token>`. Supported: `--fast`, `--full`, `--ab-test`, `--apply`, `--skip-gate`, `--local`.then invokeAskUserQuestion` — (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.Legacy positional tokens (
ab,apply,fast,full) — hard error: print migration hint and stop. Example: "abremoved — use--ab-testflag:/calibrate curator --ab-test."Scope tokens (positional, space-separated — defaults to
all):all— all agents + relevant skills + routing + communication + all rulesagents— all agents only (full agent list inmodes/agents.md)skills— calibratable skills only (/auditand others permodes/skills.md;/oss:review(requiresossplugin) excluded — requires live GitHub PR)routing— routing accuracy test: measures how accuratelygeneral-purposeorchestrator selects correctsubagent_typefor synthetic task prompts (not per-agent quality benchmark; included inall)communication— handover + team protocol compliance: runsfoundry:curatoragainst synthetic agent responses and team transcripts with injected protocol violations (missing JSON envelope, missingsummary, AgentSpeak v2 breaches); included inallrules— rule adherence test: for each global rule file (nopaths:) and each path-scoped rule when matching file is in context, generates synthetic tasks that should trigger rule's key directives, measures whethergeneral-purposeagent with rule loaded correctly applies them; reports rules that are ignored, misapplied, or redundant; included inallplugins— all agents + calibratable skills from allplugins/*/directories (union of all plugin-namespaced agents and calibratable skills)<plugin-name>— tier 2: bare plugin directory name (e.g.oss,foundry,research,develop) auto-resolved when token matchesplugins/<name>/directory; calibrates all agents + calibratable skills in that plugin<agent-name>— tier 3: single agent (e.g.,foundry:sw-engineer); also accepts bare name (e.g.sw-engineer) and resolves viaplugins/*/agents/<name>.md/foundry:audit— single skill (pass any calibratable skill name;/oss:review(requiresossplugin) accepted but excluded permodes/skills.md)- Multiple scope tokens — space-separated; calibrates union of resolved targets:
oss research,agents skills,curator shepherd; each token resolved through same tier hierarchy as/auditscope tokens (reserved keywords first, then plugin-dir lookup, then agent/skill file search)
Every invocation surfaces report: benchmark runs print new results;
--applywithout pace flag prints saved report from last run before applying.
- FAST_N: 3 problems per target
- FULL_N: 10 problems per target
- RECALL_THRESHOLD: 0.70 (below → agent needs instruction improvement)
- CALIBRATION_BORDERLINE: ±0.10 (|bias| within this → calibrated; between 0.10 and 0.15 → borderline)
- CALIBRATION_WARN: ±0.15 (bias beyond this → confidence decoupled from quality)
- CALIBRATE_LOG:
.claude/logs/calibrations.jsonl - AB_ADVANTAGE_THRESHOLD: 0.10 (delta recall or F1 above this → meaningful advantage; below → marginal or none)
- PHASE_TIMEOUT_MIN: 5 (per-phase budget — if spawned subagents haven't all returned, collect partial results and continue)
- PIPELINE_TIMEOUT_MIN: 10 (hard cutoff — pipeline not notified within 10 min of launch is timed out; extendable if agent explains delay) # tighter than global 15-min cutoff from CLAUDE.md §6 — intentional for calibrate
- HEALTH_CHECK_INTERVAL_MIN: 5 (orchestrator polls each running pipeline every 5 min for liveness) # = global default (CLAUDE.md §6)
- EXTENSION_MIN: 5 # = global default (CLAUDE.md §6)
- PIPELINE_BATCH_SIZE: 5 (max agent/skill pipeline subagents spawned concurrently within one mode — prevents agent count explosion on
all; batch: spawn ≤5, wait for all results, then spawn next batch) - ROUTING_ACCURACY_THRESHOLD: 0.90 (below → agent descriptions need improvement) # keep in sync with modes/routing.md
- ROUTING_HARD_THRESHOLD: 0.80 (below → high-overlap pair descriptions need disambiguation)
- PROBLEM_SET_VERSION: 1.0
- CODEX_PROBLEM_RATIO: 0.6 (fraction of in-scope problems generated by Codex — agents/skills modes only)
- CODEX_SCORER_WEIGHT: 0.49 (Codex scorer weight; Claude = 0.51 — Claude has last word on disagreements)
- SCORER_AGREEMENT_WARN: 0.70 (scorer agreement below this → flag ambiguous ground truth ⚠)
- CODEX_MODES: ["agents", "skills"] (modes where Codex is active; routing/communication/rules excluded — test Claude-specific internals)
- PIPELINE_TIMEOUT_MIN_DUAL: 15 (hard cutoff when Codex active — replaces PIPELINE_TIMEOUT_MIN=10 for dual-source runs)
Domain tables per mode: see modes/agents.md, modes/skills.md, modes/routing.md, modes/communication.md, modes/rules.md.
Task hygiene:
# audit-skip: resilience-replication — _FS resolver intentionally duplicated across foundry skills (cannot self-locate)
_FS=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/foundry}/bin/resolve_shared_path.py" foundry skills/_shared 2>/dev/null || echo "plugins/foundry/skills/_shared") # timeout: 5000
Read $_FS/task-hygiene.md — follow task hygiene protocol.
Task tracking: create tasks at start of execution (Step 1) for