Validate agents and skills by measuring outputs against synthetic problems with defined ground truth. Primary signal: calibration bias — gap between self-reported confidence and actual recall. Well-calibrated agent reports 0.9 when it finds ~90% of issues. Miscalibrated: reports 0.9, finds 60%.

Calibration data drives improvement loop: systematic gaps → instruction updates; persistent overconfidence → adjusted re-run thresholds in MEMORY.md.

NOT for: static routing overlap analysis (use /foundry:audit); manually reviewing skill output quality (use /develop:review (requires develop plugin)).

</objective> <inputs>

$ARGUMENTS: parse --flags first, then resolve remaining tokens as scope targets

Flags (order independent):
- --fast — 3 problems per target (default when neither pace flag passed)
- --full — 10 problems per target; mutually exclusive with --fast
- --ab-test — also run general-purpose baseline and report delta metrics; requires benchmark (default --fast if no pace flag); mutually exclusive with --apply
- --apply — apply proposals: with --fast/--full: run benchmark then immediately apply; without pace flag: skip benchmark, apply proposals from most recent past run; mutually exclusive with --ab-test
- --skip-gate — suppress follow-up gate; for programmatic callers
- --local — resolve target agent/skill files from source tree (plugins/*/) instead of installed plugin cache; for plugin-dev workflows where local edits aren't yet installed; sets LOCAL_MODE=true in all pipeline spawns
Mutual exclusion validation (check before any work):
- --ab-test + --apply together → hard error: "--ab-test and --apply are mutually exclusive. Pass one or neither."
- --fast + --full together → hard error: "Pass --fast or --full, not both."
- --ab-test without pace flag → default --fast silently (no error)
Unsupported flag check — after all supported flags extracted (--fast, --full, --ab-test, --apply, --skip-gate, --local), scan $ARGUMENTS for remaining --<token> tokens. If found: print ! Unknown flag(s): \--<token>`. Supported: `--fast`, `--full`, `--ab-test`, `--apply`, `--skip-gate`, `--local`.then invokeAskUserQuestion` — (a) Abort (stop, re-invoke with correct flags) · (b) Continue ignoring (skip unknown flags, proceed). On Abort: stop.

Legacy positional tokens (ab, apply, fast, full) — hard error: print migration hint and stop. Example: "ab removed — use --ab-test flag: /calibrate curator --ab-test."

Scope tokens (positional, space-separated — defaults to all):
- all — all agents + relevant skills + routing + communication + all rules
- agents — all agents only (full agent list in modes/agents.md)
- skills — calibratable skills only (/audit and others per modes/skills.md; /oss:review (requires oss plugin) excluded — requires live GitHub PR)
- routing — routing accuracy test: measures how accurately general-purpose orchestrator selects correct subagent_type for synthetic task prompts (not per-agent quality benchmark; included in all)
- communication — handover + team protocol compliance: runs foundry:curator against synthetic agent responses and team transcripts with injected protocol violations (missing JSON envelope, missing summary, AgentSpeak v2 breaches); included in all
- rules — rule adherence test: for each global rule file (no paths:) and each path-scoped rule when matching file is in context, generates synthetic tasks that should trigger rule's key directives, measures whether general-purpose agent with rule loaded correctly applies them; reports rules that are ignored, misapplied, or redundant; included in all
- plugins — all agents + calibratable skills from all plugins/*/ directories (union of all plugin-namespaced agents and calibratable skills)
- <plugin-name> — tier 2: bare plugin directory name (e.g. oss, foundry, research, develop) auto-resolved when token matches plugins/<name>/ directory; calibrates all agents + calibratable skills in that plugin
- <agent-name> — tier 3: single agent (e.g., foundry:sw-engineer); also accepts bare name (e.g. sw-engineer) and resolves via plugins/*/agents/<name>.md
- /foundry:audit — single skill (pass any calibratable skill name; /oss:review (requires oss plugin) accepted but excluded per modes/skills.md)
- Multiple scope tokens — space-separated; calibrates union of resolved targets: oss research, agents skills, curator shepherd; each token resolved through same tier hierarchy as /audit scope tokens (reserved keywords first, then plugin-dir lookup, then agent/skill file search)
Every invocation surfaces report: benchmark runs print new results; --apply without pace flag prints saved report from last run before applying.

</inputs> <constants>

FAST_N: 3 problems per target
FULL_N: 10 problems per target
RECALL_THRESHOLD: 0.70 (below → agent needs instruction improvement)
CALIBRATION_BORDERLINE: ±0.10 (|bias| within this → calibrated; between 0.10 and 0.15 → borderline)
CALIBRATION_WARN: ±0.15 (bias beyond this → confidence decoupled from quality)
CALIBRATE_LOG: .claude/logs/calibrations.jsonl
AB_ADVANTAGE_THRESHOLD: 0.10 (delta recall or F1 above this → meaningful advantage; below → marginal or none)
PHASE_TIMEOUT_MIN: 5 (per-phase budget — if spawned subagents haven't all returned, collect partial results and continue)
PIPELINE_TIMEOUT_MIN: 10 (hard cutoff — pipeline not notified within 10 min of launch is timed out; extendable if agent explains delay) # tighter than global 15-min cutoff from CLAUDE.md §6 — intentional for calibrate
HEALTH_CHECK_INTERVAL_MIN: 5 (orchestrator polls each running pipeline every 5 min for liveness) # = global default (CLAUDE.md §6)
EXTENSION_MIN: 5 # = global default (CLAUDE.md §6)
PIPELINE_BATCH_SIZE: 5 (max agent/skill pipeline subagents spawned concurrently within one mode — prevents agent count explosion on all; batch: spawn ≤5, wait for all results, then spawn next batch)
ROUTING_ACCURACY_THRESHOLD: 0.90 (below → agent descriptions need improvement) # keep in sync with modes/routing.md
ROUTING_HARD_THRESHOLD: 0.80 (below → high-overlap pair descriptions need disambiguation)

PROBLEM_SET_VERSION: 1.0
CODEX_PROBLEM_RATIO: 0.6 (fraction of in-scope problems generated by Codex — agents/skills modes only)
CODEX_SCORER_WEIGHT: 0.49 (Codex scorer weight; Claude = 0.51 — Claude has last word on disagreements)
SCORER_AGREEMENT_WARN: 0.70 (scorer agreement below this → flag ambiguous ground truth ⚠)
CODEX_MODES: ["agents", "skills"] (modes where Codex is active; routing/communication/rules excluded — test Claude-specific internals)
PIPELINE_TIMEOUT_MIN_DUAL: 15 (hard cutoff when Codex active — replaces PIPELINE_TIMEOUT_MIN=10 for dual-source runs)

Domain tables per mode: see modes/agents.md, modes/skills.md, modes/routing.md, modes/communication.md, modes/rules.md.

</constants> <workflow>

Task hygiene:

# audit-skip: resilience-replication — _FS resolver intentionally duplicated across foundry skills (cannot self-locate)
_FS=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/foundry}/bin/resolve_shared_path.py" foundry skills/_shared 2>/dev/null || echo "plugins/foundry/skills/_shared")  # timeout: 5000

Read $_FS/task-hygiene.md — follow task hygiene protocol.

Task tracking: create tasks at start of execution (Step 1) for

calibrate

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

claude-api

skill-creator

oh-my-issues

claude-mem

Recibe nuevas skills de Desenvolvimento todos los lunes

Comentarios · Sin comentarios