Eval Runner
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
Commands
run <category/name>
- Read YAML from
.claude/evals/scenarios/<category>/<name>.yml - Parse fields (name, category, task_prompt, success_criteria, budget)
- Execute setup steps if defined
- Record start time
- Execute task via reflexion workflow (read corrections first)
- Record end time and iteration count
- Validate ALL success criteria
- Write result JSON to
.claude/evals/results/<timestamp>-<name>.json - Report summary
run-all [category]
- Glob
.claude/evals/scenarios/**/*.yml - Skip scenarios with
status: retired - For each: run in isolation (git stash), record result, restore
- Update
.claude/evals/pass-history.jsonwith each result - Aggregate and report
run-split <optimization|holdout>
- Glob
.claude/evals/scenarios/**/*.yml - Read each YAML, filter by
splitfield matching the requested set - Skip scenarios with
status: retired - For each matching scenario: run in isolation, record result, restore
- Update
.claude/evals/pass-history.jsonwith each result - Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")
report
- Read all results from
.claude/evals/results/ - Generate summary table:
| Category | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery | ... | ... | ... | |
| delivery | ... | ... | ... | |
| integration | ... | ... | ... | |
| **Overall** | ... | ... | ... | |
- List failure patterns and recommendations
prune
- Read
.claude/evals/pass-history.json - Flag evals where
last_5is all-pass (saturated) or all-fail (broken) - Flag evals with no runs in 30+ days (
stale) - For saturated evals, suggest: retire or increase difficulty
- For broken evals, suggest: fix criteria or retire
- Present recommendations — do NOT auto-retire
- On user confirmation: set
status: retiredin scenario YAML, update pass-history.json, log in .claude/harness/decision-log.md
mine
Analyze audit logs to propose new eval scenarios from observed failure patterns.
- Read
.claude/state/change-log.jsonl(last 100 entries) - Read
.claude/state/diamond-state-audit.jsonl(all entries) - Group change-log entries by
session_id, identify: a. Correction clusters: 3+ edits to same file in one session (agent struggled) b. Skill friction: edits to.claude/skills/*/SKILL.mdduring a session (instructions unclear) c. Missing test coverage: 5+ files changed with no test file edits - Count diamond bypass entries from audit log
- Cross-reference with existing scenarios in pass-history.json to avoid duplicates
- For each pattern, output a proposed eval scenario as YAML template
- Tag proposals with
source: trace-miningand originatingsession_id - Do NOT auto-create files — present for human review
See .claude/evals/schema.md §Trace Mining Heuristics for pattern-to-eval mappings.
Result Format
See .claude/evals/schema.md for YAML scenario and JSON result formats.
Pass History
After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:
- Increment
runsandpasses(if passed) for the eval - Append
true/falsetolast_5(trim to keep only last 5) - Update
last_runtimestamp
Creating Scenarios
Place YAML files in .claude/evals/scenarios/<category>/. Define task_prompt, success_criteria, and budget. Set split, status, and source fields per schema.