Agent Eval Layer
Add a rubric-based evaluation layer to an existing agent project. Framework-agnostic — works with any agent.
What this skill produces
- Rubric (evals/rubrics/main.yaml): Scoring dimensions with concrete level descriptors and weights
- Test cases (evals/test_cases/seed.yaml): Input/expected-output pairs with difficulty tags (3+ with reference scores for leniency)
- Judge prompt (evals/prompts/judge.md): Structured prompt for LLM-as-a-judge with calibration examples and evidence/suggestion/confidence fields
- Eval harness (evals/eval_harness.py): Runs agent on test cases, sends to judge, aggregates scores + leniency
- Reports: Per-subject markdown + (when multi-subject) self-contained HTML dashboard with leaderboard, radar, and per-case heatmap
File Reference
| File | Purpose |
|---|---|
| references/rubric-design.md | Dimension catalog, scale guidance, leniency thresholds, anti-patterns |
| references/judge-prompts.md | Judge prompt template and calibration techniques |
| references/framework-adapters.md | Copy-paste recipes per framework (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents, raw Anthropic SDK) — structured output, token + tool-call extraction |
| references/structured-output-troubleshooting.md | The three Bedrock Opus structured-output errors and the two-stage pattern that fixes them |
| references/judge-robustness.md | JSON extraction helper, retry-once, defensive score aggregation — drop-in snippets |
| references/cross-subject-benchmarking.md | Multi-subject (model / framework / prompt) comparison flow |
| references/html-report-template.html | Self-contained Chart.js dashboard — leaderboard, radar, heatmap |
| references/interactive-calibration.md | Interactive human-grading dialog for collecting honest reference_scores — required for leniency to be meaningful |
Metadata contract
Every harness generated by this skill returns the same 7 fields per agent call. Name them exactly this way so cross-project harnesses stay compatible:
{
    "recommendation": <schema instance or None>,  # agent's structured output
    "latency_ms": int,                # wall-clock for the full agent call, including tool execution
    "tool_calls": int,                # count of tool invocations the agent initiated
    "input_tokens": int | None,       # summed across all turns of this call
    "output_tokens": int | None,      # summed across all turns
    "model_id": str,                  # the model the agent targeted
    "error": str | None,              # framework/provider error, or None on success
}
See references/framework-adapters.md for the exact path each framework exposes each field through.
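As a minimal sketch of the contract above, the wrapper below times a call and always returns all seven fields, even when the call raises. The names `AgentResult`, `run_with_contract`, and the `call_agent` callable (assumed to return a dict with `output`, `tool_calls`, `input_tokens`, and `output_tokens` keys) are hypothetical. Real adapters read these values from framework-specific paths.

```python
import time
from typing import Any, Callable, Optional, TypedDict


class AgentResult(TypedDict):
    recommendation: Any            # agent's structured output, or None on failure
    latency_ms: int                # wall-clock for the full call
    tool_calls: int
    input_tokens: Optional[int]
    output_tokens: Optional[int]
    model_id: str
    error: Optional[str]


def run_with_contract(call_agent: Callable[[str], dict], prompt: str, model_id: str) -> AgentResult:
    """Wrap a hypothetical adapter call so every run yields the same 7 fields."""
    start = time.perf_counter()
    try:
        raw = call_agent(prompt)
        error = None
    except Exception as exc:
        # Record framework/provider errors instead of aborting the whole eval run.
        raw, error = {}, f"{type(exc).__name__}: {exc}"
    latency_ms = int((time.perf_counter() - start) * 1000)
    return AgentResult(
        recommendation=raw.get("output"),
        latency_ms=latency_ms,
        tool_calls=int(raw.get("tool_calls", 0)),
        input_tokens=raw.get("input_tokens"),
        output_tokens=raw.get("output_tokens"),
        model_id=model_id,
        error=error,
    )
```

Because failures land in the `error` field, a single broken case still produces a row in the report.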
Step 0: Understand the agent
Read the agent's codebase and answer:
- What does the agent do? (one sentence)
- What are its inputs and outputs?
- What tools does it use?
- Which framework does it use? (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, raw SDK, custom)
- Which model + provider does it target? (Claude on Bedrock changes structured-output strategy — see step 2e)
- Is it multi-agent? If yes, what's the pipeline?
- What does "good" look like? Ask the user if unclear.
Share findings with the user before proceeding.
Are you benchmarking multiple subjects?
If the user says they want to compare multiple agents, models, frameworks, or prompts against the same task, skip ahead to references/cross-subject-benchmarking.md for the multi-subject flow. The single-subject steps below still apply — you'll just run them N times.
Step 1: Design the rubric
Pick 3-5 dimensions from the catalog in references/rubric-design.md. Common patterns by agent type:
| Agent Type | Typical Dimensions |
|---|---|
| RAG / Q&A | Correctness, Completeness, Faithfulness, Relevance |
| Task Automation | Task Completion, Efficiency, Error Handling, Safety |
| Content Generation | Correctness, Completeness, Tone, Engagement |
| Multi-Agent | Coordination, Final Output Quality, Pipeline Integrity |
Rules:
- 3-5 dimensions. More than five adds score noise.
- Odd scales (3 or 5 points). Use 3-point for roughly pass/fail judgments, 5-point for nuanced ones.
- Concrete level descriptors. "Good" is not a descriptor. "Correctly addresses the main question but misses important nuances" is.
- Weights sum to 1.0. Force-rank by importance.
Present the rubric as a table. Get user confirmation before proceeding.
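The mechanical rules above (dimension count, odd scale, weights summing to 1.0) can be checked before showing the rubric to the user. This is a sketch under assumptions: `validate_rubric` is a hypothetical helper, and it expects each dimension as a dict with `name`, `weight`, `scale`, and `levels` keys. Whether a level descriptor is concrete enough still needs human review.

```python
import math


def validate_rubric(dimensions: list) -> list:
    """Return a list of rule violations; an empty list means the rubric passes."""
    problems = []
    if not 3 <= len(dimensions) <= 5:
        problems.append(f"expected 3-5 dimensions, got {len(dimensions)}")
    total = sum(d["weight"] for d in dimensions)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")
    for d in dimensions:
        if d["scale"] not in (3, 5):
            problems.append(f"{d['name']}: scale {d['scale']} is not 3 or 5")
        # Every point on the scale needs its own descriptor.
        if set(d["levels"]) != set(range(1, d["scale"] + 1)):
            problems.append(f"{d['name']}: levels must cover 1..{d['scale']}")
    return problems
```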
Step 2: Generate eval artifacts
2a. Rubric (evals/rubrics/main.yaml)
name: "agent-name-eval"
version: "1.0"
dimensions:
  - name: correctness
    weight: 0.4
    scale: 5
    levels:
      1: "Output is factually wrong or fails the task entirely"
      2: "Partially correct but contains significant errors"
      3: "Mostly correct with minor errors or omissions"
      4: "Correct with negligible issues"
      5: "Fully correct, complete, and precise"
pass_threshold: 3.5
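One plausible reading of how `weight` and `pass_threshold` combine is a weighted mean of per-dimension scores compared against the threshold. The sketch below assumes that reading, and also assumes all dimensions share the same scale (mixed 3-point and 5-point scales would need normalizing first). `weighted_score` and `passes` are hypothetical names.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-dimension scores; weights are assumed to sum to 1.0."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores: dict, weights: dict, pass_threshold: float = 3.5) -> bool:
    """True when the weighted score meets the rubric's pass_threshold."""
    return weighted_score(scores, weights) >= pass_threshold
```

For example, with weights 0.4 / 0.3 / 0.3 and scores 4 / 3 / 4, the weighted score is 1.6 + 0.9 + 1.2 = 3.7, which clears a 3.5 threshold.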
2b. Test cases (evals/test_cases/seed.yaml)
At least 10 cases: 3-4 easy, 3-4 medium, 2-3 hard.
Reference scores on ≥3 easy cases enable leniency tracking. Two modes:
- Default (auto-graded): you generate reference_scores by applying the rubric to the expected_output sketch. Tag with graded_by: claude so the source is transparent. Leniency computed against these is directional only (it catches gross judge drift but can't detect a judge and reference that share the same LLM bias). Good enough for most users.
- Opt-in (human-graded): if the user asks for --calibrate or says they want high-confidence leniency, follow step 2f.
Do NOT silently generate references without the graded_by tag. The user must be able to tell at a glance whether leniency is human-anchored or not.
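The exact leniency formula and thresholds live in references/rubric-design.md. As an illustration only, the sketch below assumes leniency is the mean signed difference between judge scores and reference scores, so a positive value means the judge scores higher than the reference. The `leniency` helper and the `judged` mapping (case id to per-dimension scores) are hypothetical.

```python
from statistics import mean
from typing import Optional


def leniency(judged: dict, cases: list) -> Optional[float]:
    """Mean (judge - reference) over all dimensions of cases that carry reference_scores."""
    deltas = []
    for case in cases:
        refs = case.get("reference_scores")
        got = judged.get(case["id"])
        if not refs or not got:
            continue  # cases without references do not contribute
        deltas.extend(got[dim] - ref for dim, ref in refs.items() if dim in got)
    return mean(deltas) if deltas else None
```

Returning None when no case has references keeps a missing measurement distinct from a leniency of zero.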
test_cases:
  - id: "tc-001"
    input: "the exact input to the agent"
    context: "additional context if needed"
    expected_output: "what a good response looks like (a sketch, not ground truth)"
    metadata:
      difficulty: easy
      category: "happy-path"
    reference_scores:
      correctness: 4
      completeness: 4
    reference_metadata:
      graded_by: claude  # or "human" after running step 2f
      graded_at: "2026-04-18T14:23:00Z"
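A quick sanity check over the loaded cases can catch a seed file that breaks the rules in this step. This sketch assumes the cases are already parsed into dicts, and that `reference_scores` and `reference_metadata` sit at the case level next to `metadata`. `check_seed_cases` is a hypothetical helper.

```python
def check_seed_cases(cases: list) -> list:
    """Return a list of problems with the seed test cases; empty means OK."""
    problems = []
    if len(cases) < 10:
        problems.append(f"expected at least 10 cases, got {len(cases)}")
    easy_with_refs = [
        c for c in cases
        if c.get("metadata", {}).get("difficulty") == "easy" and c.get("reference_scores")
    ]
    if len(easy_with_refs) < 3:
        problems.append(
            f"expected reference_scores on at least 3 easy cases, got {len(easy_with_refs)}"
        )
    for c in cases:
        graded_by = c.get("reference_metadata", {}).get("graded_by")
        # References must always say who produced them.
        if c.get("reference_scores") and graded_by not in ("claude", "human"):
            problems.append(f"{c['id']}: reference_scores without a graded_by tag")
    return problems
```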
2c. Judge prompt (evals/prompts/judge.md)
Use the template from references/judge-prompts.md. Must include:
- Full rubric with level descriptors
- 2-3 calibration examples (clear pass, borderline, clear fail)
- Evidence + suggestion + confidence required per dimension (see rubric-design.md "Explainability Fields")
- JSON output format with those fields
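The authoritative output format is the one in references/judge-prompts.md. To illustrate the requirements above, the sketch below assumes a hypothetical shape: a JSON object keyed by dimension name, each entry carrying `score`, `evidence`, `suggestion`, and `confidence`. It also assumes the reply is already bare JSON (pulling JSON out of surrounding prose is covered in references/judge-robustness.md). `parse_judge_reply` is a hypothetical name.

```python
import json

REQUIRED_FIELDS = ("score", "evidence", "suggestion", "confidence")


def parse_judge_reply(reply: str, scales: dict) -> dict:
    """Parse a judge reply and check every dimension has the required fields.

    `scales` maps dimension name -> number of points on its scale (3 or 5).
    """
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for dim, scale in scales.items():
        entry = data.get(dim)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {dim}")
        absent = [field for field in REQUIRED_FIELDS if field not in entry]
        if absent:
            raise ValueError(f"{dim}: missing fields {absent}")
        score = entry["score"]
        if not isinstance(score, int) or not 1 <= score <= scale:
            raise ValueError(f"{dim}: score {score!r} outside 1..{scale}")
    return data
```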
2d. Eval harness (evals/eval_harness.py)
Single-file Python module. Orchestrates: load rubric + test cases → run agent → send to judge → parse scores → aggregate + compute leniency → write report.
Required CLI flags (bake these in — users hit bugs and need them):
--framework NAME # which subject to run (or "all" for multi-subject)
--test-case ID # run a single case (essential for debugging adapters)
-v / --verbose # print per-case metadata
--trials N # pass@k / variance measurement
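The flags above map directly onto the standard library's argparse. In this sketch the flag names come from the list above, while the defaults (`all`, no single-case filter, one trial) and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI for the eval harness; defaults here are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Run the agent eval harness")
    parser.add_argument("--framework", metavar="NAME", default="all",
                        help='which subject to run, or "all" for multi-subject')
    parser.add_argument("--test-case", metavar="ID", default=None,
                        help="run a single case (useful when debugging adapters)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-case metadata")
    parser.add_argument("--trials", metavar="N", type=int, default=1,
                        help="number of runs per case for pass@k / variance")
    return parser
```

argparse turns `--test-case` into the attribute `args.test_case`, so `build_parser().parse_args(["--test-case", "tc-001", "-v"])` yields `test_case == "tc-001"` and `verbose == True`.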
The harness should:
- Load rubric from YAML and test cases from YAML
- Dispatch to the framework adapter — see references/framework-adapters.md
- Send (input, output) to the judge LLM with the rubric
- Parse the judge's JSON