Agent Eval Layer

Add a rubric-based evaluation layer to an existing agent project. Framework-agnostic — works with any agent.

What this skill produces

Rubric (evals/rubrics/main.yaml) — Scoring dimensions with concrete level descriptors and weights
Test cases (evals/test_cases/seed.yaml) — Input/expected-output pairs with difficulty tags (3+ with reference scores for leniency)
Judge prompt (evals/prompts/judge.md) — Structured prompt for LLM-as-a-judge with calibration examples and evidence/suggestion/confidence fields
Eval harness (evals/eval_harness.py) — Runs agent on test cases, sends to judge, aggregates scores + leniency
Reports — Per-subject markdown + (when multi-subject) self-contained HTML dashboard with leaderboard, radar, and per-case heatmap

File Reference

File	Purpose
references/rubric-design.md	Dimension catalog, scale guidance, leniency thresholds, anti-patterns
references/judge-prompts.md	Judge prompt template and calibration techniques
references/framework-adapters.md	Copy-paste recipes per framework (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents, raw Anthropic SDK) — structured output, token + tool-call extraction
references/structured-output-troubleshooting.md	The three Bedrock Opus structured-output errors and the two-stage pattern that fixes them
references/judge-robustness.md	JSON extraction helper, retry-once, defensive score aggregation — drop-in snippets
references/cross-subject-benchmarking.md	Multi-subject (model / framework / prompt) comparison flow
references/html-report-template.html	Self-contained Chart.js dashboard — leaderboard, radar, heatmap
references/interactive-calibration.md	Interactive human-grading dialog for collecting honest `reference_scores` — required for leniency to be meaningful

Metadata contract

Every harness generated by this skill returns the same 7 fields per agent call. Name them exactly this way so cross-project harnesses stay compatible:

{
    "recommendation":  <schema instance or None>,   # agent's structured output
    "latency_ms":      int,                         # wall-clock for the full agent call, including tool execution
    "tool_calls":      int,                         # count of tool invocations the agent initiated
    "input_tokens":    int | None,                  # summed across all turns of this call
    "output_tokens":   int | None,                  # summed across all turns
    "model_id":        str,                         # the model the agent targeted
    "error":           str | None,                  # framework/provider error, or None on success
}

See references/framework-adapters.md for the exact path each framework exposes each field through.
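
As a rough illustration of the contract above, the sketch below wraps a single agent call and normalizes the result to the seven fields. The `call_agent` adapter and the keys of its return value (`output`, `tool_calls`, `input_tokens`, `output_tokens`) are hypothetical; real adapters expose these through framework-specific paths.

```python
import time
from typing import Any, Callable, Optional, TypedDict


class AgentCallResult(TypedDict):
    recommendation: Any            # agent's structured output, or None on failure
    latency_ms: int                # wall-clock for the full call
    tool_calls: int                # count of tool invocations
    input_tokens: Optional[int]    # None when the framework does not report usage
    output_tokens: Optional[int]
    model_id: str
    error: Optional[str]


def run_with_contract(
    call_agent: Callable[[str], dict], prompt: str, model_id: str
) -> AgentCallResult:
    """Time one agent call and normalize its result to the 7-field contract.

    `call_agent` is a hypothetical adapter that returns a dict with the
    optional keys "output", "tool_calls", "input_tokens", "output_tokens".
    """
    start = time.perf_counter()
    raw: dict = {}
    error: Optional[str] = None
    try:
        raw = call_agent(prompt)
    except Exception as exc:  # record the failure instead of aborting the run
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "recommendation": raw.get("output"),
        "latency_ms": latency_ms,
        "tool_calls": int(raw.get("tool_calls", 0)),
        "input_tokens": raw.get("input_tokens"),
        "output_tokens": raw.get("output_tokens"),
        "model_id": model_id,
        "error": error,
    }
```

Catching the exception and storing it in `error` keeps one failing case from stopping the whole eval run.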

Step 0: Understand the agent

Read the agent's codebase and answer:

What does the agent do? (one sentence)
What are its inputs and outputs?
What tools does it use?
Which framework does it use? (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, raw SDK, custom)
Which model + provider does it target? (Claude on Bedrock changes structured-output strategy — see step 2e)
Is it multi-agent? If yes, what's the pipeline?
What does "good" look like? Ask the user if unclear.

Share findings with the user before proceeding.

Are you benchmarking multiple subjects?

If the user says they want to compare multiple agents, models, frameworks, or prompts against the same task, skip ahead to references/cross-subject-benchmarking.md for the multi-subject flow. The single-subject steps below still apply — you'll just run them N times.

Step 1: Design the rubric

Pick 3-5 dimensions from the catalog in references/rubric-design.md. Common patterns by agent type:

Agent Type	Typical Dimensions
RAG / Q&A	Correctness, Completeness, Faithfulness, Relevance
Task Automation	Task Completion, Efficiency, Error Handling, Safety
Content Generation	Correctness, Completeness, Tone, Engagement
Multi-Agent	Coordination, Final Output Quality, Pipeline Integrity

Rules:

3-5 dimensions. More than five adds score noise.
Odd scales (3 or 5 points). Use a 3-point scale for near-binary pass/fail judgments and a 5-point scale for nuanced ones.
Concrete level descriptors. "Good" is not a descriptor. "Correctly addresses the main question but misses important nuances" is.
Weights sum to 1.0. Force-rank by importance.

Present the rubric as a table. Get user confirmation before proceeding.
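
The rules above can be checked mechanically before showing the rubric to the user. The following is a minimal sketch, assuming the rubric has already been loaded into a dict shaped like the YAML in step 2a; `validate_rubric` is a hypothetical helper name, not part of any library.

```python
import math


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the rubric passes."""
    problems = []
    dims = rubric.get("dimensions", [])

    # Rule: 3-5 dimensions.
    if not 3 <= len(dims) <= 5:
        problems.append(f"expected 3-5 dimensions, got {len(dims)}")

    for dim in dims:
        name = dim.get("name", "<unnamed>")
        scale = dim.get("scale")
        # Rule: odd scales, 3 or 5 points.
        if scale not in (3, 5):
            problems.append(f"{name}: scale must be 3 or 5, got {scale}")
            continue
        # Rule: every level needs a concrete descriptor.
        levels = dim.get("levels", {})
        if sorted(levels) != list(range(1, scale + 1)):
            problems.append(f"{name}: needs a descriptor for each level 1..{scale}")

    # Rule: weights sum to 1.0 (with a small tolerance for float rounding).
    total = sum(dim.get("weight", 0.0) for dim in dims)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")
    return problems
```

The check cannot judge whether a descriptor is concrete enough; that part still needs a human read.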

Step 2: Generate eval artifacts

2a. Rubric (`evals/rubrics/main.yaml`)

name: "agent-name-eval"
version: "1.0"
dimensions:
  - name: correctness
    weight: 0.4
    scale: 5
    levels:
      1: "Output is factually wrong or fails the task entirely"
      2: "Partially correct but contains significant errors"
      3: "Mostly correct with minor errors or omissions"
      4: "Correct with negligible issues"
      5: "Fully correct, complete, and precise"
pass_threshold: 3.5
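
One plausible reading of `weight` and `pass_threshold` is that a case passes when the weighted mean of its per-dimension scores reaches the threshold. The sketch below takes that reading as an assumption, and it also assumes all dimensions share the same scale; `weighted_score` and `passes` are hypothetical helper names.

```python
def weighted_score(scores: dict[str, float], rubric: dict) -> float:
    """Weighted mean of per-dimension scores, using the rubric's weights."""
    return sum(dim["weight"] * scores[dim["name"]] for dim in rubric["dimensions"])


def passes(scores: dict[str, float], rubric: dict) -> bool:
    """Assumed rule: pass when the weighted score meets pass_threshold."""
    return weighted_score(scores, rubric) >= rubric["pass_threshold"]
```

For example, with weights 0.4 / 0.3 / 0.3 and scores 4 / 3 / 4, the weighted score is 0.4·4 + 0.3·3 + 0.3·4 = 3.7, which clears a threshold of 3.5.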

2b. Test cases (`evals/test_cases/seed.yaml`)

At least 10 cases in total, split roughly 3-4 easy, 3-4 medium, and 2-3 hard.

Reference scores on ≥3 easy cases enable leniency tracking. Two modes:

Default (auto-graded) — you generate reference_scores by applying the rubric to the expected_output sketch. Tag with graded_by: claude so the source is transparent. Leniency computed against these is directional only (it catches gross judge drift but can't detect a judge and reference that share the same LLM bias). Good enough for most users.
Opt-in (human-graded) — if the user asks for --calibrate or says they want high-confidence leniency, follow step 2f.

Do NOT silently generate references without the graded_by tag. The user must be able to tell at a glance whether leniency is human-anchored or not.

test_cases:
  - id: "tc-001"
    input: "the exact input to the agent"
    context: "additional context if needed"
    expected_output: "what a good response looks like (a sketch, not ground truth)"
    metadata:
      difficulty: easy
      category: "happy-path"
    reference_scores:
      correctness: 4
      completeness: 4
    reference_metadata:
      graded_by: claude        # or "human" after running step 2f
      graded_at: "2026-04-18T14:23:00Z"
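
To make the leniency idea concrete, here is a minimal sketch under a stated assumption: leniency is taken to be the mean of (judge score minus reference score) over every case and dimension that has a reference, so a positive value means the judge scores higher than the reference. The actual thresholds live in references/rubric-design.md; the function name and return shape here are hypothetical.

```python
from statistics import mean


def leniency(test_cases: list[dict], judge_scores: dict[str, dict[str, float]]) -> dict:
    """Compare judge scores to reference_scores on cases that have them.

    `judge_scores` maps test-case id -> {dimension: score}.
    """
    deltas = []
    sources = set()
    for case in test_cases:
        reference = case.get("reference_scores")
        judged = judge_scores.get(case["id"])
        if not reference or not judged:
            continue  # no reference (or no judge result) for this case
        # Track who produced the references so the report can say so.
        sources.add(case.get("reference_metadata", {}).get("graded_by", "unknown"))
        deltas.extend(judged[dim] - ref for dim, ref in reference.items() if dim in judged)
    return {
        "leniency": mean(deltas) if deltas else None,
        "n_pairs": len(deltas),
        "graded_by": sorted(sources),
    }
```

Returning `graded_by` alongside the number lets a report state whether the figure is anchored to human or auto-graded references.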

2c. Judge prompt (`evals/prompts/judge.md`)

Use the template from references/judge-prompts.md. Must include:

Full rubric with level descriptors
2-3 calibration examples (clear pass, borderline, clear fail)
Evidence + suggestion + confidence required per dimension (see rubric-design.md "Explainability Fields")
JSON output format with those fields
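
A harness can verify that the judge actually returned the required fields before aggregating anything. The sketch below assumes a JSON shape of `{"dimensions": {"<name>": {"score", "evidence", "suggestion", "confidence"}}}`; that shape and the helper name are illustrative assumptions, and the template in references/judge-prompts.md defines the real format.

```python
REQUIRED_FIELDS = ("score", "evidence", "suggestion", "confidence")


def check_judge_output(parsed: dict, rubric: dict) -> list[str]:
    """Return a list of problems found in a parsed judge response."""
    problems = []
    dimensions = parsed.get("dimensions", {})
    for dim in rubric["dimensions"]:
        name = dim["name"]
        entry = dimensions.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing from judge output")
            continue
        # Every dimension must carry all explainability fields.
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{name}: missing '{field}'")
        # Scores must fall inside the dimension's scale.
        score = entry.get("score")
        if isinstance(score, (int, float)) and not 1 <= score <= dim["scale"]:
            problems.append(f"{name}: score {score} outside 1..{dim['scale']}")
    return problems
```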

2d. Eval harness (`evals/eval_harness.py`)

Single-file Python module. Orchestrates: load rubric + test cases → run agent → send to judge → parse scores → aggregate + compute leniency → write report.

Required CLI flags (bake these in — users hit bugs and need them):

--framework NAME      # which subject to run (or "all" for multi-subject)
--test-case ID        # run a single case (essential for debugging adapters)
-v / --verbose        # print per-case metadata
--trials N            # pass@k / variance measurement
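
The flags above map directly onto `argparse`. This is a minimal sketch; the defaults (no framework selected, all cases, one trial) are assumptions rather than something the skill prescribes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI for the eval harness with the four required flags."""
    parser = argparse.ArgumentParser(
        prog="eval_harness.py",
        description="Run the agent on test cases and score it with the judge.",
    )
    parser.add_argument("--framework", metavar="NAME", default=None,
                        help='which subject to run, or "all" for multi-subject')
    parser.add_argument("--test-case", metavar="ID", default=None,
                        help="run a single case by id (default: run every case)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-case metadata")
    parser.add_argument("--trials", metavar="N", type=int, default=1,
                        help="repeat each case N times to measure variance")
    return parser
```

Usage: `build_parser().parse_args(["--test-case", "tc-001", "-v"])` yields a namespace with `test_case="tc-001"` and `verbose=True`.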

The harness should:

Load rubric from YAML and test cases from YAML
Dispatch to the framework adapter — see references/framework-adapters.md
Send (input, output) to the judge LLM with the rubric
Parse the judge's JSON
