
eval-layer

Development

Add rubric-based evaluation to an existing agent codebase. Use when someone asks to add evals, evaluate their agent, measure agent quality, or set up LLM-as-a-judge scoring. Handles single-agent and multi-subject (model/framework/prompt) comparisons.

11 stars · Author: erezweinstein5

Agent Eval Layer

Add a rubric-based evaluation layer to an existing agent project. Framework-agnostic — works with any agent.

What this skill produces

  1. Rubric (evals/rubrics/main.yaml) — Scoring dimensions with concrete level descriptors and weights
  2. Test cases (evals/test_cases/seed.yaml) — Input/expected-output pairs with difficulty tags (3+ with reference scores for leniency)
  3. Judge prompt (evals/prompts/judge.md) — Structured prompt for LLM-as-a-judge with calibration examples and evidence/suggestion/confidence fields
  4. Eval harness (evals/eval_harness.py) — Runs agent on test cases, sends to judge, aggregates scores + leniency
  5. Reports — Per-subject markdown + (when multi-subject) self-contained HTML dashboard with leaderboard, radar, and per-case heatmap

File Reference

| File | Purpose |
|---|---|
| references/rubric-design.md | Dimension catalog, scale guidance, leniency thresholds, anti-patterns |
| references/judge-prompts.md | Judge prompt template and calibration techniques |
| references/framework-adapters.md | Copy-paste recipes per framework (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents, raw Anthropic SDK): structured output, token and tool-call extraction |
| references/structured-output-troubleshooting.md | The three Bedrock Opus structured-output errors and the two-stage pattern that fixes them |
| references/judge-robustness.md | JSON extraction helper, retry-once, defensive score aggregation (drop-in snippets) |
| references/cross-subject-benchmarking.md | Multi-subject (model / framework / prompt) comparison flow |
| references/html-report-template.html | Self-contained Chart.js dashboard: leaderboard, radar, heatmap |
| references/interactive-calibration.md | Interactive human-grading dialog for collecting honest reference_scores; required for leniency to be meaningful |

Metadata contract

Every harness generated by this skill returns the same 7 fields per agent call. Name them exactly this way so cross-project harnesses stay compatible:

{
    "recommendation":  <schema instance or None>,   # agent's structured output
    "latency_ms":      int,                         # wall-clock for the full agent call, including tool execution
    "tool_calls":      int,                         # count of tool invocations the agent initiated
    "input_tokens":    int | None,                  # summed across all turns of this call
    "output_tokens":   int | None,                  # summed across all turns
    "model_id":        str,                         # the model the agent targeted
    "error":           str | None,                  # framework/provider error, or None on success
}

See references/framework-adapters.md for the exact path each framework exposes each field through.
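
As an illustration of the contract above, here is a minimal sketch of a wrapper that times one agent call and returns the seven fields. `CallMetadata`, `run_with_metadata`, and the shape of the dict returned by `agent_fn` (keys `output`, `tool_calls`, `input_tokens`, `output_tokens`) are assumptions made for this example; real adapters read these values from framework-specific objects, as described in references/framework-adapters.md.

```python
import time
from typing import Any, Callable, Optional, TypedDict


class CallMetadata(TypedDict):
    """The seven fields of the metadata contract."""
    recommendation: Any
    latency_ms: int
    tool_calls: int
    input_tokens: Optional[int]
    output_tokens: Optional[int]
    model_id: str
    error: Optional[str]


def run_with_metadata(
    agent_fn: Callable[[str], dict], user_input: str, model_id: str
) -> CallMetadata:
    """Run a hypothetical adapter and normalize its result to the contract.

    Assumes agent_fn returns a dict with optional keys "output",
    "tool_calls", "input_tokens", and "output_tokens".
    """
    start = time.perf_counter()
    raw: dict = {}
    error: Optional[str] = None
    try:
        raw = agent_fn(user_input)
    except Exception as exc:  # record the failure instead of aborting the run
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "recommendation": raw.get("output"),
        "latency_ms": latency_ms,
        "tool_calls": int(raw.get("tool_calls", 0)),
        "input_tokens": raw.get("input_tokens"),
        "output_tokens": raw.get("output_tokens"),
        "model_id": model_id,
        "error": error,
    }
```

Catching the exception inside the wrapper keeps one failing case from stopping the whole run; the failure surfaces in the `error` field instead.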


Step 0: Understand the agent

Read the agent's codebase and answer:

  1. What does the agent do? (one sentence)
  2. What are its inputs and outputs?
  3. What tools does it use?
  4. Which framework does it use? (PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, raw SDK, custom)
  5. Which model + provider does it target? (Claude on Bedrock changes structured-output strategy — see step 2e)
  6. Is it multi-agent? If yes, what's the pipeline?
  7. What does "good" look like? Ask the user if unclear.

Share findings with the user before proceeding.

Are you benchmarking multiple subjects?

If the user says they want to compare multiple agents, models, frameworks, or prompts against the same task, skip ahead to references/cross-subject-benchmarking.md for the multi-subject flow. The single-subject steps below still apply — you'll just run them N times.

Step 1: Design the rubric

Pick 3-5 dimensions from the catalog in references/rubric-design.md. Common patterns by agent type:

| Agent Type | Typical Dimensions |
|---|---|
| RAG / Q&A | Correctness, Completeness, Faithfulness, Relevance |
| Task Automation | Task Completion, Efficiency, Error Handling, Safety |
| Content Generation | Correctness, Completeness, Tone, Engagement |
| Multi-Agent | Coordination, Final Output Quality, Pipeline Integrity |

Rules:

  • 3-5 dimensions. More than five adds score noise.
  • Odd scales (3 or 5 points). Use a 3-point scale for near pass/fail judgments and a 5-point scale when nuance matters.
  • Concrete level descriptors. "Good" is not a descriptor. "Correctly addresses the main question but misses important nuances" is.
  • Weights sum to 1.0. Force-rank by importance.

Present the rubric as a table. Get user confirmation before proceeding.
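
The rules above can be checked mechanically before the rubric is shown to the user. The following is a minimal sketch, assuming the rubric has already been loaded into a dict shaped like evals/rubrics/main.yaml with integer level keys; `validate_rubric` is a hypothetical helper name, not part of the skill.

```python
import math


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the rubric passes."""
    problems = []
    dims = rubric.get("dimensions", [])

    # Rule: 3-5 dimensions.
    if not 3 <= len(dims) <= 5:
        problems.append(f"expected 3-5 dimensions, found {len(dims)}")

    # Rule: weights sum to 1.0 (with a small tolerance for float rounding).
    total = sum(d.get("weight", 0.0) for d in dims)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")

    for d in dims:
        name = d.get("name", "<unnamed>")
        scale = d.get("scale")
        # Rule: odd scales, 3 or 5 points.
        if scale not in (3, 5):
            problems.append(f"{name}: scale must be 3 or 5, got {scale}")
            continue
        # Rule: every level needs a non-empty descriptor.
        levels = d.get("levels", {})
        missing = [lvl for lvl in range(1, scale + 1)
                   if not str(levels.get(lvl, "")).strip()]
        if missing:
            problems.append(f"{name}: missing level descriptors {missing}")
    return problems
```

This check cannot tell whether a descriptor is concrete enough; that part of the rules still needs a human or LLM read.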

Step 2: Generate eval artifacts

2a. Rubric (evals/rubrics/main.yaml)

name: "agent-name-eval"
version: "1.0"
dimensions:
  - name: correctness
    weight: 0.4
    scale: 5
    levels:
      1: "Output is factually wrong or fails the task entirely"
      2: "Partially correct but contains significant errors"
      3: "Mostly correct with minor errors or omissions"
      4: "Correct with negligible issues"
      5: "Fully correct, complete, and precise"
pass_threshold: 3.5
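
The rubric file does not spell out how `pass_threshold` is applied. One plausible reading, sketched below, is that the per-dimension scores are combined into a weighted mean and compared with the threshold. Both helper names are hypothetical, and the sketch assumes every dimension uses the same scale; mixed 3-point and 5-point scales would need normalizing first.

```python
def weighted_score(scores: dict, rubric: dict) -> float:
    """Weighted mean of per-dimension scores (assumes one shared scale)."""
    return sum(d["weight"] * scores[d["name"]] for d in rubric["dimensions"])


def passes(scores: dict, rubric: dict) -> bool:
    """Assumption: a case passes when the weighted mean meets pass_threshold."""
    return weighted_score(scores, rubric) >= rubric["pass_threshold"]


# Example rubric; only the correctness weight and the threshold come from the
# YAML above, the other two dimensions are illustrative.
example_rubric = {
    "pass_threshold": 3.5,
    "dimensions": [
        {"name": "correctness", "weight": 0.4},
        {"name": "completeness", "weight": 0.3},
        {"name": "faithfulness", "weight": 0.3},
    ],
}
example_scores = {"correctness": 4, "completeness": 3, "faithfulness": 4}
print(round(weighted_score(example_scores, example_rubric), 2),
      passes(example_scores, example_rubric))
```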

2b. Test cases (evals/test_cases/seed.yaml)

At least 10 cases: 3-4 easy, 3-4 medium, 2-3 hard.

Reference scores on ≥3 easy cases enable leniency tracking. Two modes:

  • Default (auto-graded) — you generate reference_scores by applying the rubric to the expected_output sketch. Tag with graded_by: claude so the source is transparent. Leniency computed against these is directional only (it catches gross judge drift but can't detect a judge and reference that share the same LLM bias). Good enough for most users.
  • Opt-in (human-graded) — if the user asks for --calibrate or says they want high-confidence leniency, follow step 2f.

Do NOT silently generate references without the graded_by tag. The user must be able to tell at a glance whether leniency is human-anchored or not.

test_cases:
  - id: "tc-001"
    input: "the exact input to the agent"
    context: "additional context if needed"
    expected_output: "what a good response looks like (a sketch, not ground truth)"
    metadata:
      difficulty: easy
      category: "happy-path"
    reference_scores:
      correctness: 4
      completeness: 4
    reference_metadata:
      graded_by: claude        # or "human" after running step 2f
      graded_at: "2026-04-18T14:23:00Z"
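
To make leniency tracking concrete, here is a minimal sketch that compares judge scores with `reference_scores`. It assumes leniency is the mean signed difference (judge minus reference) per dimension, so positive values mean the judge is more generous than the reference; the exact definition and thresholds used by the skill live in references/rubric-design.md and may differ. The function name and the `judge_scores` layout are hypothetical.

```python
from statistics import mean


def leniency(judge_scores: dict, test_cases: list) -> dict:
    """Mean (judge - reference) per dimension, over cases that have references.

    judge_scores maps a test case id to {dimension: score}.
    """
    deltas: dict = {}
    for case in test_cases:
        ref = case.get("reference_scores")
        judged = judge_scores.get(case["id"])
        if not ref or not judged:
            continue  # no reference or no judge result for this case
        for dim, ref_score in ref.items():
            if dim in judged:
                deltas.setdefault(dim, []).append(judged[dim] - ref_score)
    return {dim: mean(vals) for dim, vals in deltas.items()}
```

When the references are tagged `graded_by: claude`, a result near zero only shows that judge and reference agree, not that either is right.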

2c. Judge prompt (evals/prompts/judge.md)

Use the template from references/judge-prompts.md. Must include:

  • Full rubric with level descriptors
  • 2-3 calibration examples (clear pass, borderline, clear fail)
  • Evidence + suggestion + confidence required per dimension (see rubric-design.md "Explainability Fields")
  • JSON output format with those fields
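
A harness can verify that the judge actually returned the required fields before trusting a score. The sketch below assumes the judge's JSON is keyed by dimension name and that each entry carries `score`, `evidence`, `suggestion`, and `confidence`; this layout and the name `validate_verdict` are assumptions for illustration, and the real template in references/judge-prompts.md may structure the output differently.

```python
REQUIRED_FIELDS = ("score", "evidence", "suggestion", "confidence")


def validate_verdict(verdict: dict, rubric: dict) -> list[str]:
    """Return problems found in a parsed judge verdict; empty list means OK."""
    problems = []
    for dim in rubric["dimensions"]:
        name, scale = dim["name"], dim["scale"]
        entry = verdict.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing from judge output")
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{name}: missing '{field}'")
        if "score" in entry:
            score = entry["score"]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if (not isinstance(score, int) or isinstance(score, bool)
                    or not 1 <= score <= scale):
                problems.append(
                    f"{name}: score {score!r} is not an integer in 1..{scale}")
    return problems
```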

2d. Eval harness (evals/eval_harness.py)

Single-file Python module. Orchestrates: load rubric + test cases → run agent → send to judge → parse scores → aggregate + compute leniency → write report.

Required CLI flags (bake these in — users hit bugs and need them):

--framework NAME      # which subject to run (or "all" for multi-subject)
--test-case ID        # run a single case (essential for debugging adapters)
-v / --verbose        # print per-case metadata
--trials N            # pass@k / variance measurement
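
One way to wire up these flags is with the standard-library `argparse` module. This is a sketch only: the flag names come from the list above, while the defaults (`all`, `None`, `1`) and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI for the eval harness; defaults here are illustrative assumptions."""
    parser = argparse.ArgumentParser(
        prog="eval_harness.py",
        description="Run rubric-based evals against an agent.",
    )
    parser.add_argument("--framework", metavar="NAME", default="all",
                        help='which subject to run, or "all" for multi-subject')
    parser.add_argument("--test-case", metavar="ID", default=None,
                        help="run a single test case by id")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-case metadata")
    parser.add_argument("--trials", metavar="N", type=int, default=1,
                        help="number of trials per case (pass@k / variance)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

`argparse` exposes `--test-case` as `args.test_case`, so the harness reads the value under the underscore name.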

The harness should:

  1. Load rubric from YAML and test cases from YAML
  2. Dispatch to the framework adapter — see references/framework-adapters.md
  3. Send (input, output) to the judge LLM with the rubric
  4. Parse the judge's JSON
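
Judges often wrap their JSON in extra prose, so step 4 benefits from a tolerant parser. The following is a minimal sketch of one such approach using only the standard library; it is not the helper from references/judge-robustness.md, and `extract_json` is a hypothetical name.

```python
import json


def extract_json(reply: str) -> dict:
    """Return the first JSON object embedded in a judge reply.

    Tries to decode at every "{" so leading or trailing prose is ignored.
    Raises ValueError if no JSON object can be decoded.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in judge reply")
```

A caller can catch the `ValueError` and re-prompt the judge once before recording the case as a judge failure.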

How to add

/plugin marketplace add erezweinstein5/eval-layer

The exact command may vary by repository. Check the README on GitHub.
