agent-evaluation

Name: agent-evaluation
Rating: 5 (9 reviews)
Author: viktorbezdek

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", "implement LLM-as-judge", "compare model outputs", "mitigate evaluation bias", or mentions multi-dimensional evaluation, agent testing, quality gates, direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated qualit

9 stars

Updated 2 months ago

License: MIT

How to add

/plugin marketplace add viktorbezdek/skillstack

The exact command may vary by repository. Check the README on GitHub.

#llm #ai #test

Evaluating LLM Agent Systems

Agent evaluation requires fundamentally different approaches than traditional software testing. Agents make dynamic decisions, are non-deterministic, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback.
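One way to account for non-determinism and the lack of a single correct answer is to run the same task several times and score each run against a weighted, multi-dimensional rubric rather than an exact expected output. The sketch below is illustrative only: the rubric dimensions, the weights, the `evaluate_agent` helper, and the 0.7 threshold are assumptions made for this example and do not come from the skill itself.

```python
import statistics
from typing import Callable, Dict

# Hypothetical rubric: dimension name -> weight (weights sum to 1.0).
RUBRIC: Dict[str, float] = {
    "correctness": 0.5,
    "completeness": 0.3,
    "efficiency": 0.2,
}


def weighted_score(dimension_scores: Dict[str, float],
                   rubric: Dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score."""
    return sum(weight * dimension_scores[name] for name, weight in rubric.items())


def evaluate_agent(run_agent: Callable[[str], str],
                   score_output: Callable[[str], Dict[str, float]],
                   task: str,
                   trials: int = 5,
                   threshold: float = 0.7) -> Dict[str, float]:
    """Run the agent `trials` times on one task and summarize the scores.

    `run_agent` and `score_output` are caller-supplied stand-ins: the first
    produces an output for the task, the second rates that output on each
    rubric dimension (for instance via a human or an LLM judge).
    """
    scores = [
        weighted_score(score_output(run_agent(task)), RUBRIC)
        for _ in range(trials)
    ]
    return {
        "mean": statistics.mean(scores),
        # Spread across runs shows how unstable the agent is on this task.
        "stdev": statistics.stdev(scores) if trials > 1 else 0.0,
        # Fraction of runs that clear the quality gate.
        "pass_rate": sum(s >= threshold for s in scores) / trials,
    }
```

Reporting a pass rate and spread alongside the mean keeps a single lucky or unlucky run from deciding the verdict, which is the main reason to repeat trials for a non-deterministic system.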

Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known

[Description truncated. See the full README on GitHub.]
