Evaluating LLM Agent Systems
Agent evaluation requires fundamentally different approaches from traditional software testing. Agents make dynamic decisions, behave non-deterministically, and often work on tasks that have no single correct answer. Effective evaluation must account for these characteristics while still providing actionable feedback.
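To make the contrast with exact-match testing concrete, here is a minimal sketch of one possible approach: run the agent several times on the same task and report the fraction of runs that satisfy a criterion, rather than comparing one output to one expected string. The names `pass_rate`, `make_toy_agent`, and `mentions_paris` are hypothetical and do not come from the original README. The toy agent is a random stand-in for a real LLM call.

```python
import random
from statistics import mean
from typing import Callable


def pass_rate(agent: Callable[[str], str],
              check: Callable[[str], bool],
              task: str,
              trials: int = 20) -> float:
    """Run `agent` on `task` `trials` times; return the fraction of outputs accepted by `check`."""
    return mean(1.0 if check(agent(task)) else 0.0 for _ in range(trials))


def make_toy_agent(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM agent: picks one of several phrasings at random."""
    rng = random.Random(seed)
    phrasings = [
        "The capital of France is Paris.",
        "Paris.",
        "It's Paris, I believe.",
        "I am not sure.",
    ]

    def agent(task: str) -> str:
        # The task is ignored here; a real agent would condition on it.
        return rng.choice(phrasings)

    return agent


def mentions_paris(output: str) -> bool:
    """Criterion-based check: accepts any phrasing that contains the expected fact."""
    return "paris" in output.lower()


if __name__ == "__main__":
    agent = make_toy_agent(seed=0)
    rate = pass_rate(agent, mentions_paris, "What is the capital of France?")
    print(f"pass rate over 20 trials: {rate:.2f}")
```

A criterion function accepts several valid phrasings that a strict string comparison would reject, and the pass rate over repeated trials gives a more stable signal than a single run. In practice the check could be replaced by a rubric or a judge model. That substitution is an assumption of this sketch, not something the text above specifies.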
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known
[Description truncated. See the full README on GitHub.]