Eval Pipeline
A skill for systematically measuring and comparing the performance of AI agents and models.
When to use this skill
- You want to know if a prompt change actually improved results
- You need to compare two versions of an agent
- You want to catch regressions before deploying changes
- You need to report on agent quality with concrete metrics
- You are building a dataset to fine-tune or test a model
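As an illustration of the first three cases, the sketch below scores two versions of an agent on the same fixed set of test cases and flags a regression. It is a minimal sketch, not part of this skill: `Case`, `pass_rate`, `compare`, the exact-match scoring rule, and the two toy agents are all hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One test case: an input prompt and the answer we expect (hypothetical schema)."""
    prompt: str
    expected: str


def pass_rate(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the agent's output exactly matches the expected answer."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(agent(c.prompt).strip() == c.expected for c in cases)
    return passed / len(cases)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            cases: list[Case],
            tolerance: float = 0.0) -> dict:
    """Score both versions on the same cases and flag a drop larger than `tolerance`."""
    base = pass_rate(baseline, cases)
    cand = pass_rate(candidate, cases)
    return {
        "baseline": base,
        "candidate": cand,
        "delta": cand - base,
        "regression": cand < base - tolerance,
    }


# Toy stand-ins for two agent versions; a real agent would call a model here.
CASES = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("3*3", "9"),
    Case("opposite of hot", "cold"),
]


def baseline_agent(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def candidate_agent(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")


report = compare(baseline_agent, candidate_agent, CASES)
print(report)
```

Exact string match is only one possible scoring rule; it is used here because it keeps the example deterministic. Running both versions on the identical case list is what makes the two pass rates comparable.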
Evaluation process
1. Define what you are measuring
Pick the right metric for
[Description truncated. See the full README on GitHub.]