Skill Evaluation Workbench
When To Use
- A skill or prompt needs repeatable quality checks across models or configurations.
- A workflow needs file-based graders, command traces, or local artifact checks.
- A tool or MCP skill needs a hidden service fixture or sandboxed test workspace.
- A previous agent attempt failed and you need trace-driven diagnosis before editing instructions.
Requirements / Checks
- Confirm an eval runner exists locally before running anything. Do not install de
[Description truncated. See the full README on GitHub.]
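The first requirement above, confirming that an eval runner exists locally before running anything, could be sketched as a read-only lookup on PATH. This is a minimal illustration, not the skill's own tooling: the function `find_runner` and the runner names `eval-runner` and `skill-eval` are hypothetical placeholders, since the excerpt does not name a specific runner. The check only looks up executables and does not install or run anything.

```python
import shutil


def find_runner(candidates):
    """Return (name, path) for the first candidate found on PATH, else None.

    This only inspects PATH; it never installs or executes anything.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


# Hypothetical runner names; substitute whatever the project actually uses.
found = find_runner(["eval-runner", "skill-eval"])
if found is None:
    print("No eval runner found on PATH; stop here rather than running anything.")
else:
    print(f"Eval runner available: {found[0]} at {found[1]}")
```

Returning `None` instead of raising keeps the decision about what to do next (stop, ask the user, or report) with the caller.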