Experiment Set Design
Core Principle
Compliance doesn't prove quality. Quality requires blind comparison against baseline.
The Baseline Requirement
NON-NEGOTIABLE: Every experiment needs a baseline -- same prompt, same task, no skill loaded. No baseline = invalid experiment.
Task Selection Checklist
Each task must satisfy ALL of:
- Skill's guidance should produce different behavior on this task
- Complex enough that architecture decisions matter (not a one-l
[Description truncada. Veja o README completo no GitHub.]