Haiku Pilot — Default-Haiku Execution Playbook
Thesis
Weak planner + strong delegation > strong planner doing everything.
AgentOpt (arxiv 2604.06296): On HotpotQA, Opus alone = 31.71%; Ministral 3 8B planner + Opus solver = 74.27%. Augment Code eval (2026-04): Best-quality AGENTS.md provides performance gain equivalent to one model tier (Haiku → Sonnet effective).
→ Default parent session to Haiku 4.5; aggressively delegate to sub-agents; escalate only on quantitative gate trigger.
"Documentation / process design improvements ≥ model upgrade" is the engineering basis. Blindly upgrading models in multi-stage pipelines often reduces performance (planner doesn't delegate).
Per-Session Pre-flight (run once per task)
1. Reference Pattern Technique
| ❌ Verbal description | ✅ Point to a file |
|---|---|
| "Write a login-like function" | "Follow the pattern in src/auth/login.ts" |
| "Add a hook similar to useUser" | "Pattern: hooks/useUser.ts:42-78" |
Why it matters: Haiku diverges easily from abstract descriptions; pointing to a concrete file = strongest possible context.
2. Think-Before-Coding (compensates Haiku tendency)
Before implementation, output:
- 1–2 sentence task understanding restatement (your interpretation, not parroting)
- Key assumptions (≥ 1, prefixed with "Assumption: ...")
- If multiple interpretations are reasonable → list options for user confirmation, don't pick one yourself
Source:
core.md§ "Think-Before-Coding". Haiku tends to skip this step; this SKILL enforces it.
3. Diff-Review Pledge
Before declaring "complete", run:
git diff --stat
git status
Explain each change in one sentence. Catches Haiku quietly modifying out-of-scope files (surgical-changes discipline).
4. Citation Anchor Enforcement
When citing wiki / docs / spec for numbers, conclusions, or anti-patterns, attach a structured anchor:
(P0X §Y.Z)— paper section ref (e.g.(P03 §4.2)) — preferred for paper-grounded citations(<source> Step N)— wiki step number(anti-pattern #N)— wiki anti-pattern number(<file>:line-range)— code / config line ref
Anchor count requirement (per answer paragraph):
- Easy / medium tasks: ≥ 3 structured anchors
- Hard tasks (architecture / counter-factual / synthesis): ≥ 5 structured anchors —
(P0X §Y.Z)format strongly preferred
If no precise anchor available → tag [unverified]. Never fake a source.
Violation: retry that paragraph; do NOT declare done.
Empirical basis: anchor density discrepancy is the primary driver of citation accuracy variance between Haiku and Sonnet/Opus on benchmark tasks. 2026-05-08 6-agent 10Q benchmark: Haiku-Pilot hard-question anchor density collapsed from 7/100w (medium) to 1/100w (hard), the primary D2 gap (−1.0) vs vanilla Opus. Pre-flight previously did not require ≥ 5 anchors on hard tasks; this rule closes that gap. Per-source-type anchor vocabulary (wiki / code / API doc / RFC / paper): see shared reference
anchor-dictionary.md(suite-internal) if available.
5. Source-Verify Loop (required for citation tasks)
After drafting an answer that cites numbers, model names, or verbatim quotes, run this 3-step loop before declaring done:
- Identify: list every numeric / proper-noun citation in your answer (e.g. "31.71%", "Claude Opus 4.6", "≥ 100 lines").
- Grep: for each citation, run
grep -i "<number-or-name>" <source-path>. If absent → the citation is fabricated. - Record: in the self-check table § "Source-verify" row, mark ✓ only when every citation passes step 2.
| Failure mode caught | Example from v0.2.1 benchmark |
|---|---|
| Cross-paper number mis-attribution | Q06: "5.6× cognitive load" attributed to P06/P07 (actually a benchmark-internal number) |
| Fabricated benchmark scores | Q12: "Opus+HumanLayer=55%, Haiku=40%" — neither in source |
| Non-existent model version | Q13: "Claude 3.5 Opus" — no such Anthropic-published version |
| Invented latency / token claims | Q17: "p99 < 100ms / 500K tokens" — not in P08 |
| Vague paper-grounded targets | Q18: "failure rate ↓ 60–80%" — paper has no such range |
If any cited number fails grep, rewrite that paragraph before completion. Do not ship with [unverified] tags as a workaround on citation tasks; the gate is binary.
Empirical basis: 2026-05-06 v0.2.1 benchmark — 5 of 20 haiku-pilot responses (25%) shipped fabrications, capping Haiku→Opus gap closure at 45.3%. Adding this gate is the highest-ROI lever per gap analysis.
6. Content-First Structure (not Template-First)
Forbidden: filling every question with a fixed sub-heading template; restating in prose what the table already says.
Use instead: the question's nature determines the structure:
- Single decision → bulleted list
- ≥ 3 anti-patterns → 1 combined table (with "trigger / solution / Lab vs Prod" columns), NOT N separate tables
- Multi-option comparison → table lists only "final decision + exclusion rationale", does NOT enumerate all rejected options
- Emoji budget: ≤ 3 per section (✅/❌/🔴); neutral statements get none
Empirical basis: template-first wastes ~30% tokens AND lowers detail density vs content-first.
Per-Task Router (decision table → existing skills)
This router does NOT duplicate the full skill directory. See your workspace's skill resolver / index for the full list.
Task-Type Fast-Path (skip mandatory pre-flight when overhead > value)
| Task type | Required pre-flight | Self-check |
|---|---|---|
| Easy recall (≤ 100w answer; ≤ 5 facts; single source; pure definition / number lookup) | #4 Citation Anchor only (no #2 Think-Before, no #3 Diff-Review) | single-line fast-path: <reason> |
| Wiki / ref citation (multi-paragraph extraction with anchors) | #4 Citation Anchor + #5 Source-Verify | full single-table self-check |
| Code task (any LoC) | #1 Reference Pattern + #2 Think-Before-Coding + #3 Diff-Review | n/a (code task, not citation) |
| Architecture / counter-factual / synthesis | All 6 pre-flight checks | mandatory single-table |
Empirical basis: 2026-05-06 v0.2.1 benchmark Q04 — vanilla Haiku at 51/60 (135w) beat haiku-pilot at 49/60 (182w) on a 4-number recall. Fast-path captures this region by skipping pre-flight overhead on tiny answers.
Coding tasks
| Task | Sub-agent (suggested) | Model |
|---|---|---|
| Single-file ≤ 30 LoC surgical edit | haiku-implementer (or your equivalent) | Haiku 4.5 |
| Cross-module / > 30 LoC / requires design | implementer | Sonnet 4.6 |
| Bug with failing test | bugfix skill (if installed) | Haiku 4.5 |
| Bug without test | debug skill (if installed) | Haiku → Sonnet if stuck |
| Test writing | test-writer | Sonnet 4.6 |
Research tasks
| Task | Sub-agent | Model |
|---|---|---|
| ≥ 10-file fact-finding | researcher | Haiku 4.5 |
| Codebase structure inventory | architecture-explorer | Haiku 4.5 |
Review tasks
| Task | Sub-agent | Model |
|---|---|---|
| Pre-commit multi-dimension review | /deep-review | mixed (parallel) |
| Single-file / single-function review | quick-code-reviewer | Sonnet 4.6 |
| Cross-module architecture / tech selection | reviewer | Opus 4.7 |
| Auth / payment / user-data touchpoints | security-reviewer | Sonnet 4.6 |
| OWASP / proactive threat modeling | security-auditor | Sonnet 4.6 |
Docs tasks
| Task | Sub-agent | Model |
|---|---|---|
| README / CHANGELOG / API docs | doc-writer | Haiku 4.5 |
| Major doc rewrite | doc-writer | Sonnet 4.6 |
Customize for your workspace: replace agent names with the agents that exist in your
.claude/agents/directory. If you don't have specialized agents yet, thegeneral-purposeagent + a model override works