Haiku Pilot — Default-Haiku Execution Playbook

Thesis

Weak planner + strong delegation > strong planner doing everything.

AgentOpt (arxiv 2604.06296): On HotpotQA, Opus alone = 31.71%; Ministral 3 8B planner + Opus solver = 74.27%. Augment Code eval (2026-04): Best-quality AGENTS.md provides performance gain equivalent to one model tier (Haiku → Sonnet effective).

→ Default parent session to Haiku 4.5; aggressively delegate to sub-agents; escalate only on quantitative gate trigger.

"Documentation / process design improvements ≥ model upgrade" is the engineering basis. Blindly upgrading models in multi-stage pipelines often reduces performance (planner doesn't delegate).

Per-Session Pre-flight (run once per task)

1. Reference Pattern Technique

❌ Verbal description	✅ Point to a file
"Write a login-like function"	"Follow the pattern in `src/auth/login.ts`"
"Add a hook similar to useUser"	"Pattern: `hooks/useUser.ts:42-78`"

Why it matters: Haiku diverges easily from abstract descriptions; pointing to a concrete file = strongest possible context.

2. Think-Before-Coding (compensates Haiku tendency)

Before implementation, output:

1–2 sentence task understanding restatement (your interpretation, not parroting)
Key assumptions (≥ 1, prefixed with "Assumption: ...")
If multiple interpretations are reasonable → list options for user confirmation, don't pick one yourself

Source: core.md § "Think-Before-Coding". Haiku tends to skip this step; this SKILL enforces it.

3. Diff-Review Pledge

Before declaring "complete", run:

git diff --stat
git status

Explain each change in one sentence. Catches Haiku quietly modifying out-of-scope files (surgical-changes discipline).

4. Citation Anchor Enforcement

When citing wiki / docs / spec for numbers, conclusions, or anti-patterns, attach a structured anchor:

(P0X §Y.Z) — paper section ref (e.g. (P03 §4.2)) — preferred for paper-grounded citations
(<source> Step N) — wiki step number
(anti-pattern #N) — wiki anti-pattern number
(<file>:line-range) — code / config line ref

Anchor count requirement (per answer paragraph):

Easy / medium tasks: ≥ 3 structured anchors
Hard tasks (architecture / counter-factual / synthesis): ≥ 5 structured anchors — (P0X §Y.Z) format strongly preferred

If no precise anchor available → tag [unverified]. Never fake a source. Violation: retry that paragraph; do NOT declare done.

Empirical basis: anchor density discrepancy is the primary driver of citation accuracy variance between Haiku and Sonnet/Opus on benchmark tasks. 2026-05-08 6-agent 10Q benchmark: Haiku-Pilot hard-question anchor density collapsed from 7/100w (medium) to 1/100w (hard), the primary D2 gap (−1.0) vs vanilla Opus. Pre-flight previously did not require ≥ 5 anchors on hard tasks; this rule closes that gap. Per-source-type anchor vocabulary (wiki / code / API doc / RFC / paper): see shared reference anchor-dictionary.md (suite-internal) if available.

5. Source-Verify Loop (required for citation tasks)

After drafting an answer that cites numbers, model names, or verbatim quotes, run this 3-step loop before declaring done:

Identify: list every numeric / proper-noun citation in your answer (e.g. "31.71%", "Claude Opus 4.6", "≥ 100 lines").
Grep: for each citation, run grep -i "<number-or-name>" <source-path>. If absent → the citation is fabricated.
Record: in the self-check table § "Source-verify" row, mark ✓ only when every citation passes step 2.

Failure mode caught	Example from v0.2.1 benchmark
Cross-paper number mis-attribution	Q06: "5.6× cognitive load" attributed to P06/P07 (actually a benchmark-internal number)
Fabricated benchmark scores	Q12: "Opus+HumanLayer=55%, Haiku=40%" — neither in source
Non-existent model version	Q13: "Claude 3.5 Opus" — no such Anthropic-published version
Invented latency / token claims	Q17: "p99 < 100ms / 500K tokens" — not in P08
Vague paper-grounded targets	Q18: "failure rate ↓ 60–80%" — paper has no such range

If any cited number fails grep, rewrite that paragraph before completion. Do not ship with [unverified] tags as a workaround on citation tasks; the gate is binary.

Empirical basis: 2026-05-06 v0.2.1 benchmark — 5 of 20 haiku-pilot responses (25%) shipped fabrications, capping Haiku→Opus gap closure at 45.3%. Adding this gate is the highest-ROI lever per gap analysis.

6. Content-First Structure (not Template-First)

Forbidden: filling every question with a fixed sub-heading template; restating in prose what the table already says.

Use instead: the question's nature determines the structure:

Single decision → bulleted list
≥ 3 anti-patterns → 1 combined table (with "trigger / solution / Lab vs Prod" columns), NOT N separate tables
Multi-option comparison → table lists only "final decision + exclusion rationale", does NOT enumerate all rejected options
Emoji budget: ≤ 3 per section (✅/❌/🔴); neutral statements get none

Empirical basis: template-first wastes ~30% tokens AND lowers detail density vs content-first.

Per-Task Router (decision table → existing skills)

This router does NOT duplicate the full skill directory. See your workspace's skill resolver / index for the full list.

Task-Type Fast-Path (skip mandatory pre-flight when overhead > value)

Task type	Required pre-flight	Self-check
Easy recall (≤ 100w answer; ≤ 5 facts; single source; pure definition / number lookup)	#4 Citation Anchor only (no #2 Think-Before, no #3 Diff-Review)	single-line `fast-path: <reason>`
Wiki / ref citation (multi-paragraph extraction with anchors)	#4 Citation Anchor + #5 Source-Verify	full single-table self-check
Code task (any LoC)	#1 Reference Pattern + #2 Think-Before-Coding + #3 Diff-Review	n/a (code task, not citation)
Architecture / counter-factual / synthesis	All 6 pre-flight checks	mandatory single-table

Empirical basis: 2026-05-06 v0.2.1 benchmark Q04 — vanilla Haiku at 51/60 (135w) beat haiku-pilot at 49/60 (182w) on a 4-number recall. Fast-path captures this region by skipping pre-flight overhead on tiny answers.

Coding tasks

Task	Sub-agent (suggested)	Model
Single-file ≤ 30 LoC surgical edit	`haiku-implementer` (or your equivalent)	Haiku 4.5
Cross-module / > 30 LoC / requires design	`implementer`	Sonnet 4.6
Bug with failing test	`bugfix` skill (if installed)	Haiku 4.5
Bug without test	`debug` skill (if installed)	Haiku → Sonnet if stuck
Test writing	`test-writer`	Sonnet 4.6

Research tasks

Task	Sub-agent	Model
≥ 10-file fact-finding	`researcher`	Haiku 4.5
Codebase structure inventory	`architecture-explorer`	Haiku 4.5

Review tasks

Task	Sub-agent	Model
Pre-commit multi-dimension review	`/deep-review`	mixed (parallel)
Single-file / single-function review	`quick-code-reviewer`	Sonnet 4.6
Cross-module architecture / tech selection	`reviewer`	Opus 4.7
Auth / payment / user-data touchpoints	`security-reviewer`	Sonnet 4.6
OWASP / proactive threat modeling	`security-auditor`	Sonnet 4.6

Docs tasks

Task	Sub-agent	Model
README / CHANGELOG / API docs	`doc-writer`	Haiku 4.5
Major doc rewrite	`doc-writer`	Sonnet 4.6

Customize for your workspace: replace agent names with the agents that exist in your .claude/agents/ directory. If you don't have specialized agents yet, the general-purpose agent + a model override works

haiku-pilot

How to add

Drop this on your repo README

Related skills

Fabric

task-intelligence

task-intelligence

task-intelligence

Get new Produtividade skills every Monday

Haiku Pilot — Default-Haiku Execution Playbook

Thesis

Per-Session Pre-flight (run once per task)

1. Reference Pattern Technique

2. Think-Before-Coding (compensates Haiku tendency)

3. Diff-Review Pledge

4. Citation Anchor Enforcement

5. Source-Verify Loop (required for citation tasks)

6. Content-First Structure (not Template-First)

Per-Task Router (decision table → existing skills)

Task-Type Fast-Path (skip mandatory pre-flight when overhead > value)

Coding tasks

Research tasks

Review tasks

Docs tasks

Comments · No comments