Sonnet Pilot — Quality-First Execution Playbook
Thesis
Deep context engineering + precise delegation = quality floor protection + predictable escalation.
Augment Code eval (2026-04): Best-quality AGENTS.md provides performance gain equivalent to one model tier (Haiku→Sonnet or Sonnet→Opus effective). Bad documentation is worse than no documentation, by ~30%. AgentOpt (arxiv 2604.06296): On HotpotQA, strongest planner alone (Opus full stack) = 31.71%; weak planner + strong solver (layered delegation) = 74.27%. Individual model capability does NOT predict ensemble performance.
What this SKILL actually does (empirically validated, A/B/C benchmark, n=5 wiki tasks, Opus 4.7 graded):
- Sonnet + SKILL = 283/300; Sonnet Plain = 283/300; Opus = 286/300
- SKILL net gain on routine extraction = 0: on structured wiki tasks, Sonnet already performs near-Opus without this playbook
- SKILL value = floor protection on hard tasks: complex tasks with ambiguous requirements, multi-step agentic, cross-module design — where Sonnet without structure tends to miss assumptions, skip self-review, or produce over-engineered output
→ Think of this SKILL as insurance: no observable premium on easy days; prevents 10–30% quality loss on hard days. → If your tasks are mostly "extract facts from a structured document" → you may not see measurable gain; activate this for complex implementation and agentic tasks. → Sonnet as planner + advisor()/reviewer (Opus) on-demand > Opus full-stack for most tasks.
Cost ratio: Sonnet ($3/$15) vs Opus ($15/$75). Quality first does NOT mean always-Opus; Opus only fires when quantitative gates trigger.
Per-Session Pre-flight (run once per session)
1. Reference Pattern Technique
| ❌ Abstract description | ✅ Point to a file |
|---|---|
| "Write a validation module" | "Follow the pattern in src/auth/validator.ts" |
| "Add a hook similar to useUser" | "Pattern: hooks/useUser.ts:42-78" |
| "Reference the deploy flow we used last time" | "Follow <your-playbook>.md Step 3-6" |
Sonnet-specific: Sonnet given an abstract description tends to design from scratch; concrete reference = narrows design space to "variant of correct pattern", not "invent from zero".
2. Decision-Log Requirement (Opus has this built-in; Sonnet must enforce)
For every non-obvious technical decision, before completion output 2 lines:
Choice: <what you decided>
Rejected: <option considered but excluded> — Reason: <one sentence>
Why it matters: Opus naturally does this in reasoning; Sonnet tends to skip the intermediate decision record and emit results directly, making reasoning quality undiffable and review hard. Decision-log is a synthetic substitute for Opus's deeper thinking, externalized.
3. Reasoning Chain Before Code
Before implementation, output (don't skip):
- 1-2 sentence task understanding (your interpretation, not parroting)
- Key assumptions (≥ 1, prefixed with "Assumption: ...")
- Implementation plan (bullet list, ≤ 5 steps)
- If multiple reasonable interpretations exist → list options for user confirmation, don't pick one yourself
Source:
core.md§ "Think-Before-Coding". Sonnet has higher confidence than Haiku, paradoxically MORE prone to skipping assumption verification; this pre-flight enforces it.
4. Self-Review Loop (quality gate)
Before declaring "complete":
git diff --stat # confirm change scope
git status # confirm no unintended additions
Explain each modified file in one sentence. Then:
- Any change > 30 LoC → dispatch
quick-code-reviewersub-agent for post-implementation review - Touches auth/payment/user-data → dispatch
security-reviewer - Architecture or cross-module design → call
advisor()before declaring done
Sonnet-specific: Sonnet's high confidence is a double-edged sword — easy to skip self-review. This pledge is the externally bolted-on "doubt mechanism".
5. Intermediate Checkpoint (multi-step tasks)
For tasks with ≥ 3 independent steps, pause and verify after each:
- Output: "Completed Step N / M: <one-sentence result>"
- Confirm intermediate output matches expectation before proceeding
- If anything unexpected → stop and ask, don't power through
Empirical basis: AgentOpt experiments show step roles in multi-step pipelines matter more than overall model capability; Sonnet's failure mode is "reasoning without verification checkpoints".
6. Mid-Write Outline Verification (architecture / counter-factual / synthesis tasks)
Trigger: tasks tagged Architecture / Counter-factual / Synthesis (per § Task-Type Fast-Path table).
Procedure (run once at ~200 words drafted, before continuing):
| Check | Action on fail |
|---|---|
| Still inside the question's scope? | Drop tangential paragraphs |
| Projected total ≤ 800w? (current_wc × est_remaining_sections) | Cut outline; merge sections |
| All N mandatory citations / subquestions covered? (list them; checkmark) | Add missing items first; do not write more body until covered |
Why this exists: Pre-flight #4 (Self-Review Loop, git diff --stat) fires after the answer is written. By that point, runaway answers are already 700–800 words. Mid-write checkpoint catches scope drift / missed mandatory citations while still cheap to fix.
Empirical basis: 2026-05-06 v0.2.1 benchmark Q18 — sonnet-pilot wrote 750 words on a CLAUDE.md refactor plan but omitted 2 of 3 required verbatim numbers (−28.64%, #33→#5). A 200-word checkpoint listing required citations would have surfaced the omission with 550 words still to write. Net loss: −6 vs Opus baseline.
Skip when: Easy recall, Wiki/ref extraction, or Code implementation tasks (where the gate adds overhead with no expected catch).
7. Source-Verify (Line-Level, required for paper / wiki / spec citation tasks)
After drafting any answer that cites numbers, model names, verbatim quotes, or paper conclusions, run this 4-step loop before declaring done:
- Identify: list every numeric / proper-noun / verbatim citation in your draft.
- Locate: for each, run
grep -in "<term>" <source-path>to obtain the actual line number. - Annotate: replace inline anchor with
(P0X §Y.Z:LineN)format — line number is the key Sonnet-Pilot upgrade vs haiku-pilot's section-only verification. - Verify: re-read the source at that line; if the context contradicts your phrasing → rewrite the citation, do not append a hedge.
Hard gate: every numeric citation in a hard-task answer must include :LineN. Citations without line numbers fail D1 audit and require a retry.
Empirical basis: 2026-05-08 6-agent 10Q benchmark — A3 (Opus+Pilot) earned D1 = 10/10 because every citation included grep-verified line numbers; A2 (Sonnet+Pilot) earned D1 = 9.5/10 because Source-Verify was section-level only. Closing this 0.5-point gap is the highest-ROI Sonnet-Pilot lever per § Verification analysis.
Per-Task Router
This router does NOT duplicate the full sub-agent dispatch table (see
subagent-strategy.md). It only flags the "Sonnet quality mode" deltas.
Task-Type Fast-Path
Identify task type first — skip inapplicable pre-flight steps to reduce overhead:
| Task type | Required pre-flight | Skip |
|---|---|---|
| Easy recall (≤ 100w answer; ≤ 5 facts; single source; pure definition / number lookup) | None mandatory; replace self-check with single-line fast-path: <reason> | All — direct answer |
| Wiki/ref extraction (summarize, extract, list from a document) | #1 Reference Pattern, G-02 Upgrade Eval, G-03 Anti-pattern Check, Self-check template | #3 Reasoning Chain, #5 Intermediate Checkpoint |
| Code implementation (bug fix, feature, refactor) | #2 Decision-Log, #3 Reasoning Chain, #4 Self-Review Loop | G-02, G-03, Self-check template |
| Multi-step agentic ( |