Sonnet Pilot — Quality-First Execution Playbook

Thesis

Deep context engineering + precise delegation = quality floor protection + predictable escalation.

Augment Code eval (2026-04): Best-quality AGENTS.md provides performance gain equivalent to one model tier (Haiku→Sonnet or Sonnet→Opus effective). Bad documentation is worse than no documentation, by ~30%. AgentOpt (arxiv 2604.06296): On HotpotQA, strongest planner alone (Opus full stack) = 31.71%; weak planner + strong solver (layered delegation) = 74.27%. Individual model capability does NOT predict ensemble performance.

What this SKILL actually does (empirically validated, A/B/C benchmark, n=5 wiki tasks, Opus 4.7 graded):

Sonnet + SKILL = 283/300; Sonnet Plain = 283/300; Opus = 286/300
SKILL net gain on routine extraction = 0: on structured wiki tasks, Sonnet already performs near-Opus without this playbook
SKILL value = floor protection on hard tasks: complex tasks with ambiguous requirements, multi-step agentic, cross-module design — where Sonnet without structure tends to miss assumptions, skip self-review, or produce over-engineered output

→ Think of this SKILL as insurance: no observable premium on easy days; prevents 10–30% quality loss on hard days. → If your tasks are mostly "extract facts from a structured document" → you may not see measurable gain; activate this for complex implementation and agentic tasks. → Sonnet as planner + advisor()/reviewer (Opus) on-demand > Opus full-stack for most tasks.

Cost ratio: Sonnet ($3/$15) vs Opus ($15/$75). Quality first does NOT mean always-Opus; Opus only fires when quantitative gates trigger.

Per-Session Pre-flight (run once per session)

1. Reference Pattern Technique

❌ Abstract description	✅ Point to a file
"Write a validation module"	"Follow the pattern in `src/auth/validator.ts`"
"Add a hook similar to useUser"	"Pattern: `hooks/useUser.ts:42-78`"
"Reference the deploy flow we used last time"	"Follow `<your-playbook>.md` Step 3-6"

Sonnet-specific: Sonnet given an abstract description tends to design from scratch; concrete reference = narrows design space to "variant of correct pattern", not "invent from zero".

2. Decision-Log Requirement (Opus has this built-in; Sonnet must enforce)

For every non-obvious technical decision, before completion output 2 lines:

Choice: <what you decided>
Rejected: <option considered but excluded> — Reason: <one sentence>

Why it matters: Opus naturally does this in reasoning; Sonnet tends to skip the intermediate decision record and emit results directly, making reasoning quality undiffable and review hard. Decision-log is a synthetic substitute for Opus's deeper thinking, externalized.

3. Reasoning Chain Before Code

Before implementation, output (don't skip):

1-2 sentence task understanding (your interpretation, not parroting)
Key assumptions (≥ 1, prefixed with "Assumption: ...")
Implementation plan (bullet list, ≤ 5 steps)
If multiple reasonable interpretations exist → list options for user confirmation, don't pick one yourself

Source: core.md § "Think-Before-Coding". Sonnet has higher confidence than Haiku, paradoxically MORE prone to skipping assumption verification; this pre-flight enforces it.

4. Self-Review Loop (quality gate)

Before declaring "complete":

git diff --stat   # confirm change scope
git status        # confirm no unintended additions

Explain each modified file in one sentence. Then:

Any change > 30 LoC → dispatch quick-code-reviewer sub-agent for post-implementation review
Touches auth/payment/user-data → dispatch security-reviewer
Architecture or cross-module design → call advisor() before declaring done

Sonnet-specific: Sonnet's high confidence is a double-edged sword — easy to skip self-review. This pledge is the externally bolted-on "doubt mechanism".

5. Intermediate Checkpoint (multi-step tasks)

For tasks with ≥ 3 independent steps, pause and verify after each:

Output: "Completed Step N / M: <one-sentence result>"
Confirm intermediate output matches expectation before proceeding
If anything unexpected → stop and ask, don't power through

Empirical basis: AgentOpt experiments show step roles in multi-step pipelines matter more than overall model capability; Sonnet's failure mode is "reasoning without verification checkpoints".

6. Mid-Write Outline Verification (architecture / counter-factual / synthesis tasks)

Trigger: tasks tagged Architecture / Counter-factual / Synthesis (per § Task-Type Fast-Path table).

Procedure (run once at ~200 words drafted, before continuing):

Check	Action on fail
Still inside the question's scope?	Drop tangential paragraphs
Projected total ≤ 800w? (current_wc × est_remaining_sections)	Cut outline; merge sections
All N mandatory citations / subquestions covered? (list them; checkmark)	Add missing items first; do not write more body until covered

Why this exists: Pre-flight #4 (Self-Review Loop, git diff --stat) fires after the answer is written. By that point, runaway answers are already 700–800 words. Mid-write checkpoint catches scope drift / missed mandatory citations while still cheap to fix.

Empirical basis: 2026-05-06 v0.2.1 benchmark Q18 — sonnet-pilot wrote 750 words on a CLAUDE.md refactor plan but omitted 2 of 3 required verbatim numbers (−28.64%, #33→#5). A 200-word checkpoint listing required citations would have surfaced the omission with 550 words still to write. Net loss: −6 vs Opus baseline.

Skip when: Easy recall, Wiki/ref extraction, or Code implementation tasks (where the gate adds overhead with no expected catch).

7. Source-Verify (Line-Level, required for paper / wiki / spec citation tasks)

After drafting any answer that cites numbers, model names, verbatim quotes, or paper conclusions, run this 4-step loop before declaring done:

Identify: list every numeric / proper-noun / verbatim citation in your draft.
Locate: for each, run grep -in "<term>" <source-path> to obtain the actual line number.
Annotate: replace inline anchor with (P0X §Y.Z:LineN) format — line number is the key Sonnet-Pilot upgrade vs haiku-pilot's section-only verification.
Verify: re-read the source at that line; if the context contradicts your phrasing → rewrite the citation, do not append a hedge.

Hard gate: every numeric citation in a hard-task answer must include :LineN. Citations without line numbers fail D1 audit and require a retry.

Empirical basis: 2026-05-08 6-agent 10Q benchmark — A3 (Opus+Pilot) earned D1 = 10/10 because every citation included grep-verified line numbers; A2 (Sonnet+Pilot) earned D1 = 9.5/10 because Source-Verify was section-level only. Closing this 0.5-point gap is the highest-ROI Sonnet-Pilot lever per § Verification analysis.

Per-Task Router

This router does NOT duplicate the full sub-agent dispatch table (see subagent-strategy.md). It only flags the "Sonnet quality mode" deltas.

Task-Type Fast-Path

Identify task type first — skip inapplicable pre-flight steps to reduce overhead:

Task type	Required pre-flight	Skip
Easy recall (≤ 100w answer; ≤ 5 facts; single source; pure definition / number lookup)	None mandatory; replace self-check with single-line `fast-path: <reason>`	All — direct answer
Wiki/ref extraction (summarize, extract, list from a document)	#1 Reference Pattern, G-02 Upgrade Eval, G-03 Anti-pattern Check, Self-check template	#3 Reasoning Chain, #5 Intermediate Checkpoint
Code implementation (bug fix, feature, refactor)	#2 Decision-Log, #3 Reasoning Chain, #4 Self-Review Loop	G-02, G-03, Self-check template
Multi-step agentic (

sonnet-pilot

How to add

Drop this on your repo README

Related skills

claude-api

skill-creator

oh-my-issues

claude-mem

Get new Desenvolvimento skills every Monday