Opus Pilot — Ceiling-Elevation Execution Playbook
Thesis
Vanilla Opus is a strong solo solver. Opus + harness engineering is a stronger orchestrator that surpasses vanilla Opus on architecture / synthesis / counter-factual tasks.
- AgentOpt (P10 / arxiv 2604.06296): On HotpotQA, Opus solo = 31.71%; Ministral planner + Opus solver = 74.27% (+42.56pp). Individual model capability does NOT predict ensemble performance. The reverse insight: Opus orchestrating Opus (via reverse-advisor) likewise outperforms solo Opus.
- Meta-Harness (P03 / arxiv 2603.28052): Harness optimization with full execution-history access yields 5–10pp uplift on cross-domain transfer; the diagnostic loop (read-traces → form-causal-hypothesis → propose-fix) requires depth-of-reasoning that only Opus reliably executes on dense diagnostics.
- Confucius (P02 / arxiv 2512.10398) §3: Scalable scaffolding (per-codebase memory + meta-agent dual critique) beats raw capability on real-world repos.
- CAR (P07 / preprints 202603.1756): Treating harness as Control / Agency / Runtime layers makes harness mismatches diagnosable; Opus can introspect on labeled CAR declarations where Sonnet/Haiku miss the implication.
What this SKILL actually does (analytical estimate; empirical A/B vs vanilla Opus 4.7 to be verified per § Verification):
- On hard analytical tasks (architecture, counter-factual, synthesis) → +8–15% quality uplift over vanilla Opus via Reverse-Advisor + Parallel Hypotheses + CAR explicitness.
- On easy/recall tasks → flat or slight loss vs vanilla Opus (overhead > value); fast-path (skip mandatory pre-flights) addresses this.
- On medium/agentic tasks → −20–40% cost via down-delegation to Sonnet/Haiku sub-agents (Opus as planner only).
→ Opus + this SKILL ≠ "always more thinking". It's structured thinking + explicit delegation. Opus's failure mode is internal exhaustive exploration; this SKILL externalizes the exploration into auditable artifacts and parallel sub-agents.
Cost ratio: Opus ($15 / $75) is 5× Sonnet, 15× Haiku. Opus mode should fire only when the task requires Opus reasoning depth. Use sonnet-pilot or haiku-pilot for the other 80% of tasks.
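The cost ratios above can be turned into a quick back-of-envelope estimate of what down-delegation saves. The sketch below is illustrative only: the 60/40 token split is a hypothetical workload, not a measurement, and the function name is an assumption introduced here.

```python
# Relative per-token cost with Opus normalized to 1.0, using the 5x / 15x ratios stated above.
OPUS_COST_RATIO = {"opus": 1.0, "sonnet": 1 / 5, "haiku": 1 / 15}


def relative_cost(token_share: dict) -> float:
    """Blended cost relative to running every token on Opus.

    token_share maps model name -> fraction of total tokens (fractions must sum to 1).
    """
    if abs(sum(token_share.values()) - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(share * OPUS_COST_RATIO[model] for model, share in token_share.items())


# Hypothetical split: Opus plans with 60% of tokens, Sonnet sub-agents handle 40%.
blended = relative_cost({"opus": 0.6, "sonnet": 0.4})
savings = 1 - blended
print(f"blended cost = {blended:.2f}x Opus, savings = {savings:.0%}")
```

Under this assumed split the saving is about 32%, which sits inside the 20–40% range estimated for medium/agentic tasks; a different split would move the number accordingly.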
The Five Mechanisms (Beyond-Vanilla Levers)
Mechanism #1 — Reverse-Advisor Loop
Trigger: any global-architecture / cross-module / synthesis / counter-factual / security-design decision.
Procedure:
- Opus drafts the decision (Choice + Rationale + Rejected options).
- Call advisor() (which is itself Opus with fresh context).
- The advisor sees the full transcript; it acts as peer reviewer, not as fallback.
- If advisor disagrees → reconcile via one more advisor call ("I found X, you suggest Y; which constraint breaks the tie?").
- Only declare done after advisor concurs OR explicitly notes "remaining disagreement is judgment-bound".
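The control flow of the procedure above can be sketched as a small loop. This is a minimal sketch under stated assumptions: the `Review` type, the `reverse_advisor_loop` name, and the shape of the advisor's reply are hypothetical, and the advisor is passed in as a plain callable rather than the real advisor() tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Hypothetical shape of an advisor reply (not a real advisor() return type)."""
    concurs: bool
    judgment_bound: bool = False  # advisor notes the remaining disagreement is judgment-bound
    note: str = ""


def reverse_advisor_loop(
    draft: str,
    advisor: Callable[[str], Review],
    max_calls: int = 2,  # one review plus one reconcile call, as in the procedure
) -> Tuple[bool, List[str]]:
    """Return (done, log); done is True only on concurrence or a judgment-bound note."""
    log: List[str] = []
    prompt = draft
    for call in range(1, max_calls + 1):
        review = advisor(prompt)
        log.append(f"call {call}: concurs={review.concurs} judgment_bound={review.judgment_bound}")
        if review.concurs or review.judgment_bound:
            return True, log
        # Disagreement: build the tie-breaking question for the next call.
        prompt = f"I found: {draft}\nYou suggest: {review.note}\nWhich constraint breaks the tie?"
    return False, log


# Stubbed advisor: disagrees once, then concurs after the reconcile call.
replies = iter([Review(concurs=False, note="use a queue"), Review(concurs=True)])
done, log = reverse_advisor_loop("Choice: direct RPC", lambda _prompt: next(replies))
```

Returning `False` when the call budget runs out keeps the "only declare done" rule explicit: the caller has to handle an unresolved disagreement rather than silently proceeding.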
Why Opus-specific: Sonnet/Haiku call advisor() to upgrade; Opus calls advisor() for peer review. The cost is 1× (Opus → Opus), justified on architecture-grade decisions. AgentOpt principle inverted: orchestrated > solo, even at the ceiling tier. (P10 AgentOpt 31.71% → 74.27%; P02 Confucius §4 dual-critique meta-agent)
Anti-pattern: skipping advisor() because "Opus already thought about it carefully". The advisor sees blind spots Opus's depth-first reasoning misses.
Mechanism #2 — Parallel Hypotheses + Synthesis (Multi-Sample)
Trigger: harness-design / system-architecture / "what's the right approach" tasks.
Procedure:
- Generate N=3 candidate approaches in parallel (single message, multiple Agent calls). Each sub-agent produces a complete proposal.
- Read all N. Rank on agreed criteria (e.g., simplicity, correctness, ablation-friendliness).
- Synthesize the best features into a final design — not "pick the best one", but "compose the strongest from N".
- Document which features came from which candidate (citation traceability).
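The fan-out, rank, and compose steps above might look like the following. It is a sketch only: `propose` is a hypothetical stand-in for a sub-agent call and returns canned scores, and the per-criterion "take the top scorer" rule is one simple way to compose, not a prescribed method.

```python
from concurrent.futures import ThreadPoolExecutor

# Criteria taken from the ranking step above.
CRITERIA = ("simplicity", "correctness", "ablation_friendliness")


def propose(candidate_id: int) -> dict:
    """Hypothetical stand-in for one sub-agent call; returns a scored proposal."""
    canned = {
        0: {"simplicity": 3, "correctness": 1, "ablation_friendliness": 2},
        1: {"simplicity": 1, "correctness": 3, "ablation_friendliness": 2},
        2: {"simplicity": 2, "correctness": 2, "ablation_friendliness": 3},
    }
    return {"id": candidate_id, "scores": canned[candidate_id]}


def synthesize(proposals: list) -> dict:
    """For each criterion, take the feature from the top-scoring candidate.

    The returned mapping doubles as the traceability record: criterion -> source candidate.
    """
    design = {}
    for criterion in CRITERIA:
        best = max(proposals, key=lambda p: p["scores"][criterion])
        design[criterion] = f"candidate-{best['id']}"
    return design


# Generate N=3 candidates in parallel, then compose the strongest features.
with ThreadPoolExecutor(max_workers=3) as pool:
    proposals = list(pool.map(propose, range(3)))

design = synthesize(proposals)
print(design)
```

Here each candidate wins on a different criterion, so the final design draws one feature from each, which is the "compose, don't pick" outcome the procedure aims for.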
Why Opus-specific: Sonnet/Haiku parallel sampling produces lower variance (≈ same answer 3×); Opus produces meaningfully different proposals that benefit from synthesis. AgentFlow §3.2 structured DSL constraint hints at this — multi-agent variants outperform single-agent runs. (P01 AgentFlow §3; P03 Meta-Harness ablation: 2–3 components drive 80% of gains, so synthesis must remain selective)
Anti-pattern: generating 7 candidates + writing a 1500-word tradeoff matrix. The cost exceeds the value past N=3.
Mechanism #3 — CAR Explicitness (Three-Layer Harness Verbalization)
Trigger: any harness-design / agent-architecture / runbook / tool-config task.
Procedure: structure the answer (and any artifacts produced) using Control / Agency / Runtime labels:
- Control: policies (when does X fire? what's the decision rule?)
- Agency: tool access (which tools, with what permissions, fan-out vs serial)
- Runtime: memory / compaction / state-store / observability
Why Opus-specific: Opus excels at introspecting on labeled structures; CAR labels enable self-diagnosis of harness mismatches mid-task ("The bug is in Runtime — observability is missing for state X"). Haiku/Sonnet read CAR labels but don't use them for self-diagnosis. (P07 CAR framework; P04 NLAH §3.2 explicit contracts)
Concrete artifacts to produce when this mechanism fires:
- harness.md (or inline section): three sub-headings labeled ## Control, ## Agency, ## Runtime.
- Any cross-component dependency must cite the CAR label of the dependency target.
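A lightweight check that a harness.md artifact carries all three CAR sub-headings could look like this. The function name and the sample text are hypothetical, and the check assumes each label appears as a level-2 heading on its own line.

```python
import re

REQUIRED_SECTIONS = ("Control", "Agency", "Runtime")


def missing_car_sections(markdown_text: str) -> list:
    """Return the CAR labels that do not appear as '## <Label>' headings."""
    found = set(re.findall(r"^##[ \t]+(\w+)[ \t]*$", markdown_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Illustrative artifact that forgot the Agency layer.
sample = """## Control
Retry fires after one failed attempt.

## Runtime
State is kept in a local store.
"""

print(missing_car_sections(sample))
```

Running a check like this before declaring the artifact complete makes a missing layer visible early, rather than during a later mismatch diagnosis.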
Mechanism #4 — Decision-Log Externalization (Live Journal)
Trigger: any task with ≥ 3 non-trivial decisions (most architecture / debugging / refactor tasks).
Procedure: maintain a live decision journal indexed by step number. Format (strict — no prose paragraphs):
Step <N>: <decision-name>
Choice: <what you decided>
Rejected: <option considered but excluded> — Reason: <one sentence>
The journal is part of the deliverable. Sub-agents (e.g., reviewer) audit the journal without re-reading full transcript.
Why Opus-specific: for Sonnet, a decision log enforces behavior that Opus exhibits naturally; for Opus, the lever is externalization: making the natural reasoning durable and reviewable. Token overhead < 5% on Opus due to its existing verbosity; on Haiku/Sonnet this would be 15–25% overhead. (P04 NLAH §6.1 artifact-backed closure; P08 OpenDev §8.3 transparency over abstraction)
Strict format violation triggers retry (Opus tendency: paragraph-long journal entries). One sentence per Reason line, full stop.
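The strict format can be enforced mechanically before a retry is triggered. The validator below is a sketch, assuming one entry is exactly the three lines shown in the format; the function names are hypothetical and the one-sentence test is a heuristic, not a full sentence parser.

```python
import re

# "\u2014" is the dash that separates the rejected option from its reason in the format above.
STEP = re.compile(r"^Step (\d+): \S.*$")
CHOICE = re.compile(r"^Choice: \S.*$")
REJECTED = re.compile(r"^Rejected: (?P<option>.+?) \u2014 Reason: (?P<reason>.+)$")


def is_one_sentence(text: str) -> bool:
    """Heuristic: no sentence-ending punctuation followed by more text."""
    return re.search(r"[.!?]\s+\S", text.strip()) is None


def validate_entry(lines: list) -> list:
    """Return a list of format violations for one three-line journal entry."""
    if len(lines) != 3:
        return [f"expected 3 lines, got {len(lines)}"]
    errors = []
    if not STEP.match(lines[0]):
        errors.append("line 1 must be 'Step <N>: <decision-name>'")
    if not CHOICE.match(lines[1]):
        errors.append("line 2 must be 'Choice: <what you decided>'")
    match = REJECTED.match(lines[2])
    if not match:
        errors.append("line 3 must be 'Rejected: <option> \u2014 Reason: <one sentence>'")
    elif not is_one_sentence(match.group("reason")):
        errors.append("Reason must be a single sentence")
    return errors


good = [
    "Step 1: pick-store",
    "Choice: SQLite",
    "Rejected: Postgres \u2014 Reason: Adds an operational dependency.",
]
bad = good[:2] + ["Rejected: Postgres \u2014 Reason: Too heavy. Also slow to provision."]
```

An empty list from `validate_entry` means the entry passes; any non-empty list can be fed back verbatim as the retry instruction.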
Mechanism #5 — Meta-Harness Filesystem Observability Loop
Trigger: a task that has failed at least one attempt; harness-iteration / "this isn't working" diagnostic tasks.
Procedure:
- Persist raw execution traces (tool calls, errors, intermediate outputs) to filesystem (/tmp/<task-id>/trace-<N>.log).
- On failure, Opus reads the trace directly (not summarized). Forms causal hypothesis: "The failure is at trace step K because <invariant violated>".
- Propose single targeted fix (not 5-option redesign).
- Re-attempt. If fails again → escalate to advisor() (Mechanism #1).
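The persistence and trace-reading steps can be sketched as follows. Everything here is illustrative: the helper names, the task id, and the convention that failing lines start with "ERROR" are assumptions, and the example writes under a temporary directory rather than /tmp directly so it can run anywhere. Locating step K is only the mechanical part; forming the causal hypothesis remains the model's job.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def persist_trace(root: Path, task_id: str, attempt: int, events: List[str]) -> Path:
    """Write raw events to <root>/<task-id>/trace-<N>.log, one event per line."""
    trace_dir = root / task_id
    trace_dir.mkdir(parents=True, exist_ok=True)
    path = trace_dir / f"trace-{attempt}.log"
    path.write_text("\n".join(events) + "\n", encoding="utf-8")
    return path


def first_failing_step(trace_path: Path) -> Optional[int]:
    """Return the 1-based step K of the first line marked ERROR in the raw trace.

    The 'ERROR' prefix is an assumed logging convention, not part of the playbook.
    """
    lines = trace_path.read_text(encoding="utf-8").splitlines()
    for step, line in enumerate(lines, start=1):
        if line.startswith("ERROR"):
            return step
    return None


# Illustrative run: persist attempt 1, then locate the failing step from the raw file.
root = Path(tempfile.mkdtemp())
trace_path = persist_trace(
    root, "task-42", 1, ["CALL read_file", "ERROR file not found", "CALL retry"]
)
failing_step = first_failing_step(trace_path)
print(trace_path.name, failing_step)
```

Keeping one file per attempt means a second failure can be compared against the first trace before escalating to advisor().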
Why Opus-specific: Opus reasoning on 10K+ tokens of raw diagnostic data is the unique capability; Sonnet/Haiku struggle past 5K diagnostic tokens. Meta-Harness P03 §4 showed full-history access gives 5–10pp uplift, conditional on the proposer being able to reason over that history. (P03 Meta-Harness §4; P02 Confucius §3.1 per-codebase persistent memory)
Anti-pattern: re-running the failed task with verbal "I'll be more careful this time". No causal hypothesis = no progress.
Per-Session Pre-flight (run once per task — task-type-conditioned)
Always-On Pre-flights
A. Ambiguity Surfacing (counters Opus over-confidence failure mode)
Before substantive work, if the task admits ≥ 2 reasonable i