Opus Pilot — Ceiling-Elevation Execution Playbook

Thesis

Vanilla Opus is a strong solo solver. Opus + harness engineering is a stronger orchestrator that surpasses vanilla Opus on architecture / synthesis / counter-factual tasks.

AgentOpt (P10 / arxiv 2604.06296): On HotpotQA, Opus solo = 31.71%; Ministral planner + Opus solver = 74.27% (+42.56pp). Individual model capability does NOT predict ensemble performance. The reverse insight: Opus orchestrating Opus (via reverse-advisor) likewise outperforms solo Opus.
Meta-Harness (P03 / arxiv 2603.28052): Harness optimization with full execution-history access yields 5–10pp uplift on cross-domain transfer; the diagnostic loop (read-traces → form-causal-hypothesis → propose-fix) requires depth-of-reasoning that only Opus reliably executes on dense diagnostics.
Confucius (P02 / arxiv 2512.10398) §3: Scalable scaffolding (per-codebase memory + meta-agent dual critique) beats raw capability on real-world repos.
CAR (P07 / preprints 202603.1756): Treating harness as Control / Agency / Runtime layers makes harness mismatches diagnosable; Opus can introspect on labeled CAR declarations where Sonnet/Haiku miss the implication.

What this SKILL actually does (analytical estimate; empirical A/B vs vanilla Opus 4.7 to be verified per § Verification):

On hard analytical tasks (architecture, counter-factual, synthesis) → +8–15% quality uplift over vanilla Opus via Reverse-Advisor + Parallel Hypotheses + CAR explicitness.
On easy/recall tasks → flat or slight loss vs vanilla Opus (overhead > value); fast-path (skip mandatory pre-flights) addresses this.
On medium/agentic tasks → −20–40% cost via down-delegation to Sonnet/Haiku sub-agents (Opus as planner only).

→ Opus + this SKILL ≠ "always more thinking". It's structured thinking + explicit delegation. Opus's failure mode is internal exhaustive exploration; this SKILL externalizes the exploration into auditable artifacts and parallel sub-agents.

Cost ratio: Opus ($15 / $75 per million input / output tokens) is 5× Sonnet, 15× Haiku. Opus mode should fire only when the task requires Opus reasoning depth. Use sonnet-pilot or haiku-pilot for the other 80% of tasks.
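The price ratios above can be turned into a rough blended-cost estimate. This is a minimal sketch, not a measured result: it assumes total token volume is unchanged by delegation (real delegation adds planning overhead), and blended_cost is a hypothetical helper name.

```python
# Relative per-token price, normalised so that Opus = 1.0.
# The ratios come from the text: Opus is 5x Sonnet and 15x Haiku.
RELATIVE_PRICE = {"opus": 1.0, "sonnet": 1 / 5, "haiku": 1 / 15}


def blended_cost(share_by_model: dict[str, float]) -> float:
    """Return the cost of a run relative to an all-Opus run.

    share_by_model maps a model name to the fraction of tokens it handles.
    Assumption: delegation does not change the total token count.
    """
    total = sum(share_by_model.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1.0")
    return sum(RELATIVE_PRICE[model] * share for model, share in share_by_model.items())


if __name__ == "__main__":
    for sonnet_share in (0.25, 0.50):
        cost = blended_cost({"opus": 1 - sonnet_share, "sonnet": sonnet_share})
        print(f"Sonnet share {sonnet_share:.0%}: relative cost {cost:.2f}")
```

Under these assumptions, moving a quarter to a half of the tokens from Opus to Sonnet lands in the 20–40% cost-reduction range quoted for medium/agentic tasks.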

The Five Mechanisms (Beyond-Vanilla Levers)

Mechanism #1 — Reverse-Advisor Loop

Trigger: any global-architecture / cross-module / synthesis / counter-factual / security-design decision.

Procedure:

Opus drafts the decision (Choice + Rationale + Rejected options).
Call advisor() (which is itself Opus with fresh context).
The advisor sees the full transcript; it acts as peer reviewer, not as fallback.
If advisor disagrees → reconcile via one more advisor call ("I found X, you suggest Y; which constraint breaks the tie?").
Only declare done after advisor concurs OR explicitly notes "remaining disagreement is judgment-bound".
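The procedure above can be sketched as a small control loop. This is an illustration under stated assumptions: the playbook names advisor() but does not give its signature, so the Review type, the Advisor callable, and reverse_advisor_loop are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    concurs: bool
    judgment_bound: bool = False  # advisor states the remaining disagreement is a judgment call
    note: str = ""


# Hypothetical signature: the advisor sees the full transcript and returns a Review.
Advisor = Callable[[list[str]], Review]


def reverse_advisor_loop(draft: str, advisor: Advisor, max_calls: int = 2) -> tuple[bool, list[str]]:
    """Run the peer-review loop and return (done, transcript).

    max_calls=2 models the initial review plus the single reconcile call.
    """
    transcript = [f"DRAFT: {draft}"]
    for call in range(1, max_calls + 1):
        review = advisor(transcript)
        transcript.append(f"ADVISOR[{call}]: {review.note}")
        if review.concurs or review.judgment_bound:
            return True, transcript
        if call < max_calls:
            # Disagreement: ask the tie-break question before the next call.
            transcript.append(
                f"RECONCILE: I proposed '{draft}', you suggest '{review.note}'; "
                "which constraint breaks the tie?"
            )
    # Still unresolved after the reconcile call: the task is not done.
    return False, transcript
```

Returning False rather than raising leaves the choice of next step (for example, escalating to a human) to the caller; the playbook does not specify that case.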

Why Opus-specific: Sonnet/Haiku call advisor() to upgrade; Opus calls advisor() for peer review. The call stays within the same price tier (1×, Opus → Opus), which is justified on architecture-grade decisions. This applies the AgentOpt principle in reverse: orchestration beats solo work even at the ceiling tier. (P10 AgentOpt 31.71% → 74.27%; P02 Confucius §4 dual-critique meta-agent)

Anti-pattern: skipping advisor() because "Opus already thought about it carefully". The advisor sees blind spots Opus's depth-first reasoning misses.

Mechanism #2 — Parallel Hypotheses + Synthesis (Multi-Sample)

Trigger: harness-design / system-architecture / "what's the right approach" tasks.

Procedure:

Generate N=3 candidate approaches in parallel (single message, multiple Agent calls). Each sub-agent produces a complete proposal.
Read all N. Rank on agreed criteria (e.g., simplicity, correctness, ablation-friendliness).
Synthesize the best features into a final design — not "pick the best one", but "compose the strongest from N".
Document which features came from which candidate (citation traceability).
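A minimal sketch of the fan-out and synthesis steps follows. It assumes each proposal can be broken into named features scored on the agreed criteria, which is a simplification of real design synthesis; Proposal, run_parallel, and synthesize are hypothetical names, and the generate callable stands in for a sub-agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    candidate: str
    features: dict[str, float]  # feature name -> score on the agreed criteria


def run_parallel(generate: Callable[[int], Proposal], n: int = 3) -> list[Proposal]:
    """Launch n candidate generators at once and collect their proposals in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(generate, range(n)))


def synthesize(proposals: list[Proposal]) -> dict[str, str]:
    """Compose a design feature by feature instead of picking one winner.

    Returns feature -> name of the candidate it was taken from, which doubles
    as the citation-traceability record.
    """
    best: dict[str, tuple[float, str]] = {}
    for proposal in proposals:
        for feature, score in proposal.features.items():
            if feature not in best or score > best[feature][0]:
                best[feature] = (score, proposal.candidate)
    return {feature: candidate for feature, (_, candidate) in best.items()}
```

The returned mapping records, for each feature in the final design, which candidate supplied it.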

Why Opus-specific: Sonnet/Haiku parallel sampling produces lower variance (≈ same answer 3×); Opus produces meaningfully different proposals that benefit from synthesis. AgentFlow §3.2 structured DSL constraint hints at this — multi-agent variants outperform single-agent runs. (P01 AgentFlow §3; P03 Meta-Harness ablation: 2–3 components drive 80% of gains, so synthesis must remain selective)

Anti-pattern: generating 7 candidates + writing a 1500-word tradeoff matrix. The cost exceeds the value past N=3.

Mechanism #3 — CAR Explicitness (Three-Layer Harness Verbalization)

Trigger: any harness-design / agent-architecture / runbook / tool-config task.

Procedure: structure the answer (and any artifacts produced) using Control / Agency / Runtime labels:

Control: policies (when does X fire? what's the decision rule?)
Agency: tool access (which tools, with what permissions, fan-out vs serial)
Runtime: memory / compaction / state-store / observability

Why Opus-specific: Opus excels at introspecting on labeled structures; CAR labels enable self-diagnosis of harness mismatches mid-task ("The bug is in Runtime — observability is missing for state X"). Haiku/Sonnet read CAR labels but don't use them for self-diagnosis. (P07 CAR framework; P04 NLAH §3.2 explicit contracts)

Concrete artifacts to produce when this mechanism fires:

harness.md (or inline section): three sub-headings labeled ## Control, ## Agency, ## Runtime.
Any cross-component dependency must cite the CAR label of the dependency target.
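As an illustration of the harness.md artifact, the sketch below renders the three labeled sections and checks an existing document for missing ones. The function names and the bullet-list body format are assumptions; only the three ## headings come from the text.

```python
import re

CAR_LABELS = ("Control", "Agency", "Runtime")


def render_harness(sections: dict[str, list[str]]) -> str:
    """Render a harness.md body with one '## <label>' heading per CAR layer."""
    missing = [label for label in CAR_LABELS if label not in sections]
    if missing:
        raise ValueError(f"missing CAR sections: {missing}")
    parts: list[str] = []
    for label in CAR_LABELS:
        parts.append(f"## {label}")
        parts.extend(f"- {item}" for item in sections[label])
        parts.append("")  # blank line between sections
    return "\n".join(parts)


def missing_car_headings(markdown: str) -> list[str]:
    """Return the CAR labels that have no '## <label>' heading in the text."""
    found = set(re.findall(r"^## (\w+)\s*$", markdown, flags=re.MULTILINE))
    return [label for label in CAR_LABELS if label not in found]
```

A check like missing_car_headings could run before a harness document is accepted, so that a missing layer is caught early.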

Mechanism #4 — Decision-Log Externalization (Live Journal)

Trigger: any task with ≥ 3 non-trivial decisions (most architecture / debugging / refactor tasks).

Procedure: maintain a live decision journal indexed by step number. Format (strict — no prose paragraphs):

Step <N>: <decision-name>
Choice: <what you decided>
Rejected: <option considered but excluded> — Reason: <one sentence>

The journal is part of the deliverable. Sub-agents (e.g., reviewer) audit the journal without re-reading the full transcript.

Why Opus-specific: for Sonnet, a decision log enforces behavior that Opus exhibits naturally; for Opus, the lever is externalization: making the natural reasoning durable and reviewable. Token overhead is under 5% on Opus because of its existing verbosity; on Haiku/Sonnet it would be 15–25%. (P04 NLAH §6.1 artifact-backed closure; P08 OpenDev §8.3 transparency over abstraction)

Any format violation triggers a retry (Opus tends toward paragraph-long journal entries). One sentence per Reason line, full stop.
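The strict format lends itself to a mechanical check. The sketch below is one possible validator, not part of the playbook: validate_journal is a hypothetical name, the one-sentence test is a punctuation heuristic, and requiring at least one Rejected line per step is an assumption.

```python
import re

EM_DASH = "\u2014"  # separator between the rejected option and its reason
STEP_RE = re.compile(r"^Step \d+: \S.*$")
CHOICE_RE = re.compile(r"^Choice: \S.*$")
REJECTED_RE = re.compile(rf"^Rejected: (.+?) {EM_DASH} Reason: (.+)$")
# Heuristic (an assumption): sentence-ending punctuation followed by more text
# means the Reason runs past one sentence.
MULTI_SENTENCE_RE = re.compile(r"[.!?]\s+\S")


def validate_journal(text: str) -> list[str]:
    """Return a list of format violations; an empty list means the journal passes."""
    errors: list[str] = []
    lines = [line for line in text.splitlines() if line.strip()]
    i = 0
    while i < len(lines):
        if not STEP_RE.match(lines[i]):
            errors.append(f"expected 'Step <N>: <decision-name>', got {lines[i]!r}")
            i += 1
            continue
        step_line = lines[i]
        i += 1
        if i < len(lines) and CHOICE_RE.match(lines[i]):
            i += 1
        else:
            errors.append(f"{step_line!r}: missing Choice line")
        rejected = 0
        while i < len(lines) and lines[i].startswith("Rejected:"):
            match = REJECTED_RE.match(lines[i])
            if match is None:
                errors.append(f"{step_line!r}: malformed Rejected line")
            elif MULTI_SENTENCE_RE.search(match.group(2)):
                errors.append(f"{step_line!r}: Reason is longer than one sentence")
            rejected += 1
            i += 1
        if rejected == 0:
            errors.append(f"{step_line!r}: no Rejected line")
    return errors
```

A non-empty result would be the signal to retry the journal entry.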

Mechanism #5 — Meta-Harness Filesystem Observability Loop

Trigger: a task that has failed at least one attempt; harness-iteration or "this isn't working" diagnostic tasks.

Procedure:

Persist raw execution traces (tool calls, errors, intermediate outputs) to filesystem (/tmp/<task-id>/trace-<N>.log).
On failure, Opus reads the trace directly (not summarized). Forms causal hypothesis: "The failure is at trace step K because <invariant violated>".
Propose single targeted fix (not 5-option redesign).
Re-attempt. If it fails again → escalate to advisor() (Mechanism #1).
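The trace-persistence step can be sketched as below. The directory layout follows the path given in the procedure; the JSON-lines event format and the helper names are assumptions. Forming the causal hypothesis remains the model's job: the code only preserves the raw events and locates the first failing step K.

```python
import json
from pathlib import Path
from typing import Optional


def trace_path(task_id: str, attempt: int, root: Path = Path("/tmp")) -> Path:
    """Build <root>/<task-id>/trace-<N>.log, the layout named in the procedure."""
    return root / task_id / f"trace-{attempt}.log"


def append_event(path: Path, step: int, kind: str, payload: str) -> None:
    """Append one raw event as a JSON line; nothing is summarized or truncated."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"step": step, "kind": kind, "payload": payload}) + "\n")


def read_trace(path: Path) -> list[dict]:
    """Load every event of one attempt, in the order it was written."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def first_error_step(events: list[dict]) -> Optional[int]:
    """Return step K of the first error event, or None if the trace has no error."""
    for event in events:
        if event["kind"] == "error":
            return event["step"]
    return None
```

Writing one file per attempt keeps each retry's evidence separate, so a second failure can be compared against the first before escalating.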

Why Opus-specific: Opus reasoning on 10K+ tokens of raw diagnostic data is the unique capability; Sonnet/Haiku struggle past 5K diagnostic tokens. Meta-Harness P03 §4 showed full-history access gives 5–10pp uplift, conditional on the proposer being able to reason over that history. (P03 Meta-Harness §4; P02 Confucius §3.1 per-codebase persistent memory)

Anti-pattern: re-running the failed task with verbal "I'll be more careful this time". No causal hypothesis = no progress.

Per-Session Pre-flight (run once per task — task-type-conditioned)

Always-On Pre-flights

A. Ambiguity Surfacing (counters Opus over-confidence failure mode)

Before substantive work, if the task admits ≥ 2 reasonable i
