Agent Arena
Overview
Agent Arena is a reusable protocol skill for AI coding agents and LLM agent harnesses. Use it when one agent is likely to be overconfident, trapped in a single framing, or missing evidence.
The core idea: independent heterogeneous agents first, debate later, evidence before consensus, dissent preserved.
Agent Arena is designed for Claude Code, OpenAI Codex, Hermes Agent, OpenClaw, OpenCode, Copilot CLI, and other autonomous coding agents or agentic workflows that support custom skills, custom instructions, or tool-driven delegation. It also explicitly supports Claude Code configured with alternative model backends — including GLM (Zhipu AI), DeepSeek, Qwen (Alibaba), Kimi (Moonshot), Doubao (ByteDance), and others accessible via an Anthropic-protocol-compatible proxy or endpoint. A Claude Code session running on a different model family is a genuinely heterogeneous participant.
Protocol note: Claude Code speaks the Anthropic API protocol; Codex speaks the OpenAI API protocol. Alternative models (DeepSeek, GLM, Qwen, etc.) typically expose OpenAI-compatible APIs, so they connect to Codex directly. They connect to Claude Code via a proxy or adapter (such as One API, LiteLLM, or a provider's Anthropic-compatible endpoint) that translates between the Anthropic API format and the target model's API.
Capability boundary: this skill is not an executable orchestrator. It does not install, authenticate, or automatically call external agents. Cross-agent execution requires a host agent or human operator with the relevant CLI/tool access, credentials, permissions, and network availability.
When to Use
Use this skill when the task involves:
- Multi-agent debate or panel review
- Codex vs Claude Code comparison
- Architecture decisions or implementation plan reviews
- Complex bug root-cause analysis
- PR/code review with high consequence
- Research synthesis that needs source checking
- LLM-as-a-judge, agent judge, agent game theory, or debate workflows
- Red teaming a design, prompt, implementation, benchmark, or experiment plan
- Avoiding single-model-family blind spots
- Cross-model backend comparison (e.g. GLM-backed Claude Code vs Codex, DeepSeek vs Claude, Qwen vs GPT)
Do not use full Agent Arena for:
- Simple factual lookups
- Translation, formatting, or summarization
- One obvious local tool call
- Low-risk tasks where the user asked for speed
- Cases where deterministic tests or source code inspection alone answer the question
Quick Decision Gate
Before starting, choose the lightest mode that can work:
solo_red_team: one agent performs structured self-critique when no heterogeneous counterpart is available.quick_panel: two or more agents give short independent opinions; no heavy evidence ledger.design_debate: independent proposals → critique → steelman → revision → judge → synthesis.collaborative_design: Codex and Claude Code co-design a solution through multiple rounds: independent sketches → exchange constraints and critiques → jointly refine interface/architecture → converge on an implementation plan with preserved dissent.evidence_arena: claims require web, docs, source, test, or benchmark evidence.red_team: adversarially challenge a design, plan, prompt, benchmark, or safety assumption.code_review_arena: review code, diffs, pull requests, or implementation details.bug_root_cause_arena: compare root-cause hypotheses and required checks.implementation_plan_review: review implementation plans before coding or delegation.decision_memo_arena: high-stakes recommendation with dissent and uncertainty.tree_search: explore a large option space with branching strategies.full_arena: independent generation, evidence, critique, revision, blind judging, synthesis.
Triage before you commit — both directions matter. "Lightest mode that can work" is the rule only after triage, not the triage rule itself. Under-triage (too light) is as much a failure as over-triage (too heavy).
Escalate beyond quick_panel/solo_red_team to collaborative_design, deliberative_analysis, or full_arena if ANY of these fire:
- Persistent or hard-to-reverse side effects — changing a schema, writing config, uploading data/runs, or setting a policy that affects all future steps.
- Redesign, not point review — you are (re)designing a durable structure, data contract, interface, or allow/deny list, not reviewing or tweaking one concrete spot.
- Genuinely interdependent decisions — several choices must be made together because changing one forces the others. (Ordinary implementation detail does not count: "this function affects later code" is not coupling; "the metric schema dictates the case-data contract dictates the logging policy" is.)
- Repeating a known past mistake — the task partly exists to avoid re-doing something that already went wrong (e.g. re-uploading noisy runs).
- Output becomes a durable contract consumed by other steps or people — a data contract, logging policy, or interface with real blast radius. (A local helper or a signature only this task uses is not a contract; the bar is durability plus external consumers.)
Stay light when none fire: a single reversible low-consequence question, the user asked for speed, or deterministic checks / source inspection already answer it.
Core Principles
- Independence before discussion — agents must produce initial answers before seeing each other.
- Evidence beats consensus — agreement between LLMs is not proof.
- Deterministic checks beat model judgment — tests, source code, docs, logs, benchmarks, and calculators outrank opinions.
- Heterogeneity must be real — different model families, harnesses, tools, prompts, or evidence paths are better than same-model roleplay. Claude Code configured with a different model backend (GLM, DeepSeek, Qwen, Kimi, etc.) counts as a genuinely heterogeneous participant — the model-family difference is real even if the harness is shared.
- No forced consensus — preserve strong minority views when uncertainty remains.
- Expose dissent — final answers must include the best counterargument.
- Degrade honestly — if an agent, tool, or search source is unavailable, state the degraded mode and confidence impact.
- Right-size the arena — pick the lightest mode that fully covers the task. Under-triaging a complex or irreversible task is as much a failure as over-orchestrating a simple one; when escalation triggers fire (see Quick Decision Gate), do not stay light.
- Human checkpoints for high-risk actions — do not push, deploy, delete, spend money, or expose secrets without appropriate confirmation.
- Context minimization without blindness — start with a compact task packet, but allow agents to read necessary source/docs when evidence requires it, subject to the permission boundary.
Safety and Privacy Rules
Before delegating to another agent, running web search, or sending context to any external service:
- Confirm the user allows that data to leave the current agent or machine when private/sensitive material is involved.
- Separate scope permission from content dumping: it is often acceptable to grant an external coding agent read access to the repository/worktree while still forbidding it to quote or exfiltrate unrelated files.
- Remove or deny access to secrets, credentials, access tokens, customer data, private logs, generated result files, datasets, and unrelated proprietary code unless explicitly required and approved.
- Do not cripple evidence gathering by forbidding all file reads. For code/design review, external agents should be allowed to read relevant source files, configs, tests, docs, and dependency manifests when needed.
- Prefer passing a compact task packet first, then let the external agent request/read additional files within the approved scope.
- Treat