Simmer Judge Board

Dispatch a panel of judges, let them score independently, deliberate, and converge on consensus scores and a single ASI. The board's output has the same format as a single judge's output, so the orchestrator can't tell the difference.

Why a Board

A single judge has blind spots. It anchors on whatever it notices first, and its ASI reflects one perspective. Three judges with different lenses catch different things and challenge each other. The ASI that emerges from deliberation is stronger because blind spots get surfaced.

This matters most at plateaus — when a single judge keeps suggesting the same class of fix because it can't see the real bottleneck.

Context You Receive

The board receives the same context the single judge would receive (passed through from the orchestrator):

Current candidate: full artifact text or key workspace files
Criteria rubric: 2-3 criteria with descriptions of what 10/10 looks like
Iteration number: which round this is
Seed calibration (iteration 1+): original seed + iteration-0 scores
Evaluator output (if evaluator mode): stdout/stderr from evaluator command
Context discipline extras (code/pipeline only): previous ASI, iteration history, search space, exploration status

Plus board-specific context:

JUDGE_PANEL (optional): custom judge definitions from setup brief
Problem class: text/creative, code/testable, or pipeline/engineering
ARTIFACT_TYPE: single-file or workspace — this determines what the generator can change
SEARCH_SPACE (if defined): explicit bounds on what's in scope to explore (models, topologies, prompt files, config parameters)
BACKGROUND: constraints, available resources, execution environment (model size, infrastructure, budget). Judges need this to calibrate their ASI to what the executor can actually do.
Previous deliberation summary (iteration 2+): structured as WORKING / NOT WORKING / DIRECTION. Judges must respect the WORKING list — these are elements that have been stable across iterations and should not be removed or changed. The NOT WORKING list prevents retrying failed approaches. The DIRECTION is where the panel's strategic reasoning lives. Build on prior conclusions rather than reasoning from scratch each iteration.
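
As a rough illustration, the WORKING / NOT WORKING / DIRECTION summary could be held in a small record that a judge checks a proposed ASI against. The class name, field names, and sample values below are hypothetical and are not part of simmer itself.

```python
from dataclasses import dataclass, field


@dataclass
class DeliberationSummary:
    """Hypothetical container for the previous panel's conclusions."""
    working: list[str] = field(default_factory=list)      # stable elements: do not remove or change
    not_working: list[str] = field(default_factory=list)  # failed approaches: do not retry
    direction: str = ""                                   # the panel's strategic reasoning

    def conflicts(self, touches: set[str], approach: str) -> list[str]:
        """Return the reasons a proposed ASI conflicts with prior conclusions."""
        reasons = []
        for element in self.working:
            if element in touches:
                reasons.append(f"changes WORKING element: {element}")
        if approach in self.not_working:
            reasons.append(f"retries NOT WORKING approach: {approach}")
        return reasons


# Invented sample values, for illustration only.
summary = DeliberationSummary(
    working=["few-shot examples"],
    not_working=["longer system prompt"],
    direction="tighten the output format instructions",
)
print(summary.conflicts({"few-shot examples"}, "longer system prompt"))
```

An empty list from `conflicts` means the proposal neither disturbs a stable element nor repeats a failed approach.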

Mutation Bounds

Judges must understand what the generator can actually change. The ASI is worthless if it suggests something outside the generator's scope.

| Artifact Type | Generator Can Change | Generator Cannot Change |
|---|---|---|
| single-file | The text content of the artifact (prompt, document, config file) | Model selection, code, infrastructure, pipeline topology, adding new files |
| workspace | Any files in the workspace directory (code, config, prompts, scripts), including adding new files | Things outside the workspace, external infrastructure not in the search space |

If SEARCH_SPACE is defined, it further constrains what's in scope. The generator can only change things within the search space bounds.

Every ASI the panel produces must be actionable within these bounds. If the panel concludes "the model is the bottleneck" but the artifact is single-file (can't swap models), the ASI should say "prompt changes alone won't break through this ceiling — recommend switching to workspace mode or early termination" rather than suggesting a model swap the generator can't execute.
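
The bounds check described above might be sketched as follows. The change-kind labels (`"text"`, `"code"`, and so on) and the function name are assumptions made for illustration; the source defines the bounds only in prose and in the table.

```python
from typing import Optional

# Kinds of change each artifact type permits, paraphrased from the table.
# The labels themselves are illustrative assumptions.
ALLOWED = {
    "single-file": {"text"},
    "workspace": {"text", "code", "config", "prompts", "scripts", "new-file"},
}


def is_actionable(artifact_type: str, change_kind: str,
                  target: Optional[str] = None,
                  search_space: Optional[set] = None) -> bool:
    """Return True if a suggested change falls inside the generator's bounds."""
    if change_kind not in ALLOWED.get(artifact_type, set()):
        return False
    # A defined SEARCH_SPACE narrows the bounds further.
    if search_space is not None and target not in search_space:
        return False
    return True


# A model swap is out of bounds for a single-file artifact, so the ASI should
# recommend workspace mode or early termination instead of the swap.
print(is_actionable("single-file", "model"))  # False
```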

Judges must read the relevant artifacts before scoring. Read the candidate, the evaluator script, config files, and prior candidates. Understand how the system works and why the scores are what they are. Research approaches if you see failure patterns you don't know how to fix. The ASI should come from understanding, not from reading metrics and guessing.

The Three Phases

Phase 1: Independent Scoring (Parallel)

Dispatch 3 judges as parallel subagents. Each receives the full judge context plus a unique LENS assignment.

Each judge invokes simmer:simmer-judge — the existing judge skill is reused. The lens is preamble context that frames their perspective.
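
A minimal sketch of the parallel dispatch, assuming a thread pool stands in for the subagent mechanism. `run_judge` is a hypothetical placeholder for invoking simmer:simmer-judge, and the lens names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def run_judge(lens: str, context: str) -> str:
    """Hypothetical stand-in for invoking simmer:simmer-judge with a lens preamble."""
    return f"[{lens}] scored candidate ({len(context)} chars of context)"


def dispatch_panel(lenses: list, context: str) -> dict:
    """Run one judge per lens in parallel; no judge sees another judge's output."""
    with ThreadPoolExecutor(max_workers=len(lenses)) as pool:
        futures = {lens: pool.submit(run_judge, lens, context) for lens in lenses}
        # Collect results only after every judge has been submitted.
        return {lens: future.result() for lens, future in futures.items()}


results = dispatch_panel(["lens-a", "lens-b", "lens-c"], "full judge context")
for lens, output in results.items():
    print(lens, "->", output)
```

Every judge receives the same context and differs only in its lens, which keeps the Phase 1 scores independent.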

Panelist prompt template:

You are one of three judges on a simmer judge board. Your role is to
score from your specific lens — the other judges cover other angles.

YOUR LENS: [name]
[lens description — what to focus on, what perspective you bring]

ARTIFACT_TYPE: [single-file | workspace]
SEARCH_SPACE: [what's in scope to change — omit if unconstrained]
WHAT THE GENERATOR CAN CHANGE: [single-file: text only | workspace: files in scope]

BACKGROUND:
[constraints, model size, infrastructure, available resources]

PREVIOUS PANEL DELIBERATION:
[WORKING / NOT WORKING / DIRECTION from last round — omit on iteration 0]

FILES YOU SHOULD READ:
- Candidate: [e.g., ./docs/simmer/iteration-2-candidate.md]
- Evaluator script: [e.g., ./evaluate.sh — omit if judge-only]
- Ground truth / test data: [e.g., ./test-data/expected.json — omit if unknown]
- Prior candidates: [e.g., ./docs/simmer/iteration-0-candidate.md, iteration-1-candidate.md — code/pipeline only]
- Config: [e.g., ./config.json — omit if none]

─── STEP 1: INVESTIGATE (required, before scoring) ───

Read the files listed above. Understand the problem before judging it.

On iteration 0 (seed):
- Read the evaluator script — understand HOW it scores (exact match?
  fuzzy? case-sensitive? what format does it expect?)
- Read the ground truth if accessible — what's the theoretical maximum?
  Are there unreachable targets?
- Read the background constraints — what can the model actually do?

Every iteration:
- Read the candidate file — structure and formatting matter, not just
  the text summary in this prompt
- [Code/pipeline only] Read prior candidates — what structural changes
  were tried and what was their effect? (Text/creative judges do NOT
  read prior candidates — this prevents anchoring to previous versions)
- When you see a failure pattern you don't know how to fix, SEARCH
  for solutions before proposing your ASI

─── STEP 2: SCORE (with full understanding) ───

Score ALL criteria from your lens — not just one. Your lens frames
HOW you analyze, not WHAT you analyze. Every judge scores every
criterion from their unique perspective. This gives cross-criterion
insight — one criterion improving while another regresses is a
trade-off the board needs to surface.

Your scores should be grounded in what you found during investigation,
not just observation of the evaluator output.

─── STEP 3: ASI (informed by research) ───

Your ASI candidate must:
- Be actionable within the generator's bounds
- Cite what you found during investigation — reference the evaluator
  mechanics, prior iteration results, or research you did
- Reference prior iterations when relevant — what was tried, why it
  worked or didn't, and how your suggestion differs
- If you searched for solutions, cite what you found

If the bottleneck is outside the generator's bounds, say so — the
clerk will recommend early termination or mode change.

Invoke the skill: simmer:simmer-judge

[... standard judge context from orchestrator ...]
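
The header of the template above could be assembled roughly like this, dropping the optional sections when they do not apply. The function and parameter names are hypothetical, and only the first few sections of the template are covered.

```python
from typing import Optional

# Wording taken from the WHAT THE GENERATOR CAN CHANGE line of the template.
CAN_CHANGE = {"single-file": "text only", "workspace": "files in scope"}


def build_panelist_header(lens_name: str, lens_description: str,
                          artifact_type: str, background: str,
                          search_space: Optional[str] = None,
                          previous_deliberation: Optional[str] = None) -> str:
    """Assemble the lens, bounds, and background sections of a panelist prompt."""
    lines = [
        f"YOUR LENS: {lens_name}",
        lens_description,
        "",
        f"ARTIFACT_TYPE: {artifact_type}",
    ]
    if search_space:  # omitted when the search space is unconstrained
        lines.append(f"SEARCH_SPACE: {search_space}")
    lines.append(f"WHAT THE GENERATOR CAN CHANGE: {CAN_CHANGE[artifact_type]}")
    lines += ["", "BACKGROUND:", background]
    if previous_deliberation:  # omitted on iteration 0
        lines += ["", "PREVIOUS PANEL DELIBERATION:", previous_deliberation]
    return "\n".join(lines)


print(build_panelist_header("lens-a", "Focus on structure.", "single-file",
                            "Small model, no network access."))
```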


Each panelist produces the standard judge output format:

ITERATION [N] SCORES:
  [criterion]: [N]/10 — [reasoning] — [specific improvement]
  COMPOSITE: [N.N]/10

ASI (highest-leverage direction): [their ASI candidate]
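
To collect panelist outputs for deliberation, the format above could be parsed along these lines. This is a sketch that assumes one criterion per line and an ASI section that runs to the end of the output; the function name and the sample values are invented.

```python
import re

DASH = "\u2014"  # the dash separator used in the judge output format

SCORE_LINE = re.compile(
    rf"^\s*(?P<criterion>[^:\n]+):\s*(?P<score>\d+)/10\s*{DASH}\s*"
    rf"(?P<reasoning>.+?)\s*{DASH}\s*(?P<improvement>.+)$"
)
COMPOSITE_LINE = re.compile(r"^\s*COMPOSITE:\s*(?P<composite>\d+(?:\.\d+)?)/10\s*$")


def parse_judge_output(text: str) -> dict:
    """Extract per-criterion scores, the composite, and the ASI text."""
    scores, composite, asi_lines, in_asi = {}, None, [], False
    for line in text.splitlines():
        if not in_asi and line.lstrip().startswith("ASI ("):
            in_asi = True
            rest = line.split(":", 1)[1].strip() if ":" in line else ""
            if rest:  # ASI text on the same line as the header
                asi_lines.append(rest)
            continue
        if in_asi:
            if line.strip():
                asi_lines.append(line.strip())
            continue
        match = COMPOSITE_LINE.match(line)
        if match:
            composite = float(match.group("composite"))
            continue
        match = SCORE_LINE.match(line)
        if match:
            scores[match.group("criterion").strip()] = int(match.group("score"))
    return {"scores": scores, "composite": composite, "asi": " ".join(asi_lines)}


# Invented sample output, for illustration only.
sample = "\n".join([
    "ITERATION 2 SCORES:",
    f"  clarity: 6/10 {DASH} steps are out of order {DASH} reorder setup before usage",
    f"  accuracy: 7/10 {DASH} one flag is wrong {DASH} fix the flag name",
    "  COMPOSITE: 6.5/10",
    "",
    "ASI (highest-leverage direction):",
    "  Reorder the setup section.",
])
parsed = parse_judge_output(sample)
print(parsed)
```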


Phase 2: Deliberation (One Round)

After collecting all independent scores, run one deliberation round. Each judge sees the other judges' scores + reasoning (but NOT their ASI candidates yet — ASI deliberation happens in synthesis).
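
Withholding ASI candidates during deliberation could be sketched as below, assuming each judge's output ends with an ASI section as in the standard format. Both function names and the sample strings are hypothetical.

```python
def redact_asi(judge_output: str) -> str:
    """Keep scores and reasoning; drop everything from the ASI header onward."""
    kept = []
    for line in judge_output.splitlines():
        if line.lstrip().startswith("ASI ("):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


def deliberation_views(outputs: dict) -> dict:
    """Give each judge its own full output plus the other judges' redacted outputs."""
    views = {}
    for judge, own in outputs.items():
        others = {name: redact_asi(text)
                  for name, text in outputs.items() if name != judge}
        views[judge] = {"own": own, "others": others}
    return views


# Invented sample outputs, for illustration only.
outputs = {
    "A": "ITERATION 1 SCORES:\n  COMPOSITE: 6.0/10\n\nASI (highest-leverage direction): fix A",
    "B": "ITERATION 1 SCORES:\n  COMPOSITE: 5.0/10\n\nASI (highest-leverage direction): fix B",
    "C": "ITERATION 1 SCORES:\n  COMPOSITE: 7.0/10\n\nASI (highest-leverage direction): fix C",
}
views = deliberation_views(outputs)
print(views["A"]["others"]["B"])
```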

Deliberation prompt per panelist:

You are deliberating on a simmer judge board.

YOUR INDEPENDENT SCORES (from Phase 1): [this judge's full output]

JUDGE B's SCORES: [judge B's scores + reasoning, not their ASI]

JUDGE C's SCORES: [judge C's scores + reasoning, not their ASI]

Review the other judges' scores and reasoning. For each criterion:

Agree — if
