Simmer Setup
Inspect the artifact, infer what "better" means and how to measure it, propose an assessment to the user, and then produce the setup brief that drives the entire refinement loop.
Core principle: Inspect first, infer second, propose third, confirm last. The agent does the thinking — the user validates, adjusts, or overrides. Never ask the user to describe something the agent can read.
Phase 1: Identify and Inspect
Identify the Artifact
Look for:
- A file path mentioned or open in context
- Text pasted by the user
- A directory path or workspace
- A description of something to generate from scratch (seedless mode)
If ambiguous, ask once:
What are we refining?
1. A file (give me the path)
2. Something you'll paste
3. A workspace/directory (give me the path)
4. Generate from a description (I'll create the starting point)
Set mode and artifact type:
| Mode | Artifact Type | When |
|---|---|---|
| from-file | single-file | User provides a file path |
| from-paste | single-file | User pastes content |
| from-workspace | workspace | User provides a directory path |
| seedless | single-file or workspace | User describes what to create |
Inspect the Artifact
For single-file (from-file or from-paste):
- Read the file/content
- Identify what kind of artifact it is (prose, code, prompt, config, etc.)
- Note any evaluator references (test commands, benchmark scripts mentioned in comments)
- Note any output format expectations visible in the content
For workspace (from-workspace):
- List the directory contents
- Read key files: config files, entry points, scripts, READMEs
- Look specifically for:
  - Evaluator scripts: files named evaluate.*, test.*, benchmark.*, or scripts referenced in configs/READMEs
  - Validation scripts: files named validate.*, check.*, or quick-test variants
  - Config files: config.json, config.yaml, .env, etc. (these reveal what parameters can be varied)
  - Output examples: any sample output, expected output, or ground truth files
  - Strategy/plugin dirs: directories like strategies/, plugins/, models/ that indicate extensibility points
  - Prompt files: prompt.md, system.txt, template files (things the generator can modify)
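The workspace checklist above could be approximated with a small directory scan. This is a minimal sketch, not a prescribed implementation: the function name scan_workspace, the category keys, and the choice to match on file names only are assumptions made for illustration. It does not follow script references inside configs or READMEs, which still need to be read by hand.

```python
import fnmatch
from pathlib import Path

# File-name patterns taken from the checklist; the category keys are illustrative.
PATTERNS = {
    "evaluator": ["evaluate.*", "test.*", "benchmark.*"],
    "validation": ["validate.*", "check.*"],
    "config": ["config.json", "config.yaml", ".env"],
    "prompt": ["prompt.md", "system.txt"],
}
# Directory names that hint at extensibility points.
EXTENSION_DIRS = {"strategies", "plugins", "models"}


def scan_workspace(root):
    """Return {category: relative paths} for entries that match the checklist."""
    root = Path(root)
    found = {category: [] for category in PATTERNS}
    found["extension_dir"] = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.is_dir():
            if path.name in EXTENSION_DIRS:
                found["extension_dir"].append(rel)
            continue
        for category, globs in PATTERNS.items():
            if any(fnmatch.fnmatch(path.name, g) for g in globs):
                found[category].append(rel)
    return found
```

Calling scan_workspace("path/to/workspace") gives a first pass at which files to read in full; an empty "evaluator" list suggests checking configs and READMEs for script references before concluding there is no evaluator.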
For seedless:
- No inspection needed — work from the user's description
- Classify the artifact type from the description
Phase 2: Classify and Infer
Problem Class Detection
Infer the problem class from what you found during inspection. Never ask the user what class this is.
IF mode == "seedless" AND description is prose/creative:
→ text/creative
ELSE IF artifact_type == "workspace" AND (evaluator script found OR user mentioned evaluator):
→ pipeline/engineering
ELSE IF evaluator found OR artifact is code:
→ code/testable
ELSE:
→ text/creative
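The decision tree above translates directly into a function. The sketch below assumes the inspection results have already been reduced to booleans; the function name and parameter names are hypothetical, while the branch order and returned class labels follow the rules as written.

```python
def detect_problem_class(
    mode,
    artifact_type,
    evaluator_found=False,
    user_mentioned_evaluator=False,
    description_is_prose=False,
    artifact_is_code=False,
):
    """Map inspection results to a problem class, checking branches in order."""
    # Seedless prose or creative requests never need an evaluator.
    if mode == "seedless" and description_is_prose:
        return "text/creative"
    # A workspace with an evaluator is treated as a pipeline problem.
    if artifact_type == "workspace" and (evaluator_found or user_mentioned_evaluator):
        return "pipeline/engineering"
    # Any other artifact with an evaluator, or any code, is testable.
    if evaluator_found or artifact_is_code:
        return "code/testable"
    # Everything else falls back to the lightweight path.
    return "text/creative"
```

Because the branches are checked in order, a workspace without any evaluator but containing code still lands in code/testable rather than pipeline/engineering.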
What to Infer Per Class
Text/Creative — infer criteria only:
- Suggest 2-3 criteria based on the artifact type (see seed criteria table below)
- No contracts, no evaluator, no search space
- This path should feel lightweight
Code/Testable — infer criteria + evaluation:
- Suggest criteria based on the code's purpose
- If an evaluator script was found, note it as the proposed evaluator
- If the code produces structured output, infer the output contract from its format
- Note any constraints visible in the code (model references, API endpoints, etc.)
Pipeline/Engineering — infer everything:
- Criteria: from evaluator output metrics (if evaluator script is readable, look at what it measures)
- Evaluator: the evaluator script found during inspection
- Output contract: from pipeline output format, evaluator expectations, or example output
- Validation command: any quick-test script found, or propose a subset run (e.g., "run on 1 input instead of all N")
- Search space: from config parameters (models, temperatures, strategies), extensibility points (strategy dirs), and prompt files
- Background: from config values (API endpoints, model names already configured), directory structure
Seed Criteria Table
Use when proposing criteria. The agent should prefer criteria inferred from the actual artifact over these generic seeds.
| Artifact type | Suggested criteria |
|---|---|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Code (non-cookoff) | simplicity, robustness, readability |
| Adventure hook / game content | narrative tension, player agency, specificity |
| Blog post / article | argument clarity, engagement, structure |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |
Phase 3: Propose or Proceed
Sufficiency check: Before proposing, check whether the user's initial prompt + inspection results already provide everything needed for the brief:
- Artifact identified? (path, content, or description)
- Criteria determinable? (user stated them, or inferable from evaluator)
- Primary criterion stated? (user said it, or not applicable for text/creative)
- Evaluation method known? (evaluator script found, or judge-only for text)
- Iteration count? (user stated, or default 3)
If all fields are covered (from user prompt + inspection), skip the proposal and go directly to Phase 5 (emit brief). This is the common case when running as a subagent — the calling prompt provides intent/constraints and inspection fills in contracts.
If some fields are missing or ambiguous, present everything you inferred as a single conversational assessment. The user confirms, adjusts, or overrides. This is ONE message, not a sequence of questions.
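The sufficiency check can be pictured as a gate over the brief's fields. This is a sketch under stated assumptions: the BriefFields dataclass, its field names, and the two step labels are invented for illustration and are not a defined schema. Only the default of 3 iterations and the rule that a primary criterion is not applicable for text/creative come from the checklist above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BriefFields:
    """Hypothetical container for what the prompt and inspection provided."""
    artifact: Optional[str] = None           # path, content, or description
    criteria: Optional[List[str]] = None     # stated by the user or inferred
    primary_criterion: Optional[str] = None  # not needed for text/creative
    evaluation_method: Optional[str] = None  # evaluator script or "judge-only"
    iterations: int = 3                      # default when the user gives none
    problem_class: str = "text/creative"


def missing_fields(fields):
    """List the checklist items that are still unknown."""
    missing = []
    if not fields.artifact:
        missing.append("artifact")
    if not fields.criteria:
        missing.append("criteria")
    if fields.problem_class != "text/creative" and not fields.primary_criterion:
        missing.append("primary_criterion")
    if not fields.evaluation_method:
        missing.append("evaluation_method")
    return missing


def next_step(fields):
    """Skip the proposal only when nothing is missing."""
    return "emit_brief" if not missing_fields(fields) else "propose_assessment"
```

When next_step returns "propose_assessment", the output of missing_fields indicates which parts of the single assessment message need the user's attention.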
Text/Creative Assessment
This is a [artifact type] — I'll use judge-only evaluation (no scripts to run).
For criteria, I'd suggest:
- [criterion 1]: [inferred description of what good looks like]
- [criterion 2]: [inferred description]
- [criterion 3]: [inferred description]
3 iterations, starting from [seed description].
Sound right, or want to adjust anything?
Code/Testable Assessment
This is [what the code does]. I found [evaluator/test script] which I'll
use to evaluate each iteration.
For criteria:
- [criterion 1]: [inferred from evaluator metrics or code purpose]
- [criterion 2]: [inferred]
- [criterion 3]: [inferred]
[If output contract inferred]: Output should be [format description].
[If constraints found]: I see [model/API/resource constraints].
3 iterations. Which criterion matters most, or are they equal?
Pipeline/Engineering Assessment
This is a pipeline optimization problem. Here's what I found:
**Evaluator:** [script path] — measures [what it measures, from reading the script]
**Output contract:** [inferred from pipeline output format / evaluator expectations]
**Validation:** [script path or proposed subset command] — [what it checks, estimated time]
**Search space:** [inferred from config + directory structure]
- Models: [from config values]
- Prompts: [prompt files found]
- Topology: [strategy dirs, extensibility points]
For criteria:
- [criterion 1]: [from evaluator metrics] — [primary?]
- [criterion 2]: [from evaluator metrics]
- [criterion 3]: [from evaluator metrics]
**Constraints:** [API endpoints, available infrastructure from config]
[N] iterations. Does this look right? Anything to add or change?
What the User Can Do
The user can:
- Confirm as-is ("looks good", "yes", "go") → proceed to brief
- Adjust specifics ("change the primary to X", "add Y to search space") → incorporate and proceed
- Override ("no, the evaluator is act