Simmer Setup

Inspect the artifact, infer what "better" means and how to measure it, propose an assessment to the user, and produce the setup brief that drives the entire refinement loop.

Core principle: Inspect first, infer second, propose third, confirm last. The agent does the thinking — the user validates, adjusts, or overrides. Never ask the user to describe something the agent can read.

Phase 1: Identify and Inspect

Identify the Artifact

Look for:

A file path mentioned or open in context
Text pasted by the user
A directory path or workspace
A description of something to generate from scratch (seedless mode)

If ambiguous, ask once:

What are we refining?
1. A file (give me the path)
2. Something you'll paste
3. A workspace/directory (give me the path)
4. Generate from a description (I'll create the starting point)

Set mode and artifact type:

Mode	Artifact Type	When
from-file	single-file	User provides a file path
from-paste	single-file	User pastes content
from-workspace	workspace	User provides a directory path
seedless	single-file or workspace	User describes what to create
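
As an illustration only, the mode table above could be sketched as a small lookup function. The function name `detect_mode` and its parameters are hypothetical and not part of the skill; returning `None` stands in for the "ask once" fallback, and seedless leaves the artifact type open because the table allows either.

```python
from pathlib import Path


def detect_mode(path=None, pasted_text=None, description=None):
    """Return (mode, artifact_type), or None when the input is ambiguous.

    Hypothetical helper: the argument names are assumptions for this sketch.
    """
    if path is not None:
        # A directory path means a workspace; any other path is treated as a file.
        if Path(path).is_dir():
            return ("from-workspace", "workspace")
        return ("from-file", "single-file")
    if pasted_text:
        return ("from-paste", "single-file")
    if description:
        # Seedless may be single-file or workspace; classify from the description later.
        return ("seedless", None)
    return None  # ambiguous: ask the user once
```

For example, `detect_mode(pasted_text="Dear team, ...")` returns `("from-paste", "single-file")`.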

Inspect the Artifact

For single-file (from-file or from-paste):

Read the file/content
Identify what kind of artifact it is (prose, code, prompt, config, etc.)
Note any evaluator references (test commands, benchmark scripts mentioned in comments)
Note any output format expectations visible in the content

For workspace (from-workspace):

List the directory contents
Read key files: config files, entry points, scripts, READMEs
Look specifically for:
- Evaluator scripts: files named evaluate.*, test.*, benchmark.*, or scripts referenced in configs/READMEs
- Validation scripts: files named validate.*, check.*, or quick-test variants
- Config files: config.json, config.yaml, .env, etc. — these reveal what parameters can be varied
- Output examples: any sample output, expected output, or ground truth files
- Strategy/plugin dirs: directories like strategies/, plugins/, models/ that indicate extensibility points
- Prompt files: prompt.md, system.txt, template files — things the generator can modify
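
The workspace scan above might look roughly like the following sketch. It assumes simple filename globbing is enough to spot candidates; the names `FILE_PATTERNS`, `DIR_NAMES`, and `scan_workspace` are hypothetical, and the pattern lists cover only the examples named in the list. Scripts referenced from configs or READMEs, output examples, and template files would still need to be found by reading the files.

```python
from fnmatch import fnmatch
from pathlib import Path

# Filename patterns taken from the examples in the list above (not exhaustive).
FILE_PATTERNS = {
    "evaluator": ["evaluate.*", "test.*", "benchmark.*"],
    "validation": ["validate.*", "check.*"],
    "config": ["config.json", "config.yaml", ".env"],
    "prompt": ["prompt.md", "system.txt"],
}
# Directory names that hint at extensibility points.
DIR_NAMES = {"strategies", "plugins", "models"}


def scan_workspace(root):
    """Walk root and group candidate paths (relative, POSIX-style) by kind."""
    root = Path(root)
    found = {kind: [] for kind in FILE_PATTERNS}
    found["extensibility"] = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.is_dir():
            if path.name in DIR_NAMES:
                found["extensibility"].append(rel)
            continue
        for kind, patterns in FILE_PATTERNS.items():
            if any(fnmatch(path.name, pattern) for pattern in patterns):
                found[kind].append(rel)
    return found
```

The result is only a list of candidates to read next; the agent still has to open each file to confirm what it actually measures or configures.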

For seedless:

No inspection needed — work from the user's description
Classify the artifact type from the description

Phase 2: Classify and Infer

Problem Class Detection

Infer the problem class from what you found during inspection. Never ask the user what class this is.

IF mode == "seedless" AND description is prose/creative:
    → text/creative

ELSE IF artifact_type == "workspace" AND (evaluator script found OR user mentioned evaluator):
    → pipeline/engineering

ELSE IF evaluator found OR artifact is code:
    → code/testable

ELSE:
    → text/creative
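
The decision tree above translates directly into a function. This is a minimal sketch: the function name and the boolean flags are hypothetical stand-ins for whatever the inspection phase recorded.

```python
def detect_problem_class(
    mode,
    artifact_type,
    evaluator_found=False,
    user_mentioned_evaluator=False,
    artifact_is_code=False,
    description_is_prose=False,
):
    """Mirror the IF / ELSE IF chain above; branch order matters."""
    if mode == "seedless" and description_is_prose:
        return "text/creative"
    if artifact_type == "workspace" and (evaluator_found or user_mentioned_evaluator):
        return "pipeline/engineering"
    if evaluator_found or artifact_is_code:
        return "code/testable"
    return "text/creative"
```

Note that a workspace without any evaluator falls through to the later branches: it is classified as code/testable only if the artifact is code, and as text/creative otherwise.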

What to Infer Per Class

Text/Creative — infer criteria only:

Suggest 2-3 criteria based on the artifact type (see seed criteria table below)
No contracts, no evaluator, no search space
This path should feel lightweight

Code/Testable — infer criteria + evaluation:

Suggest criteria based on the code's purpose
If an evaluator script was found, note it as the proposed evaluator
If the code produces structured output, infer the output contract from its format
Note any constraints visible in the code (model references, API endpoints, etc.)

Pipeline/Engineering — infer everything:

Criteria: from evaluator output metrics (if evaluator script is readable, look at what it measures)
Evaluator: the evaluator script found during inspection
Output contract: from pipeline output format, evaluator expectations, or example output
Validation command: any quick-test script found, or propose a subset run (e.g., "run on 1 input instead of all N")
Search space: from config parameters (models, temperatures, strategies), extensibility points (strategy dirs), and prompt files
Background: from config values (API endpoints, model names already configured), directory structure

Seed Criteria Table

Use when proposing criteria. The agent should prefer criteria inferred from the actual artifact over these generic seeds.

Artifact type	Suggested criteria
Document / spec	clarity, completeness, actionability
Creative writing	narrative tension, specificity, voice consistency
Email / comms	value prop clarity, tone match, call to action strength
Prompt / instructions	instruction precision, output predictability, edge case coverage
API design	contract completeness, developer ergonomics, consistency
Code (non-cookoff)	simplicity, robustness, readability
Adventure hook / game content	narrative tension, player agency, specificity
Blog post / article	argument clarity, engagement, structure
Pipeline / workflow	coverage, efficiency, noise
Configuration / infra	correctness, resource efficiency, maintainability
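
One way to express the "inferred criteria first, seeds as fallback" rule is a lookup with an override. This is a sketch under assumptions: the dictionary keys, the function name `propose_criteria`, and the cap of three criteria are illustrative choices, and only a few rows of the table are reproduced.

```python
# A few rows from the seed table above; the key spellings are hypothetical.
SEED_CRITERIA = {
    "document/spec": ["clarity", "completeness", "actionability"],
    "prompt/instructions": [
        "instruction precision",
        "output predictability",
        "edge case coverage",
    ],
    "pipeline/workflow": ["coverage", "efficiency", "noise"],
}


def propose_criteria(artifact_type, inferred=None, limit=3):
    """Prefer criteria inferred from the artifact; fall back to generic seeds."""
    if inferred:
        return list(inferred)[:limit]
    return list(SEED_CRITERIA.get(artifact_type, []))[:limit]
```

For example, `propose_criteria("pipeline/workflow", inferred=["recall", "latency"])` returns the inferred pair and ignores the seeds.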

Phase 3: Propose or Proceed

Sufficiency check: Before proposing, check whether the user's initial prompt + inspection results already provide everything needed for the brief:

Artifact identified? (path, content, or description)
Criteria determinable? (user stated them, or inferable from evaluator)
Primary criterion stated? (user said it, or not applicable for text/creative)
Evaluation method known? (evaluator script found, or judge-only for text)
Iteration count? (user stated, or default 3)

If all fields are covered (from user prompt + inspection), skip the proposal and go directly to Phase 5 (emit brief). This is the common case when running as a subagent — the calling prompt provides intent/constraints and inspection fills in contracts.

If some fields are missing or ambiguous, present everything you inferred as a single conversational assessment. The user confirms, adjusts, or overrides. This is ONE message, not a sequence of questions.
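
The sufficiency check can be sketched as a function that lists whichever brief fields are still unresolved. The `BriefDraft` dataclass and its field names are hypothetical; they only mirror the five questions in the checklist. An empty result means the proposal can be skipped, and anything else goes into the single assessment message.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BriefDraft:
    """Hypothetical container for what the prompt and inspection have provided."""
    artifact: Optional[str] = None           # path, content, or description
    criteria: List[str] = field(default_factory=list)
    primary_criterion: Optional[str] = None
    evaluation_method: Optional[str] = None  # evaluator script path or "judge-only"
    iterations: int = 3                      # default when the user gives no count
    problem_class: str = "text/creative"


def missing_fields(draft):
    """Return the names of brief fields that still need user input."""
    is_text = draft.problem_class == "text/creative"
    missing = []
    if not draft.artifact:
        missing.append("artifact")
    if not draft.criteria:
        missing.append("criteria")
    # A primary criterion is not applicable for text/creative.
    if not is_text and not draft.primary_criterion:
        missing.append("primary_criterion")
    # Text/creative falls back to judge-only evaluation.
    if not is_text and not draft.evaluation_method:
        missing.append("evaluation_method")
    return missing
```

Because the iteration count defaults to 3, it never appears in the missing list in this sketch.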

Text/Creative Assessment

This is a [artifact type] — I'll use judge-only evaluation (no scripts to run).

For criteria, I'd suggest:
- [criterion 1]: [inferred description of what good looks like]
- [criterion 2]: [inferred description]
- [criterion 3]: [inferred description]

3 iterations, starting from [seed description].
Sound right, or want to adjust anything?

Code/Testable Assessment

This is [what the code does]. I found [evaluator/test script] which I'll
use to evaluate each iteration.

For criteria:
- [criterion 1]: [inferred from evaluator metrics or code purpose]
- [criterion 2]: [inferred]
- [criterion 3]: [inferred]

[If output contract inferred]: Output should be [format description].
[If constraints found]: I see [model/API/resource constraints].

3 iterations. Which criterion matters most, or are they equal?

Pipeline/Engineering Assessment

This is a pipeline optimization problem. Here's what I found:

**Evaluator:** [script path] — measures [what it measures, from reading the script]
**Output contract:** [inferred from pipeline output format / evaluator expectations]
**Validation:** [script path or proposed subset command] — [what it checks, estimated time]
**Search space:** [inferred from config + directory structure]
  - Models: [from config values]
  - Prompts: [prompt files found]
  - Topology: [strategy dirs, extensibility points]

For criteria:
- [criterion 1]: [from evaluator metrics] — [primary?]
- [criterion 2]: [from evaluator metrics]
- [criterion 3]: [from evaluator metrics]

**Constraints:** [API endpoints, available infrastructure from config]

[N] iterations. Does this look right? Anything to add or change?

What the User Can Do

The user can:

Confirm as-is ("looks good", "yes", "go") → proceed to brief
Adjust specifics ("change the primary to X", "add Y to search space") → incorporate and proceed
Override ("no, the evaluator is act
