Auto Itera
Overview
Autonomous experimentation engine for AI engineering decisions. You provide three things — a goal, the candidate approaches, and a success threshold. auto-itera does the rest: sources real production data, scores baseline + arms in parallel, diagnoses per-row, iterates in disciplined sprints, runs the held-out test pass, and writes a one-page conclusion doc with publication-quality charts.
The pivot: AI engineering decisions move from handcrafted trial-and-error to autonomous scientific search. Instead of opening a notebook and eyeballing diffs, you delegate the whole experimental loop and get back a verdict you can defend in a code review.
Internally the loop is a six-phase scientific method — Frame → Source → Metric → Run → Diagnose → Conclude — with sprint-and-generalize refinement (not unbounded iteration, not arbitrary hard caps). The autonomy lives in phases 1, 3, 4, and 5; phases 0 (Frame) and 2 (Metric pre-registration) are the user-provided inputs that anchor the search.
Honest expectation boundary: auto-itera autonomously runs the experiment given a goal and candidate arms. It does not autonomously brainstorm what to test — you (or the human asking Claude) still write the candidate prompts / pick the candidate models / define the metric. The autonomy is "from candidates to verdict", not "from blank slate to verdict".
flowchart LR
F["Phase 0<br/>Frame"] --> S["Phase 1<br/>Source + Split"]
S --> M["Phase 2<br/>Metric"]
M --> R["Phase 3<br/>Run on dev"]
R --> D["Phase 4<br/>Diagnose"]
D -->|"3-iter sprint<br/>→ generalization gate<br/>→ continue or lock"| R
D --> V["Phase 5<br/>Verdict on<br/>held-out test"]
V --> C(["Conclusion doc<br/>+ charts"])
Core principle: Conclusions must come from data the experiment didn't tune on. Anything else is overfitting wearing a science costume.
When to Use
- Competing implementation choices ("should we use 2 LLMs or 1?", "Sonnet or Haiku?", "this prompt or that?")
- Production behavior is the ground truth (not synthetic benchmarks)
- The win must generalize, not just look good on cherry-picked examples
- A "ship or kill" decision is needed within ~1 day
When NOT to Use
- Decision can be made from first-principles reasoning alone (e.g., "this code path is wrong because it ignores
resetAt") — just fix it - No real production data available (synthetic-only experiments are weaker than thoughtful design)
- The arms are mechanically the same (e.g., "v1.0 prompt vs v1.0 prompt with whitespace edits") — you have nothing to test
The Six Phases
Phase 0 — Frame the question (BEFORE touching data)
Write down, on paper:
| Field | Example |
|---|---|
| Question | "Is splitting runCleanup into decision-LLM + generation-LLM better than the current single-call approach?" |
| Hypothesis | "Two-step is better when missed rows mostly fit existing questions (cheap decision call short-circuits expensive generation)." |
| Baseline | Current production behavior, exactly as deployed. Not a strawman. Reference the file:line. |
| Arms | 1-3 concrete alternatives. Each must change ONE thing vs baseline. If you change two things you can't attribute the win. |
| Stop conditions | "Arm wins if ≥X effect size on test set AND not overfit (per Phase 4 audit). Otherwise kill." |
Lock these before collecting data. Decisions made AFTER seeing scores are how p-hacking happens.
Phase 1 — Source data + split
Source from production whenever possible. Real customer rows + real prod state. Synthetic data underestimates the long tail.
| Set | Size | Purpose | Examined? |
|---|---|---|---|
| Train / scratch | ~30% | Develop arms, debug prompts, look at examples | freely |
| Dev | ~50% | Score iterations during Phase 3-4 | scores only — never read raw rows to tune |
| Test (held-out) | ~20% | Final verdict in Phase 5 | NEVER opened until Phase 5; ONE pass only |
Hard rule: Once you've seen a test-set row, it's contaminated. Move it to dev and reseal a fresh test set. No exceptions.
For small N (<50 rows total): use stratified k-fold on dev instead of a fixed split. Keep the test set sealed regardless.
Sampling philosophy — reflect, don't sanitize. When deduping by vendor / topic / class, the goal is to reflect the production distribution, not to maximize uniqueness. If real customers see the same vendor 3× / month, the dataset should too. Per-vendor cap of 1 is almost always wrong — it tests "novel" decisions only and misses how the system handles repetition. Default cap: ~3 rows per vendor, ~1/3 of total dataset per single dominant vendor max.
Sizing heuristic. Aim for N where you can detect the smallest effect you'd act on:
| Pre-registered effect threshold | Min total N (rough) | Reasoning |
|---|---|---|
| ≥10pp absolute | ~30 | Tail risk small; even small N catches obvious wins/losses |
| ≥5pp absolute | ~80 | Within-judge variance often ~3-5pp; need 2-3× that to see signal |
| ≥2pp absolute | ~300+ | Below judge noise floor unless judge is extremely consistent |
Cheap tax: prod data is ample for most decisions. If your dataset feels small, ask why first — usually a too-aggressive dedupe rule or a stratification that drops the actual long tail.
Distribution audit BEFORE Phase 3. Once split is done, print per-class / per-tenant counts and compare against production reality:
- "SodaGift conf=2 should have ~30 rows in prod; my sample has 0" → dedupe ate them, repair the sampler
- "BTC has 2× the rows of Sodagift but Sodagift has 100× more prod traffic" → re-stratify
- The system you're testing operates on prod's distribution, not the sampler's. Mismatch here = experiment tests the wrong thing, no amount of careful Phase 4 fixes that.
Phase 2 — Define metric (before any run)
One primary metric. Maybe two secondaries. More metrics = more knobs to cherry-pick.
The metric should:
- Map to the real-world stake ("is the customer's question merged into the right group?", not "did the LLM emit valid JSON?")
- Be computable per-row, not just aggregate — per-row diffs are where signal lives
- Have a known baseline value before the experiment starts (compute baseline-on-dev first, write it down)
- Have a meaningful effect-size threshold ("≥5pp absolute" or "≥10% relative") — pre-registered, not chosen after
If the metric requires LLM-judge: define the rubric BEFORE seeing arm output. Otherwise the rubric drifts to favor whichever arm reads better.
Phase 3 — Run on dev (baseline + arms in parallel)
Pilot run first (N=1). Before launching all arms × all rows, run baseline + each arm on ONE row from train and read every output by eye:
- Did each arm produce a sensible decision shape?
- Are all metric fields populated (not null, not placeholder strings)?
- Are cost + latency + token counts captured for every arm, including baseline?
- Does the judge see the actual reasoning text, not a stub like
"(runCleanup single call)"?
If any field is null or placeholder for any arm → stop, fix instrumentation, repeat pilot. The cost of rerunning N=1 is seconds. The cost of discovering an instrumentation bug after a 55-row × 3-arm × 5-min run is the entire iteration. This is the single highest-leverage rule in this skill.
Run all arms (baseline + N variants) against the same dev set. Same inputs, same evaluator, same versions. Capture:
- Aggregate score per arm
- Per-row scores (so Phase 4 can diff)
- LLM cost + latency per row (often as important as accuracy)
- Variance across runs if non-deterministic (≥3 trials, report mean ± stdev)
Variance baseline (the noise floor). Before celebrating any score gap, establish the within-arm variance:
- Run baseline on the same dev set 3× with no code change between runs. Record mean ± stdev.
- If two arms share an identical sub-call (e.g. same step-1 prom