SSkilltecabyclaudinhocode
Enviar skill
← Voltar para o catálogo

auto-itera

Dados e Análise

Use when the user wants to autonomously search for the best AI/engineering approach across competing candidates (prompts, models, retrieval strategies, architectures, algorithms) — give it a goal + candidate arms + success threshold, it runs the experiment to a defensible ship-or-kill verdict. Autonomously handles sourcing real production data, scoring arms in parallel, diagnosing per-row, sprint-

2estrelas
Ver no GitHub ↗Autor: clfhaha1234Licença: MIT

Auto Itera

Overview

Autonomous experimentation engine for AI engineering decisions. You provide three things — a goal, the candidate approaches, and a success threshold. auto-itera does the rest: sources real production data, scores baseline + arms in parallel, diagnoses per-row, iterates in disciplined sprints, runs the held-out test pass, and writes a one-page conclusion doc with publication-quality charts.

The pivot: AI engineering decisions move from handcrafted trial-and-error to autonomous scientific search. Instead of opening a notebook and eyeballing diffs, you delegate the whole experimental loop and get back a verdict you can defend in a code review.

Internally the loop is a six-phase scientific method — Frame → Source → Metric → Run → Diagnose → Conclude — with sprint-and-generalize refinement (not unbounded iteration, not arbitrary hard caps). The autonomy lives in phases 1, 3, 4, and 5; phases 0 (Frame) and 2 (Metric pre-registration) are the user-provided inputs that anchor the search.

Honest expectation boundary: auto-itera autonomously runs the experiment given a goal and candidate arms. It does not autonomously brainstorm what to test — you (or the human asking Claude) still write the candidate prompts / pick the candidate models / define the metric. The autonomy is "from candidates to verdict", not "from blank slate to verdict".

flowchart LR
    F["Phase 0<br/>Frame"] --> S["Phase 1<br/>Source + Split"]
    S --> M["Phase 2<br/>Metric"]
    M --> R["Phase 3<br/>Run on dev"]
    R --> D["Phase 4<br/>Diagnose"]
    D -->|"3-iter sprint<br/>→ generalization gate<br/>→ continue or lock"| R
    D --> V["Phase 5<br/>Verdict on<br/>held-out test"]
    V --> C(["Conclusion doc<br/>+ charts"])

Core principle: Conclusions must come from data the experiment didn't tune on. Anything else is overfitting wearing a science costume.

When to Use

  • Competing implementation choices ("should we use 2 LLMs or 1?", "Sonnet or Haiku?", "this prompt or that?")
  • Production behavior is the ground truth (not synthetic benchmarks)
  • The win must generalize, not just look good on cherry-picked examples
  • A "ship or kill" decision is needed within ~1 day

When NOT to Use

  • Decision can be made from first-principles reasoning alone (e.g., "this code path is wrong because it ignores resetAt") — just fix it
  • No real production data available (synthetic-only experiments are weaker than thoughtful design)
  • The arms are mechanically the same (e.g., "v1.0 prompt vs v1.0 prompt with whitespace edits") — you have nothing to test

The Six Phases

Phase 0 — Frame the question (BEFORE touching data)

Write down, on paper:

FieldExample
Question"Is splitting runCleanup into decision-LLM + generation-LLM better than the current single-call approach?"
Hypothesis"Two-step is better when missed rows mostly fit existing questions (cheap decision call short-circuits expensive generation)."
BaselineCurrent production behavior, exactly as deployed. Not a strawman. Reference the file:line.
Arms1-3 concrete alternatives. Each must change ONE thing vs baseline. If you change two things you can't attribute the win.
Stop conditions"Arm wins if ≥X effect size on test set AND not overfit (per Phase 4 audit). Otherwise kill."

Lock these before collecting data. Decisions made AFTER seeing scores are how p-hacking happens.

Phase 1 — Source data + split

Source from production whenever possible. Real customer rows + real prod state. Synthetic data underestimates the long tail.

SetSizePurposeExamined?
Train / scratch~30%Develop arms, debug prompts, look at examplesfreely
Dev~50%Score iterations during Phase 3-4scores only — never read raw rows to tune
Test (held-out)~20%Final verdict in Phase 5NEVER opened until Phase 5; ONE pass only

Hard rule: Once you've seen a test-set row, it's contaminated. Move it to dev and reseal a fresh test set. No exceptions.

For small N (<50 rows total): use stratified k-fold on dev instead of a fixed split. Keep the test set sealed regardless.

Sampling philosophy — reflect, don't sanitize. When deduping by vendor / topic / class, the goal is to reflect the production distribution, not to maximize uniqueness. If real customers see the same vendor 3× / month, the dataset should too. Per-vendor cap of 1 is almost always wrong — it tests "novel" decisions only and misses how the system handles repetition. Default cap: ~3 rows per vendor, ~1/3 of total dataset per single dominant vendor max.

Sizing heuristic. Aim for N where you can detect the smallest effect you'd act on:

Pre-registered effect thresholdMin total N (rough)Reasoning
≥10pp absolute~30Tail risk small; even small N catches obvious wins/losses
≥5pp absolute~80Within-judge variance often ~3-5pp; need 2-3× that to see signal
≥2pp absolute~300+Below judge noise floor unless judge is extremely consistent

Cheap tax: prod data is ample for most decisions. If your dataset feels small, ask why first — usually a too-aggressive dedupe rule or a stratification that drops the actual long tail.

Distribution audit BEFORE Phase 3. Once split is done, print per-class / per-tenant counts and compare against production reality:

  • "SodaGift conf=2 should have ~30 rows in prod; my sample has 0" → dedupe ate them, repair the sampler
  • "BTC has 2× the rows of Sodagift but Sodagift has 100× more prod traffic" → re-stratify
  • The system you're testing operates on prod's distribution, not the sampler's. Mismatch here = experiment tests the wrong thing, no amount of careful Phase 4 fixes that.

Phase 2 — Define metric (before any run)

One primary metric. Maybe two secondaries. More metrics = more knobs to cherry-pick.

The metric should:

  1. Map to the real-world stake ("is the customer's question merged into the right group?", not "did the LLM emit valid JSON?")
  2. Be computable per-row, not just aggregate — per-row diffs are where signal lives
  3. Have a known baseline value before the experiment starts (compute baseline-on-dev first, write it down)
  4. Have a meaningful effect-size threshold ("≥5pp absolute" or "≥10% relative") — pre-registered, not chosen after

If the metric requires LLM-judge: define the rubric BEFORE seeing arm output. Otherwise the rubric drifts to favor whichever arm reads better.

Phase 3 — Run on dev (baseline + arms in parallel)

Pilot run first (N=1). Before launching all arms × all rows, run baseline + each arm on ONE row from train and read every output by eye:

  • Did each arm produce a sensible decision shape?
  • Are all metric fields populated (not null, not placeholder strings)?
  • Are cost + latency + token counts captured for every arm, including baseline?
  • Does the judge see the actual reasoning text, not a stub like "(runCleanup single call)"?

If any field is null or placeholder for any arm → stop, fix instrumentation, repeat pilot. The cost of rerunning N=1 is seconds. The cost of discovering an instrumentation bug after a 55-row × 3-arm × 5-min run is the entire iteration. This is the single highest-leverage rule in this skill.

Run all arms (baseline + N variants) against the same dev set. Same inputs, same evaluator, same versions. Capture:

  • Aggregate score per arm
  • Per-row scores (so Phase 4 can diff)
  • LLM cost + latency per row (often as important as accuracy)
  • Variance across runs if non-deterministic (≥3 trials, report mean ± stdev)

Variance baseline (the noise floor). Before celebrating any score gap, establish the within-arm variance:

  • Run baseline on the same dev set 3× with no code change between runs. Record mean ± stdev.
  • If two arms share an identical sub-call (e.g. same step-1 prom

Como adicionar

/plugin marketplace add clfhaha1234/auto-itera

O comando exato pode variar conforme o repositório. Confira o README no GitHub.

Comentários · Nenhum comentário

Entre para comentar. Entrar

  • Ainda não há comentários. Seja o primeiro.