Five Invariants (never violate)
- Single mutable surface: one hypothesis per iteration, one change per experiment
- Fixed eval budget: eval runs in bounded time, no network calls in gates
- One scalar metric: composite score drives keep/discard, not vibes
- Binary keep/discard: improved = keep, else revert with `git reset --hard HEAD~1`
- Git-as-memory: every experiment is a commit, discards are reverts, history is the log
Safety rules
- Never modify `.lab/` contents during hypothesis implementation
- Never skip eval: every commit must be evaluated before keep/discard
- Always revert on crash: an `atexit` handler restores git state
- Runner uses subscription auth (`claude -p` with `ANTHROPIC_API_KEY` stripped)
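The last two safety rules can be sketched in Python. This is a minimal sketch under stated assumptions, not the runner's actual code: `stripped_env` and `install_crash_revert` are hypothetical helper names, and the sketch assumes the runner knows which commit to fall back to before each experiment starts.

```python
import atexit
import os
import subprocess


def stripped_env(environ=None):
    """Return a copy of the environment without ANTHROPIC_API_KEY."""
    env = dict(os.environ if environ is None else environ)
    env.pop("ANTHROPIC_API_KEY", None)
    return env


def install_crash_revert(repo_root, safe_commit):
    """Register an atexit hook that resets the repo to safe_commit.

    Returns a disarm() callable. Call it after a clean keep/discard so
    the hook does nothing on a normal exit. (Hypothetical helper.)
    """
    state = {"armed": True}

    def _restore():
        if state["armed"]:
            subprocess.run(
                ["git", "reset", "--hard", safe_commit],
                cwd=repo_root,
                check=False,
            )

    def disarm():
        state["armed"] = False

    atexit.register(_restore)
    return disarm
```

A child process would then be launched with `subprocess.run([...], env=stripped_env())`. Note that `atexit` handlers do not run when the process is killed by an unhandled signal or exits through `os._exit()`, so a hook like this covers ordinary crashes only.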
Autoresearch
Scaffold and run autonomous code improvement loops in any git repo. The pattern: generate a hypothesis via claude -p, implement it, run programmatic eval gates, keep if the composite score improves, discard if it doesn't. Proven across 50+ iterations on two codebases (shadow-engine: 0.69 to 1.0, perplexity-clone: search quality optimization).
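The keep/discard step of that pattern might look like the sketch below. It is illustrative only: `decide` and `discard_last_commit` are hypothetical names, and the reading of "improves" (strictly higher, with a gain of at least `keep_threshold`) is an assumption rather than something this document specifies.

```python
import subprocess


def decide(baseline, candidate, keep_threshold=0.0):
    """Return "keep" or "discard" from two composite scores.

    Assumption: "improved" means the candidate is strictly higher and
    the gain is at least keep_threshold.
    """
    gain = candidate - baseline
    if gain > 0 and gain >= keep_threshold:
        return "keep"
    return "discard"


def discard_last_commit(repo_root):
    """Drop the experiment commit with git reset --hard HEAD~1."""
    subprocess.run(
        ["git", "reset", "--hard", "HEAD~1"],
        cwd=repo_root,
        check=True,
    )
```

With these helpers, an iteration ends with `if decide(best, score, threshold) == "discard": discard_last_commit(".")`, and otherwise `best` is updated to `score`.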
Category
Runbooks — mechanical process with clear steps, not cognitive reasoning.
Quick Start
/autoresearch init # scaffold .lab/ in your repo
/autoresearch run # start the loop (default: 50 iterations)
/autoresearch status # check progress
/autoresearch resume # recover interrupted run
Command Dispatch
Parse $ARGUMENTS and route:
| Argument | Action |
|---|---|
| `init` | Run scaffold workflow (see Init below) |
| `eval-gen` | Regenerate eval gates from repo analysis |
| `run [--max-iterations N] [--dry-run]` | Launch the autoresearch loop |
| `status` | Show composite, timeline, convergence signals |
| `resume` | Detect .lab/, present state, ask resume or fresh |
| (empty) | Show help text with available commands |
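The routing in the table above could be expressed as a small lookup. This is a sketch, not the skill's real dispatcher: `dispatch` is a hypothetical function, and sending unknown commands to the help text is an assumption the table does not state.

```python
import shlex

# Commands listed in the dispatch table.
COMMANDS = {"init", "eval-gen", "run", "status", "resume"}


def dispatch(arguments):
    """Map the raw $ARGUMENTS string to (action, remaining_args)."""
    tokens = shlex.split(arguments)
    if not tokens:
        return "help", []
    command, rest = tokens[0], tokens[1:]
    if command not in COMMANDS:
        # Assumption: unknown commands fall back to the help text.
        return "help", []
    return command, rest
```

For example, `dispatch("run --max-iterations 5 --dry-run")` returns `("run", ["--max-iterations", "5", "--dry-run"])`, leaving flag parsing to the selected workflow.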
Init Workflow (/autoresearch init)
- Verify `.git/` exists in current directory
- Run stack detection: `python3 ~/.claude/skills/autoresearch/scripts/detect_stack.py`
- Review the detected stack info (language, build_cmd, test_cmd, lint_cmd)
- Run the scaffold script: `python3 ~/.claude/skills/autoresearch/scripts/scaffold.py --repo-root . --yes`
- Review `.lab/config.json` and adjust `keep_threshold`, `max_iterations`, `gate_weights` if needed
- Edit `.lab/program.md`. This is the most important file. Add:
  - Specific areas to improve (not vague goals)
  - Concrete hypothesis list (ranked)
  - Constraints the agent must respect
- Run baseline eval to verify gates work: `python3 .lab/eval.py`
- Report the initial composite to the user
If .lab/ already exists, ask the user: resume existing lab, or archive to .lab.bak.<timestamp>/ and start fresh?
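If the user chooses a fresh start, the archive step might be implemented along these lines. This is a sketch: `archive_lab` is a hypothetical helper, and the `%Y%m%d-%H%M%S` timestamp format is an assumption, since the document only specifies `.lab.bak.<timestamp>/`.

```python
import shutil
import time
from pathlib import Path


def archive_lab(repo_root, now=None):
    """Move .lab/ to .lab.bak.<timestamp>/ and return the new path.

    Returns None when there is no .lab/ directory to archive.
    """
    root = Path(repo_root)
    lab = root / ".lab"
    if not lab.is_dir():
        return None
    # Assumed timestamp format; `now` (epoch seconds) is for testing.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    target = root / f".lab.bak.{stamp}"
    shutil.move(str(lab), str(target))
    return target
```

Moving the directory rather than copying it leaves the repo ready for a clean `init` while keeping the old results intact.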
Eval-Gen Workflow (/autoresearch eval-gen)
Regenerate eval gates without re-scaffolding everything:
python3 ~/.claude/skills/autoresearch/scripts/eval_gen.py --repo-root . --output .lab/eval.py
Review the generated gates. The user may want to:
- Add custom gates for domain-specific behavior
- Adjust tier weights in `.lab/config.json`
- Add behavioral gates that test specific CLI invocations or API endpoints
Gates follow a 4-tier architecture:
| Tier | Weight | What it measures | Anti-cheat |
|---|---|---|---|
| T1: Build+Test | 0.20 | Compiles, tests pass, lint clean | Runs real commands, sums pass counts |
| T2: Behavioral | 0.40 | Integration tests, CLI output, API responses | Validates content, not file existence |
| T3: Pipeline | 0.25 | Build artifacts, installs, real I/O | File size >1KB, header validation |
| T4: Documentation | 0.15 | Test count floor, doc coverage | Counts code, never trusts comments |
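One way to read the weights in this table is as a weighted average of per-tier scores. The sketch below assumes each tier reports a score in [0, 1] and that the composite is their weighted sum; the actual formula in `eval_base.py` may differ, and `composite` here is a hypothetical function.

```python
# Tier weights from the table above (they sum to 1.0).
TIER_WEIGHTS = {"T1": 0.20, "T2": 0.40, "T3": 0.25, "T4": 0.15}


def composite(tier_scores, weights=None):
    """Weighted average of per-tier scores, each assumed to be in [0, 1].

    Missing tiers count as 0.0. Dividing by the weight total keeps the
    result in [0, 1] even if custom weights do not sum to 1.
    """
    weights = TIER_WEIGHTS if weights is None else weights
    total = sum(weights.values())
    weighted = sum(w * tier_scores.get(tier, 0.0) for tier, w in weights.items())
    return weighted / total
```

Under this reading, a repo that passes only its behavioral gates scores about 0.40, which shows why T2 dominates the keep/discard decision.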
Run Workflow (/autoresearch run)
python3 .lab/runner.py --max-iterations 50
Or for a dry run (prints hypothesis, creates no files):
python3 .lab/runner.py --dry-run --max-iterations 1
Monitor progress in a separate terminal:
tail -f .lab/results.tsv
The runner:
- Loads config from `.lab/config.json`
- Reads program.md for constraints and hypothesis direction
- Creates an `autoresearch/{date}` branch
- Loops: hypothesis via `claude -p` -> implement via `claude -p` -> git commit -> eval -> keep/discard
- Logs every experiment to `.lab/results.tsv` with extended statuses:
| Status | Meaning |
|---|---|
| `KEEP` | Composite improved >= keep_threshold |
| `KEEP*` | Primary improved but secondary metric regressed |
| `DISCARD` | No improvement, reverted |
| `INTERESTING` | Negative result that reveals structure, logged to dead-ends |
| `CRASH` | Eval infrastructure failure, reverted |
| `TIMEOUT` | Experiment exceeded timeout, logged as crash |
- Checks 9 convergence signals after each experiment (see `references/convergence-signals.md`)
- Re-validates baseline every 10 real experiments
- Auto-generates `.lab/eval-report.md` with cumulative progress
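The logging step can be illustrated with a short sketch. The column names are the ones this document lists for `results.tsv` and the statuses come from the table above; the header row, the validation, and the `log_experiment` helper itself are assumptions for illustration.

```python
import csv
from pathlib import Path

# Column order as listed for results.tsv in the directory layout.
COLUMNS = [
    "experiment_id", "branch", "parent", "commit",
    "composite", "status", "duration_s", "description",
]
STATUSES = {"KEEP", "KEEP*", "DISCARD", "INTERESTING", "CRASH", "TIMEOUT"}


def log_experiment(path, row):
    """Append one experiment to a tab-separated log.

    Assumption: the file starts with a header row, written on first use.
    """
    if row["status"] not in STATUSES:
        raise ValueError(f"unknown status: {row['status']}")
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row per experiment keeps the file friendly to `tail -f` and lets a crashed run leave behind a complete record of everything evaluated so far.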
Status Workflow (/autoresearch status)
python3 ~/.claude/skills/autoresearch/scripts/report.py --repo-root .
Shows: composite (live), experiment timeline, keeps/discards/crashes, active convergence signals, branch genealogy, dead-ends.
Resume Workflow (/autoresearch resume)
- Check if `.lab/` exists
- If yes: read `config.json`, `results.tsv`, tail of `log.md`
- Present summary: objective, metrics, experiment count, current best vs baseline, last status
- Ask: resume (continue from last experiment) or fresh (archive to `.lab.bak.<timestamp>/`)
- If resume: check for stale lock file, clean up if needed, then run
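The stale-lock check in the last step might look like this on a POSIX system. It is a sketch with two assumptions: that `.runner.lock` holds the runner's PID as plain text, and that signal 0 is a safe liveness probe (true on Linux and macOS, not on Windows). The helper names are hypothetical.

```python
import os
from pathlib import Path


def pid_alive(pid):
    """POSIX-only liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def clear_stale_lock(lock_path):
    """Remove the lock if the PID it names is not running.

    Returns True when a stale lock was removed, False otherwise.
    """
    lock = Path(lock_path)
    if not lock.exists():
        return False
    try:
        pid = int(lock.read_text().strip())
    except ValueError:
        pid = None  # Unreadable contents are treated as stale.
    if pid is not None and pid <= 0:
        pid = None  # Non-positive PIDs address process groups, not a runner.
    if pid is not None and pid_alive(pid):
        return False
    lock.unlink()
    return True
```

A resume would call `clear_stale_lock(".lab/.runner.lock")` first and refuse to start if the lock file still exists afterwards, since that means another runner is live.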
.lab/ Directory Layout
.lab/ # gitignored — experiment knowledge store
  config.json      # All parameters (repo_name, build_cmd, keep_threshold, etc.)
  runner.py        # Customized runner (from runner_template.py)
  eval.py          # Generated + user-extended eval gates
  eval_base.py     # Base framework (gate registration, composite scoring)
  program.md       # Human-maintained constraints + priorities
  results.tsv      # Experiment log (experiment_id, branch, parent, commit,
                   # composite, status, duration_s, description)
  log.md           # Narrative per-experiment entries
  branches.md      # Branch registry
  dead-ends.md     # Falsified approaches + why they failed
  parking-lot.md   # Deferred ideas for later
  eval-report.md   # Auto-generated cumulative report
  runner-*.log     # Runner stdout/stderr logs
  .runner.lock     # PID lock file (prevents concurrent runs)
Why .lab/ not autoresearch/: Code state (git) and experiment knowledge (.lab/) are fully decoupled. git reset --hard HEAD~1 (the core discard mechanic) never touches .lab/. Results survive branch operations.
Three-Tier Output Protocol
Eval gates emit structured diagnostics to stderr:
GATE build=PASS # Binary — blocks iteration on FAIL
METRIC test_count=475 # Continuous — tracked in results.tsv
TRACE gate_duration_ms=3200 # Execution data — for debugging only
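A consumer of these diagnostics could parse them as follows. This is a sketch: `parse_diagnostics` is a hypothetical function, and it assumes the trailing `# ...` text in the listing above is annotation, that gate values are `PASS` or `FAIL`, and that metric and trace values are usually numeric.

```python
import re

# One diagnostic per line: KIND name=value, anything after is ignored.
LINE_RE = re.compile(r"^(GATE|METRIC|TRACE)\s+(\w+)=(\S+)")


def parse_diagnostics(stderr_text):
    """Group GATE/METRIC/TRACE lines from eval stderr by kind."""
    parsed = {"GATE": {}, "METRIC": {}, "TRACE": {}}
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # Ordinary stderr noise is skipped.
        kind, name, value = match.groups()
        if kind == "GATE":
            parsed[kind][name] = value == "PASS"
        else:
            try:
                parsed[kind][name] = float(value)
            except ValueError:
                parsed[kind][name] = value  # Keep non-numeric values as text.
    return parsed


def gates_pass(parsed):
    """True only when every reported gate passed."""
    return all(parsed["GATE"].values())
```

Keeping gates as booleans and metrics as floats mirrors the split in the protocol: a failed gate blocks the iteration outright, while metrics only feed the log.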
Scripts Reference
| Script | Purpose | Run from |
|---|---|---|
| `scripts/detect_stack.py` | Detect language, build system, test runner | Skill dir |
| `scripts/scaffold.py` | Create .lab/ with all files | Skill dir |
| `scripts/eval_gen.py` | Generate adversarial eval gates | Skill dir |
| `scripts/report.py` | Render status report | Skill dir |
| `scripts/runner_template.py` | Template copied to .lab/runner.py | Skill dir |
| `assets/eval_base.py` | Base eval framework copied to .lab/ | Skill dir |
| `assets/config.json.tmpl` | Config template with documented fields | Skill dir |
| `assets |