Claude Autoresearch — Autonomous Goal-directed Iteration
Inspired by Karpathy's autoresearch. Applies constraint-driven autonomous iteration to ANY work — not just ML research.
Core idea: You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat.
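The Modify → Verify → Keep/Discard → Repeat cycle can be sketched as a small driver function. This is a minimal illustration, not the actual implementation: `run_loop` and its `modify`, `verify`, and `discard` callables are hypothetical names, and stopping once a target is reached is left out for brevity.

```python
from typing import Callable


def run_loop(
    modify: Callable[[], None],
    verify: Callable[[], float],
    discard: Callable[[], None],
    maximise: bool,
    max_iters: int,
) -> float:
    """Run Modify -> Verify -> Keep/Discard and return the best metric seen."""
    best = verify()  # baseline, measured before any change is made
    for _ in range(max_iters):
        modify()                      # attempt one change
        score = verify()              # measure the metric after the change
        improved = score > best if maximise else score < best
        if improved:
            best = score              # keep the change
        else:
            discard()                 # revert the change
    return best
```

In this sketch a change is kept only on a strict improvement, so a change that leaves the metric unchanged is reverted.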
Subcommands
| Subcommand | Purpose |
|---|---|
| /autoresearch <goal> | Default path: parse free-form goal, build harness, capture baseline, loop until goal met |
| /autoresearch | Run the autonomous loop (default) |
| /autoresearch:plan | Interactive wizard to build Scope, Metric, Direction & Verify from a Goal |
| /autoresearch:security | Autonomous security audit: STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas) |
Default Path: /autoresearch <free-form goal>
When the user invokes /autoresearch <goal> with any free-form string after the command, parse the goal into seven slots, print the parsed-slot dump back for user visibility, then run the harness protocol in references/benchmark-harness.md before entering the loop.
Goal-parsing rubric:
| Slot | Extraction rule | Fallback |
|---|---|---|
| metric | First measurable noun (latency, reliability, coverage, flakiness, bundle size, p95, accuracy, error-rate, LOC, build time) | Ask user (1 sentence) |
| direction | reduce/lower/below/under/minimise/to 0% + cost-word → minimise; increase/raise/above/over/maximise/to 100% + quality-word → maximise | minimise for cost/time/size/error, maximise for coverage/score/throughput |
| target | Number + unit in goal (500ms, 95%, 0%, <200KB) | "best achievable" — unbounded loop |
| scope | Grep repo for goal's domain terms (API, test, build); propose globs | Whole repo minus node_modules, .venv, dist, target |
| corpus_source | If goal names inputs (signals, queries, PRs, logs) → find source; if absent → ASK, never fabricate | corpus_required=false only when metric is purely structural (LOC, build time, bundle size) |
| verify_cmd | Single shell command that prints metric: <float> on stdout — typically python benchmark.py or equivalent single-file rig | Constructed during harness build |
| regression_cmd | Auto-detect: first of pytest -q, npm test, cargo test, go test ./... whose config exists | Ask user |
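As a rough illustration of the direction and target rows, the keyword and number-plus-unit rules could be approximated with a few lines of Python. This is a simplified sketch under stated assumptions: `parse_direction`, `parse_target`, and the unit list are hypothetical, the cost-word and quality-word checks are omitted, and a `None` result stands for "use the fallback column".

```python
import re
from typing import Optional, Tuple

# Keywords taken from the direction row of the rubric.
MINIMISE_WORDS = {"reduce", "lower", "below", "under", "minimise"}
MAXIMISE_WORDS = {"increase", "raise", "above", "over", "maximise"}

# Number followed by a unit. The unit list is an assumption for illustration.
# The lookbehind stops "p95" from being read as the number 95.
TARGET_RE = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(ms|s|%|kb|mb)(?!\w)", re.IGNORECASE
)


def parse_direction(goal: str) -> Optional[str]:
    """Return 'minimise' or 'maximise' from the first matching keyword."""
    for word in re.findall(r"[a-z]+", goal.lower()):
        if word in MINIMISE_WORDS:
            return "minimise"
        if word in MAXIMISE_WORDS:
            return "maximise"
    return None  # caller applies the fallback rule


def parse_target(goal: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) for the first number+unit pair, if any."""
    match = TARGET_RE.search(goal)
    if match is None:
        return None  # "best achievable": unbounded loop
    return float(match.group(1)), match.group(2).lower()
```

Matching whole words keeps "coverage" from triggering the "over" keyword.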
Worked examples:
/autoresearch reduce API p95 latency to 200ms
→ metric=p95_latency_ms, direction=minimise, target=200, scope=src/api/**,
corpus_source=prod log tail or fixtures, verify_cmd=python benchmark.py,
regression_cmd=pytest -q
/autoresearch reduce test flakiness to 0%
→ metric=flaky_test_rate, direction=minimise, target=0, scope=tests/**,
corpus_source=CI run history, verify_cmd=python benchmark.py (N reruns),
regression_cmd=pytest -q
/autoresearch increase signal-parser reliability to 99%
→ metric=reliability, direction=maximise, target=0.99, scope=src/parser/**,
corpus_source=autoresearch/data/signals.jsonl, verify_cmd=python benchmark.py,
regression_cmd=pytest -q
Print the parsed slot dump to the user before any action — this is the single confirmation checkpoint before the harness protocol begins.
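One way to picture the slot dump is a small record holding the seven slots plus a formatter. The sketch below is illustrative only: `GoalSlots` and `format_slot_dump` are hypothetical names, and the real dump format may differ.

```python
from dataclasses import dataclass, fields


@dataclass
class GoalSlots:
    """The seven slots from the goal-parsing rubric (hypothetical container)."""
    metric: str
    direction: str
    target: str
    scope: str
    corpus_source: str
    verify_cmd: str
    regression_cmd: str


def format_slot_dump(slots: GoalSlots) -> str:
    """Render one aligned 'name = value' line per slot."""
    width = max(len(f.name) for f in fields(slots))
    return "\n".join(
        f"{f.name.ljust(width)} = {getattr(slots, f.name)}" for f in fields(slots)
    )


# Values copied from the first worked example above.
example = GoalSlots(
    metric="p95_latency_ms",
    direction="minimise",
    target="200",
    scope="src/api/**",
    corpus_source="prod log tail or fixtures",
    verify_cmd="python benchmark.py",
    regression_cmd="pytest -q",
)
print(format_slot_dump(example))
```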
/autoresearch:security — Autonomous Security Audit (v1.0.3)
Runs a comprehensive security audit using the autoresearch loop pattern. Generates a full STRIDE threat model, maps attack surfaces, then iteratively tests each vulnerability vector — logging findings with severity, OWASP category, and code evidence.
Load: references/security-workflow.md for full protocol.
What it does:
- Codebase Reconnaissance — scans tech stack, dependencies, configs, API routes
- Asset Identification — catalogs data stores, auth systems, external services, user inputs
- Trust Boundary Mapping — browser↔server, public↔authenticated, user↔admin, CI/CD↔prod
- STRIDE Threat Model — Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation of Privilege
- Attack Surface Map — entry points, data flows, abuse paths
- Autonomous Loop — iteratively tests each vector, validates with code evidence, logs findings
- Final Report — severity-ranked findings with mitigations, coverage matrix, iteration log
Key behaviors:
- Follows a red-team adversarial mindset with four personas (Security Adversary, Supply Chain, Insider Threat, Infra Attacker)
- Every finding requires code evidence (file:line + attack scenario) — no theoretical fluff
- Tracks OWASP Top 10 + STRIDE coverage, prints coverage summary every 5 iterations
- Composite metric: `(owasp_tested/10)*50 + (stride_tested/6)*30 + min(findings, 20)` (higher is better)
- Creates a `security/{YYMMDD}-{HHMM}-{audit-slug}/` folder with structured reports: `overview.md`, `threat-model.md`, `attack-surface-map.md`, `findings.md`, `owasp-coverage.md`, `dependency-audit.md`, `recommendations.md`, `security-audit-results.tsv`
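The composite metric listed under Key behaviors translates directly into a one-line function. A minimal sketch follows; the function name `composite_score` is an assumption, while the formula itself is taken from the text.

```python
def composite_score(owasp_tested: int, stride_tested: int, findings: int) -> float:
    """Coverage-weighted audit score; higher is better, capped at 100."""
    owasp_part = (owasp_tested / 10) * 50    # up to 50 points for OWASP Top 10 coverage
    stride_part = (stride_tested / 6) * 30   # up to 30 points for STRIDE coverage
    findings_part = min(findings, 20)        # one point per finding, capped at 20
    return owasp_part + stride_part + findings_part
```

Because findings are capped at 20, full OWASP and STRIDE coverage with 20 or more findings yields the maximum of 100.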
Flags:
| Flag | Purpose |
|---|---|
| --diff | Delta mode: only audit files changed since last audit |
| --fix | After audit, auto-fix confirmed Critical/High findings using autoresearch loop |
| --fail-on {severity} | Exit non-zero if findings meet threshold (for CI/CD gating) |
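The `--fail-on` gate can be pictured as a severity comparison that yields a process exit code. This sketch rests on two assumptions not confirmed by the text: a four-level scale (low, medium, high, critical) and "meet threshold" meaning "at or above the given severity". The name `gate_exit_code` is hypothetical.

```python
from typing import Iterable

# Assumed severity scale, lowest to highest; only Critical and High are named in the text.
SEVERITY_ORDER = ("low", "medium", "high", "critical")


def gate_exit_code(finding_severities: Iterable[str], fail_on: str) -> int:
    """Return 1 if any finding is at or above the fail_on severity, else 0."""
    floor = SEVERITY_ORDER.index(fail_on.lower())  # ValueError on an unknown level
    for severity in finding_severities:
        if SEVERITY_ORDER.index(severity.lower()) >= floor:
            return 1
    return 0
```

A CI step could pass the returned value to `sys.exit` so that the pipeline fails when the threshold is met.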
Usage:
# Unlimited — keep finding vulnerabilities until interrupted
/autoresearch:security
# Bounded — exactly 10 security sweep iterations
/loop 10 /autoresearch:security
# With focused scope
/autoresearch:security
Scope: src/api/**/*.ts, src/middleware/**/*.ts
Focus: authentication and authorization flows
# Delta mode — only audit changed files since last audit
/autoresearch:security --diff
# Auto-fix confirmed Critical/High findings after audit
/loop 15 /autoresearch:security --fix
# CI/CD gate — fail pipeline if any Critical findings
/loop 10 /autoresearch:security --fail-on critical
# Combined — delta audit + fix + gate
/loop 15 /autoresearch:security --diff --fix --fail-on critical
Inspired by:
- Strix: AI-powered security testing with proof-of-concept validation
- `/plan red-team`: adversarial review with hostile reviewer personas
- OWASP Top 10 (2021): industry-standard vulnerability taxonomy
- STRIDE: Microsoft's threat modeling framework
/autoresearch:plan — Goal → Configuration Wizard
Converts a plain-language goal into a validated, ready-to-execute autoresearch configuration.
Load: references/plan-workflow.md for full protocol.
Quick summary:
- Capture Goal — ask what the user wants to improve (or accept inline text)
- Analyze Context — scan codebase for tooling, test runners, build scripts
- Define Scope — suggest file globs, validate they resolve to real files
- Define Metric — suggest mechanical metrics, validate they output a number
- Define Direction — higher or lower is better
- Define Verify — construct the shell command, dry-run it, confirm it works
- Confirm & Launch — present the complete config, offer to launch immediately
Critical gates:
- Metric MUST be mechanical (outputs a parseable number, not subjective)
- Verify command MUST pass a dry run on the current codebase before accepting
- Scope MUST resolve to ≥1 file
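The three gates can be expressed as mechanical checks. The sketch below is an assumption-laden illustration: it borrows the `metric: <float>` output convention from the Default Path rubric, expects the caller to have already dry-run the verify command and captured its exit code and stdout, and uses the hypothetical names `parse_metric`, `scope_files`, and `check_gates`.

```python
import glob
import os
import re
from typing import List, Optional

# A line of the form "metric: <float>", as described for verify_cmd.
METRIC_RE = re.compile(
    r"^metric:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE
)


def parse_metric(stdout: str) -> Optional[float]:
    """Extract the metric value from verify output, or None if absent."""
    match = METRIC_RE.search(stdout)
    return float(match.group(1)) if match else None


def scope_files(globs: List[str], root: str) -> List[str]:
    """Return the files under root matched by any of the scope globs."""
    matched = set()
    for pattern in globs:
        for path in glob.glob(os.path.join(root, pattern), recursive=True):
            if os.path.isfile(path):
                matched.add(path)
    return sorted(matched)


def check_gates(
    globs: List[str], root: str, verify_returncode: int, verify_stdout: str
) -> List[str]:
    """Return a list of gate failures; an empty list means all gates pass."""
    problems = []
    if parse_metric(verify_stdout) is None:
        problems.append("metric is not a parseable number")
    if verify_returncode != 0:
        problems.append("verify command failed its dry run")
    if not scope_files(globs, root):
        problems.append("scope resolves to no files")
    return problems
```

Returning every failure at once, instead of stopping at the first, lets a wizard report all problems in a single pass.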
Usage:
/autoresearch:plan
Goal: Make the API respond faster
/autoresearch:plan Increase test coverage to 95%
/autoresearch:plan Reduce bundle size below 200KB
After the wizard completes, the user gets a ready-to-paste /autoresearch invocation — or can launch it directly.
When to Activate
- User invokes `/autoresearch <goal-string>` (anything after the command) → parse with Default Path rubric, then build harness per `references/benchmark-harness.md`
- User types `/autoresearch` with no argument → ask for a one-sentence goal OR suggest `/autoresearch:plan`
- User invokes `/autoresearch` or `/ug:autoresearch` → run the loop
- User invokes `/autoresearch:plan` → run the planning wizard
- User invokes `/autoresearch:secur