Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by Karpathy's autoresearch. Applies constraint-driven autonomous iteration to ANY work — not just ML research.

Core idea: You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat.
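
The Modify → Verify → Keep/Discard → Repeat cycle can be sketched roughly as below. This is a minimal illustration, not the skill's actual implementation: `run_loop`, its callback parameters, and `max_iterations` are hypothetical names introduced here, and the callbacks stand in for whatever edits files, runs the verify command, and reverts a change.

```python
from typing import Callable, Optional


def run_loop(
    modify: Callable[[], None],
    verify: Callable[[], float],
    discard: Callable[[], None],
    direction: str = "minimise",
    target: Optional[float] = None,
    max_iterations: int = 10,
) -> float:
    """Run the keep/discard cycle and return the best metric value seen."""

    def better(new: float, old: float) -> bool:
        return new < old if direction == "minimise" else new > old

    def reached(value: float) -> bool:
        if target is None:  # no target means "best achievable"
            return False
        return value <= target if direction == "minimise" else value >= target

    best = verify()  # baseline, captured before any change
    for _ in range(max_iterations):
        if reached(best):
            break
        modify()              # apply one candidate change
        candidate = verify()  # measure it
        if better(candidate, best):
            best = candidate  # keep the change
        else:
            discard()         # revert the change
    return best
```

In this sketch a change is kept only when it strictly improves the metric, so a neutral change is reverted. That choice is an assumption; a real run might also keep changes that simplify the code without moving the metric.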

Subcommands

Subcommand	Purpose
`/autoresearch <goal>`	Default path — parse free-form goal, build harness, capture baseline, loop until goal met
`/autoresearch`	Run the autonomous loop (default)
`/autoresearch:plan`	Interactive wizard to build Scope, Metric, Direction & Verify from a Goal
`/autoresearch:security`	Autonomous security audit: STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas)

Default Path: /autoresearch <free-form goal>

When the user invokes /autoresearch <goal> with any free-form string after the command, parse the goal into the seven slots listed below, print the parsed slots back to the user, then run the harness protocol in references/benchmark-harness.md before entering the loop.

Goal-parsing rubric:

Slot	Extraction rule	Fallback
metric	First measurable noun (`latency`, `reliability`, `coverage`, `flakiness`, `bundle size`, `p95`, `accuracy`, `error-rate`, `LOC`, `build time`)	Ask user (1 sentence)
direction	`reduce/lower/below/under/minimise/to 0%` + cost-word → minimise; `increase/raise/above/over/maximise/to 100%` + quality-word → maximise	minimise for cost/time/size/error, maximise for coverage/score/throughput
target	Number + unit in goal (`500ms`, `95%`, `0%`, `<200KB`)	`"best achievable"` — unbounded loop
scope	Grep repo for goal's domain terms (`API`, `test`, `build`); propose globs	Whole repo minus `node_modules`, `.venv`, `dist`, `target`
corpus_source	If goal names inputs (signals, queries, PRs, logs) → find source; if absent → ASK, never fabricate	`corpus_required=false` only when metric is purely structural (LOC, build time, bundle size)
verify_cmd	Single shell command that prints `metric: <float>` on stdout — typically `python benchmark.py` or equivalent single-file rig	Constructed during harness build
regression_cmd	Auto-detect: first of `pytest -q`, `npm test`, `cargo test`, `go test ./...` whose config exists	Ask user
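
As a rough illustration of the direction and target rows in the rubric above, the sketch below pulls both slots out of a goal string. The function names, the keyword tuples, and the unit list in the regex are assumptions made for this example. It ignores the `to 0%` / `to 100%` hints and the cost-word and quality-word checks, and it returns the raw number and unit without normalising percentages.

```python
import re
from typing import Optional, Tuple

# Keywords taken from the rubric's direction row.
MINIMISE_WORDS = ("reduce", "lower", "below", "under", "minimise")
MAXIMISE_WORDS = ("increase", "raise", "above", "over", "maximise")

# A number followed by a unit. The lookbehind skips digits embedded in
# identifiers such as "p95". The unit list is illustrative only.
TARGET_RE = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(ms|s|%|kb|mb)(?![a-z])",
    re.IGNORECASE,
)


def parse_direction(goal: str) -> Optional[str]:
    """Return 'minimise' or 'maximise' from the first keyword, else None."""
    for word in re.findall(r"[a-z]+", goal.lower()):
        if word in MINIMISE_WORDS:
            return "minimise"
        if word in MAXIMISE_WORDS:
            return "maximise"
    return None  # caller applies the metric-based fallback


def parse_target(goal: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (value, unit) for the first number+unit, else (None, None)."""
    match = TARGET_RE.search(goal)
    if match is None:
        return None, None  # corresponds to "best achievable"
    return float(match.group(1)), match.group(2)
```

When `parse_direction` returns `None`, the rubric's fallback column applies: minimise for cost, time, size, and error metrics, maximise for coverage, score, and throughput.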

Worked examples:

/autoresearch reduce API p95 latency to 200ms
→ metric=p95_latency_ms, direction=minimise, target=200, scope=src/api/**,
  corpus_source=prod log tail or fixtures, verify_cmd=python benchmark.py,
  regression_cmd=pytest -q

/autoresearch reduce test flakiness to 0%
→ metric=flaky_test_rate, direction=minimise, target=0, scope=tests/**,
  corpus_source=CI run history, verify_cmd=python benchmark.py (N reruns),
  regression_cmd=pytest -q

/autoresearch increase signal-parser reliability to 99%
→ metric=reliability, direction=maximise, target=0.99, scope=src/parser/**,
  corpus_source=autoresearch/data/signals.jsonl, verify_cmd=python benchmark.py,
  regression_cmd=pytest -q

Print the parsed slot dump to the user before any action — this is the single confirmation checkpoint before the harness protocol begins.
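
One possible way to render the slot dump in the same shape as the worked examples is sketched below. The `Slots` dataclass and `format_slot_dump` are hypothetical names introduced for this example; only the seven slot names come from the rubric.

```python
from dataclasses import dataclass, fields


@dataclass
class Slots:
    """The seven slots from the goal-parsing rubric, held as display strings."""
    metric: str
    direction: str
    target: str
    scope: str
    corpus_source: str
    verify_cmd: str
    regression_cmd: str


def format_slot_dump(goal: str, slots: Slots) -> str:
    """Render the goal and its parsed slots for the confirmation checkpoint."""
    pairs = [f"{f.name}={getattr(slots, f.name)}" for f in fields(slots)]
    return f"/autoresearch {goal}\n→ " + ", ".join(pairs)
```

Because `fields()` preserves declaration order, the dump always lists the slots in the rubric's order, which makes successive dumps easy to compare.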

/autoresearch:security — Autonomous Security Audit (v1.0.3)

Runs a comprehensive security audit using the autoresearch loop pattern. Generates a full STRIDE threat model, maps attack surfaces, then iteratively tests each vulnerability vector — logging findings with severity, OWASP category, and code evidence.

Load: references/security-workflow.md for full protocol.

What it does:

Codebase Reconnaissance — scans tech stack, dependencies, configs, API routes
Asset Identification — catalogs data stores, auth systems, external services, user inputs
Trust Boundary Mapping — browser↔server, public↔authenticated, user↔admin, CI/CD↔prod
STRIDE Threat Model — Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation of Privilege
Attack Surface Map — entry points, data flows, abuse paths
Autonomous Loop — iteratively tests each vector, validates with code evidence, logs findings
Final Report — severity-ranked findings with mitigations, coverage matrix, iteration log

Key behaviors:

Follows a red-team adversarial mindset with four personas (Security Adversary, Supply Chain, Insider Threat, Infra Attacker)
Every finding requires code evidence (file:line + attack scenario) — no theoretical fluff
Tracks OWASP Top 10 + STRIDE coverage, prints coverage summary every 5 iterations
Composite metric: (owasp_tested/10)*50 + (stride_tested/6)*30 + min(findings, 20); the maximum is 100 and higher is better
Creates security/{YYMMDD}-{HHMM}-{audit-slug}/ folder with structured reports: overview.md, threat-model.md, attack-surface-map.md, findings.md, owasp-coverage.md, dependency-audit.md, recommendations.md, security-audit-results.tsv
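
The composite metric and the report folder naming from the list above can be sketched as follows. The function names are hypothetical, and the example assumes `owasp_tested` and `stride_tested` are counts of categories covered so far (out of 10 and 6 respectively).

```python
from datetime import datetime


def composite_score(owasp_tested: int, stride_tested: int, findings: int) -> float:
    """Coverage-weighted audit score: 50 for OWASP, 30 for STRIDE, up to 20 for findings."""
    return (owasp_tested / 10) * 50 + (stride_tested / 6) * 30 + min(findings, 20)


def audit_dir(slug: str, now: datetime) -> str:
    """Build the security/{YYMMDD}-{HHMM}-{audit-slug}/ folder path."""
    return f"security/{now.strftime('%y%m%d-%H%M')}-{slug}/"


if __name__ == "__main__":
    print(composite_score(5, 3, 4))
    print(audit_dir("api-auth", datetime(2025, 3, 7, 14, 5)))
```

Because the findings term is capped at 20, full coverage contributes 80 of the 100 available points, so the score rewards breadth of testing more than raw finding count.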

Flags:

Flag	Purpose
`--diff`	Delta mode — only audit files changed since last audit
`--fix`	After audit, auto-fix confirmed Critical/High findings using autoresearch loop
`--fail-on {severity}`	Exit non-zero if findings meet threshold (for CI/CD gating)
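
A minimal sketch of how `--fail-on` gating could map findings to an exit code is shown below. The source names only the Critical and High levels; the full four-level ordering, the "at or above the threshold" comparison, and the `exit_code` helper are all assumptions made for illustration.

```python
from typing import Iterable

# Assumed ordering, lowest to highest. Only "high" and "critical" appear
# in the source; "low" and "medium" are illustrative.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(finding_severities: Iterable[str], fail_on: str) -> int:
    """Return 1 if any finding is at or above the fail_on severity, else 0."""
    threshold = SEVERITY_ORDER.index(fail_on.lower())
    for severity in finding_severities:
        if SEVERITY_ORDER.index(severity.lower()) >= threshold:
            return 1
    return 0
```

A CI job would pass this value to the process exit so that a non-zero result fails the pipeline.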

Usage:

# Unlimited — keep finding vulnerabilities until interrupted
/autoresearch:security

# Bounded — exactly 10 security sweep iterations
/loop 10 /autoresearch:security

# With focused scope
/autoresearch:security
Scope: src/api/**/*.ts, src/middleware/**/*.ts
Focus: authentication and authorization flows

# Delta mode — only audit changed files since last audit
/autoresearch:security --diff

# Auto-fix confirmed Critical/High findings after audit
/loop 15 /autoresearch:security --fix

# CI/CD gate — fail pipeline if any Critical findings
/loop 10 /autoresearch:security --fail-on critical

# Combined — delta audit + fix + gate
/loop 15 /autoresearch:security --diff --fix --fail-on critical

Inspired by:

Strix — AI-powered security testing with proof-of-concept validation
/plan red-team — adversarial review with hostile reviewer personas
OWASP Top 10 (2021) — industry-standard vulnerability taxonomy
STRIDE — Microsoft's threat modeling framework

/autoresearch:plan — Goal → Configuration Wizard

Converts a plain-language goal into a validated, ready-to-execute autoresearch configuration.

Load: references/plan-workflow.md for full protocol.

Quick summary:

Capture Goal — ask what the user wants to improve (or accept inline text)
Analyze Context — scan codebase for tooling, test runners, build scripts
Define Scope — suggest file globs, validate they resolve to real files
Define Metric — suggest mechanical metrics, validate they output a number
Define Direction — higher or lower is better
Define Verify — construct the shell command, dry-run it, confirm it works
Confirm & Launch — present the complete config, offer to launch immediately

Critical gates:

Metric MUST be mechanical (outputs a parseable number, not subjective)
Verify command MUST pass a dry run on the current codebase before accepting
Scope MUST resolve to ≥1 file
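
The three gates above might be checked along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the verify command is assumed to print a `metric: <float>` line on stdout as described in the goal-parsing rubric, and the 300-second timeout is arbitrary.

```python
import re
import subprocess
from pathlib import Path
from typing import Iterable

# Matches a line of the form "metric: <float>" anywhere in the output.
METRIC_RE = re.compile(
    r"^metric:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$",
    re.MULTILINE,
)


def parse_metric(stdout: str) -> float:
    """Gate 1: the metric must be a parseable number."""
    match = METRIC_RE.search(stdout)
    if match is None:
        raise ValueError("no 'metric: <float>' line found in verify output")
    return float(match.group(1))


def dry_run(verify_cmd: str, timeout_s: int = 300) -> float:
    """Gate 2: the verify command must succeed on the current codebase."""
    result = subprocess.run(
        verify_cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    if result.returncode != 0:
        raise RuntimeError(f"verify command failed with exit code {result.returncode}")
    return parse_metric(result.stdout)


def scope_resolves(patterns: Iterable[str], root: str = ".") -> bool:
    """Gate 3: at least one glob pattern must match at least one file."""
    base = Path(root)
    return any(
        any(p.is_file() for p in base.glob(pattern)) for pattern in patterns
    )
```

Running `dry_run` before accepting a configuration surfaces a broken benchmark early, before the loop starts making changes against a metric it cannot read.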

Usage:

/autoresearch:plan
Goal: Make the API respond faster

/autoresearch:plan Increase test coverage to 95%

/autoresearch:plan Reduce bundle size below 200KB

After the wizard completes, the user gets a ready-to-paste /autoresearch invocation — or can launch it directly.

When to Activate

User invokes /autoresearch <goal-string> (anything after the command) → parse with Default Path rubric, then build harness per references/benchmark-harness.md
User types /autoresearch with no argument → ask for a one-sentence goal OR suggest /autoresearch:plan
User invokes /autoresearch or /ug:autoresearch → run the loop
User invokes /autoresearch:plan → run the planning wizard
User invokes `/autoresearch:secur
