
autoresearch

Research and Web

Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat. Supports optional loop count via Claude Code's /loop command. Invoking /autoresearch <free-form goal> builds a real-data benchmark harness, captures a baseline, and iterates with a regression gate until the goal is hit.

12 stars
Author: Muminur · License: MIT

Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by Karpathy's autoresearch. Applies constraint-driven autonomous iteration to ANY work — not just ML research.

Core idea: You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat.
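The cycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `run_loop`, its callbacks (`modify`, `verify`, `revert`, `regression_ok`) and its parameters are hypothetical names introduced here.

```python
from typing import Callable, Optional


def run_loop(
    modify: Callable[[], None],
    verify: Callable[[], float],
    revert: Callable[[], None],
    regression_ok: Callable[[], bool] = lambda: True,
    minimise: bool = True,
    target: Optional[float] = None,
    max_iters: int = 10,
) -> float:
    """Modify -> Verify -> Keep/Discard -> Repeat. Returns the best metric kept."""
    best = verify()  # baseline, captured before any change
    for _ in range(max_iters):
        goal_met = target is not None and (
            best <= target if minimise else best >= target
        )
        if goal_met:
            break
        modify()                      # hypothetical: apply one candidate change
        score = verify()              # hypothetical: run the metric command
        improved = score < best if minimise else score > best
        if improved and regression_ok():
            best = score              # keep the change
        else:
            revert()                  # discard the change
    return best
```

A change is kept only when the metric improves and the regression gate passes; everything else is reverted, so the working state never gets worse than the best result seen so far.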

Subcommands

| Subcommand | Purpose |
|---|---|
| `/autoresearch <goal>` | Default path: parse free-form goal, build harness, capture baseline, loop until goal met |
| `/autoresearch` | Run the autonomous loop (default) |
| `/autoresearch:plan` | Interactive wizard to build Scope, Metric, Direction & Verify from a Goal |
| `/autoresearch:security` | Autonomous security audit: STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas) |

Default Path: /autoresearch <free-form goal>

When the user invokes /autoresearch <goal> with any free-form string after the command, parse the goal into seven slots, print the parsed-slot dump back for user visibility, then run the harness protocol in references/benchmark-harness.md before entering the loop.

Goal-parsing rubric:

| Slot | Extraction rule | Fallback |
|---|---|---|
| metric | First measurable noun (latency, reliability, coverage, flakiness, bundle size, p95, accuracy, error-rate, LOC, build time) | Ask user (1 sentence) |
| direction | reduce/lower/below/under/minimise/to 0% + cost-word → minimise; increase/raise/above/over/maximise/to 100% + quality-word → maximise | minimise for cost/time/size/error, maximise for coverage/score/throughput |
| target | Number + unit in goal (500ms, 95%, 0%, <200KB) | "best achievable" (unbounded loop) |
| scope | Grep repo for goal's domain terms (API, test, build); propose globs | Whole repo minus node_modules, .venv, dist, target |
| corpus_source | If goal names inputs (signals, queries, PRs, logs) → find source; if absent → ASK, never fabricate | corpus_required=false only when metric is purely structural (LOC, build time, bundle size) |
| verify_cmd | Single shell command that prints `metric: <float>` on stdout, typically `python benchmark.py` or equivalent single-file rig | Constructed during harness build |
| regression_cmd | Auto-detect: first of `pytest -q`, `npm test`, `cargo test`, `go test ./...` whose config exists | Ask user |
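The regression_cmd auto-detection could look roughly like the sketch below. The candidate order follows the rubric; the marker files used to decide that a "config exists" (`pytest.ini`, `package.json`, `Cargo.toml`, `go.mod`, and so on) are assumptions, since the source does not list them, and `detect_regression_cmd` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional

# Commands in the rubric's order. The marker files are assumptions:
# the source only says "whose config exists".
CANDIDATES = [
    ("pytest -q", ("pytest.ini", "pyproject.toml", "setup.cfg", "tox.ini")),
    ("npm test", ("package.json",)),
    ("cargo test", ("Cargo.toml",)),
    ("go test ./...", ("go.mod",)),
]


def detect_regression_cmd(repo_root: str) -> Optional[str]:
    """Return the first candidate whose marker file exists, else None."""
    root = Path(repo_root)
    for cmd, markers in CANDIDATES:
        if any((root / marker).is_file() for marker in markers):
            return cmd
    return None  # fallback per the rubric: ask the user
```

Returning `None` rather than guessing keeps the fallback explicit, so the caller can ask the user as the rubric specifies.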

Worked examples:

/autoresearch reduce API p95 latency to 200ms
→ metric=p95_latency_ms, direction=minimise, target=200, scope=src/api/**,
  corpus_source=prod log tail or fixtures, verify_cmd=python benchmark.py,
  regression_cmd=pytest -q

/autoresearch reduce test flakiness to 0%
→ metric=flaky_test_rate, direction=minimise, target=0, scope=tests/**,
  corpus_source=CI run history, verify_cmd=python benchmark.py (N reruns),
  regression_cmd=pytest -q

/autoresearch increase signal-parser reliability to 99%
→ metric=reliability, direction=maximise, target=0.99, scope=src/parser/**,
  corpus_source=autoresearch/data/signals.jsonl, verify_cmd=python benchmark.py,
  regression_cmd=pytest -q

Print the parsed slot dump to the user before any action — this is the single confirmation checkpoint before the harness protocol begins.
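As a rough illustration of the rubric, the sketch below extracts only the direction and target slots and formats a slot dump in the style of the worked examples. It is a simplification under stated assumptions: the function names and the unit list are hypothetical, and only the verb keywords are checked, not the "to 0%" / "to 100%" cost-word and quality-word rules.

```python
import re
from typing import Dict, Optional, Tuple

MINIMISE_WORDS = ("reduce", "lower", "below", "under", "minimise")
MAXIMISE_WORDS = ("increase", "raise", "above", "over", "maximise")
# Unit list is an assumption, taken from the units in the rubric's examples.
TARGET_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ms|s|%|KB|MB)(?![A-Za-z])")


def parse_direction(goal: str) -> Optional[str]:
    """Return 'minimise' or 'maximise' from the first matching keyword."""
    for word in re.findall(r"[a-z]+", goal.lower()):
        if word in MINIMISE_WORDS:
            return "minimise"
        if word in MAXIMISE_WORDS:
            return "maximise"
    return None  # caller falls back to the rubric's defaults


def parse_target(goal: str) -> Optional[Tuple[float, str]]:
    """Return (number, unit) for the first number followed by a known unit."""
    match = TARGET_RE.search(goal)
    return (float(match.group(1)), match.group(2)) if match else None


def format_slots(slots: Dict[str, object]) -> str:
    """Render the slot dump shown to the user before any action."""
    return ", ".join(f"{name}={value}" for name, value in slots.items())


if __name__ == "__main__":
    goal = "reduce API p95 latency to 200ms"
    print(format_slots({
        "direction": parse_direction(goal),
        "target": parse_target(goal),
    }))
```

Tokenising the goal into whole words avoids false hits such as "over" inside "coverage".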

/autoresearch:security — Autonomous Security Audit (v1.0.3)

Runs a comprehensive security audit using the autoresearch loop pattern. Generates a full STRIDE threat model, maps attack surfaces, then iteratively tests each vulnerability vector — logging findings with severity, OWASP category, and code evidence.

Load: references/security-workflow.md for full protocol.

What it does:

  1. Codebase Reconnaissance — scans tech stack, dependencies, configs, API routes
  2. Asset Identification — catalogs data stores, auth systems, external services, user inputs
  3. Trust Boundary Mapping — browser↔server, public↔authenticated, user↔admin, CI/CD↔prod
  4. STRIDE Threat Model — Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation of Privilege
  5. Attack Surface Map — entry points, data flows, abuse paths
  6. Autonomous Loop — iteratively tests each vector, validates with code evidence, logs findings
  7. Final Report — severity-ranked findings with mitigations, coverage matrix, iteration log

Key behaviors:

  • Follows red-team adversarial mindset (Security Adversary, Supply Chain, Insider Threat, Infra Attacker)
  • Every finding requires code evidence (file:line + attack scenario) — no theoretical fluff
  • Tracks OWASP Top 10 + STRIDE coverage, prints coverage summary every 5 iterations
  • Composite metric: (owasp_tested/10)*50 + (stride_tested/6)*30 + min(findings, 20) — higher is better
  • Creates security/{YYMMDD}-{HHMM}-{audit-slug}/ folder with structured reports: overview.md, threat-model.md, attack-surface-map.md, findings.md, owasp-coverage.md, dependency-audit.md, recommendations.md, security-audit-results.tsv
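The composite metric in the list above is simple arithmetic: OWASP coverage contributes up to 50 points, STRIDE coverage up to 30, and findings up to 20, for a maximum of 100. A minimal sketch of the formula, where `composite_score` is a hypothetical helper name:

```python
def composite_score(owasp_tested: int, stride_tested: int, findings: int) -> float:
    """(owasp_tested/10)*50 + (stride_tested/6)*30 + min(findings, 20)."""
    return (owasp_tested / 10) * 50 + (stride_tested / 6) * 30 + min(findings, 20)


if __name__ == "__main__":
    # Half of each taxonomy covered, 4 findings: 25 + 15 + 4
    print(composite_score(5, 3, 4))    # 44.0
    # Full coverage; findings are capped at 20: 50 + 30 + 20
    print(composite_score(10, 6, 25))  # 100.0
```

Because findings are capped at 20, coverage of the two taxonomies accounts for most of the score.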

Flags:

| Flag | Purpose |
|---|---|
| `--diff` | Delta mode: only audit files changed since last audit |
| `--fix` | After audit, auto-fix confirmed Critical/High findings using autoresearch loop |
| `--fail-on {severity}` | Exit non-zero if findings meet threshold (for CI/CD gating) |
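One plausible reading of `--fail-on` is "exit non-zero when any finding is at or above the given severity". The sketch below encodes that reading. The four-level ordering is an assumption (the source names only Critical and High), and `should_fail` and `exit_code` are hypothetical names.

```python
from typing import Iterable

# Assumed ordering, lowest to highest. The source names only Critical and High.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fail(finding_severities: Iterable[str], threshold: str) -> bool:
    """True if any finding is at or above the threshold severity."""
    limit = SEVERITY_ORDER.index(threshold.lower())
    return any(
        SEVERITY_ORDER.index(severity.lower()) >= limit
        for severity in finding_severities
    )


def exit_code(finding_severities: Iterable[str], threshold: str) -> int:
    """Process exit code for a CI/CD gate: 1 blocks the pipeline, 0 passes."""
    return 1 if should_fail(finding_severities, threshold) else 0
```

Under this reading, `--fail-on critical` would let a run with only High findings pass, while `--fail-on high` would block it.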

Usage:

# Unlimited — keep finding vulnerabilities until interrupted
/autoresearch:security

# Bounded — exactly 10 security sweep iterations
/loop 10 /autoresearch:security

# With focused scope
/autoresearch:security
Scope: src/api/**/*.ts, src/middleware/**/*.ts
Focus: authentication and authorization flows

# Delta mode — only audit changed files since last audit
/autoresearch:security --diff

# Auto-fix confirmed Critical/High findings after audit
/loop 15 /autoresearch:security --fix

# CI/CD gate — fail pipeline if any Critical findings
/loop 10 /autoresearch:security --fail-on critical

# Combined — delta audit + fix + gate
/loop 15 /autoresearch:security --diff --fix --fail-on critical

Inspired by:

  • Strix — AI-powered security testing with proof-of-concept validation
  • /plan red-team — adversarial review with hostile reviewer personas
  • OWASP Top 10 (2021) — industry-standard vulnerability taxonomy
  • STRIDE — Microsoft's threat modeling framework

/autoresearch:plan — Goal → Configuration Wizard

Converts a plain-language goal into a validated, ready-to-execute autoresearch configuration.

Load: references/plan-workflow.md for full protocol.

Quick summary:

  1. Capture Goal — ask what the user wants to improve (or accept inline text)
  2. Analyze Context — scan codebase for tooling, test runners, build scripts
  3. Define Scope — suggest file globs, validate they resolve to real files
  4. Define Metric — suggest mechanical metrics, validate they output a number
  5. Define Direction — higher or lower is better
  6. Define Verify — construct the shell command, dry-run it, confirm it works
  7. Confirm & Launch — present the complete config, offer to launch immediately

Critical gates:

  • Metric MUST be mechanical (outputs a parseable number, not subjective)
  • Verify command MUST pass a dry run on the current codebase before accepting
  • Scope MUST resolve to ≥1 file
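The three gates above could be checked mechanically along these lines. This is a sketch, not the wizard's actual code: it assumes the verify command prints a `metric: <float>` line as described in the goal-parsing rubric, and `parse_metric`, `scope_resolves`, and `dry_run` are hypothetical helper names.

```python
import glob
import re
import shlex
import subprocess
from typing import List, Optional

# Assumes the "metric: <float>" stdout convention from the goal-parsing rubric.
METRIC_LINE = re.compile(
    r"^metric:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE
)


def parse_metric(stdout: str) -> Optional[float]:
    """Gate 1: the metric must be a parseable number."""
    match = METRIC_LINE.search(stdout)
    return float(match.group(1)) if match else None


def dry_run(verify_cmd: str, timeout_s: float = 60.0) -> Optional[float]:
    """Gate 2: the verify command must succeed and yield a metric."""
    proc = subprocess.run(
        shlex.split(verify_cmd), capture_output=True, text=True, timeout=timeout_s
    )
    if proc.returncode != 0:
        return None
    return parse_metric(proc.stdout)


def scope_resolves(globs: List[str]) -> bool:
    """Gate 3: the scope globs must match at least one path."""
    return any(glob.glob(pattern, recursive=True) for pattern in globs)
```

A configuration would be accepted only when `dry_run` returns a number and `scope_resolves` is true; a `None` result would send the user back to the corresponding wizard step.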

Usage:

/autoresearch:plan
Goal: Make the API respond faster

/autoresearch:plan Increase test coverage to 95%

/autoresearch:plan Reduce bundle size below 200KB

After the wizard completes, the user gets a ready-to-paste /autoresearch invocation — or can launch it directly.

When to Activate

  • User invokes /autoresearch <goal-string> (anything after the command) → parse with Default Path rubric, then build harness per references/benchmark-harness.md
  • User types /autoresearch with no argument → ask for a one-sentence goal OR suggest /autoresearch:plan
  • User invokes /autoresearch or /ug:autoresearch → run the loop
  • User invokes /autoresearch:plan → run the planning wizard
  • User invokes `/autoresearch:secur

How to add

/plugin marketplace add Muminur/autoresearch-skill-Andrej-Karpathy

The exact command may vary by repository. Check the README on GitHub.
