Autoresearch

Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.

You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.

This runs fully autonomously. Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.

Getting Started

Users arrive in different states. Determine which and proceed:

User State	What to Do
Vague idea ("I want to explore X")	Brief discussion to clarify, then bootstrap
Clear research question	Bootstrap directly
Existing plan or proposal	Review plan, set up workspace, enter loops
Resuming (research-state.yaml exists)	Read state, continue from where you left off

If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.

Step 0 — before anything else: Set up the agent continuity loop. See Agent Continuity. This is MANDATORY. Without it, the research stops after one cycle.

Initialize Workspace

Create this structure at the project root:

{project}/
├── research-state.yaml       # Central state tracking
├── research-log.md           # Decision timeline
├── findings.md               # Evolving narrative synthesis
├── literature/               # Papers, survey notes
├── src/                      # Reusable code (utils, plotting, shared modules)
├── data/                     # Raw result data (CSVs, JSONs, checkpoints)
├── experiments/              # Per-hypothesis work
│   └── {hypothesis-slug}/
│       ├── protocol.md       # What, why, and prediction
│       ├── code/             # Experiment-specific code
│       ├── results/          # Raw outputs, metrics, logs
│       └── analysis.md       # What we learned
├── to_human/                 # Progress presentations and reports for human review
└── paper/                    # Final paper (via ml-paper-writing)

src/: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.
data/: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., trajectory_H1_runs001-010.csv). Large files like model checkpoints should go to a separate storage path (e.g., /data/, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.

Initialize research-state.yaml, research-log.md, and findings.md from templates/. Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.

The Two-Loop Architecture

This is the core engine. Everything else supports it.

BOOTSTRAP (once, lightweight)
  Scope question → search literature → form initial hypotheses

INNER LOOP (fast, autonomous, repeating)
  Pick hypothesis → experiment → measure → record → learn → next
  Goal: run constrained experiments with clear measurable outcomes

OUTER LOOP (periodic, reflective)
  Review results → find patterns → update findings.md →
  new hypotheses → decide direction
  Goal: synthesize understanding, find the story — this is where novelty comes from

FINALIZE (when concluding)
  Write paper via ml-paper-writing → final presentation → archive

The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results mean? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.

There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.

Research is Non-Linear

The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:

Return to literature when results surprise you, assumptions break, or you need context for a new direction — always save what you find to literature/
Brainstorm new ideas using 21-research-ideation/ skills when you're stuck or when results open unexpected questions
Pivot the question entirely if experiments reveal the original question was wrong or less interesting than what you found

This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.

Bootstrap: Literature and Hypotheses

Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.

Search literature for the research question. Use multiple sources — never stop at one:
- Exa MCP (web_search_exa) if available — best for broad discovery and finding relevant papers quickly
- Semantic Scholar (pip install semanticscholar) — best for ML/AI papers, citation graphs, and specific paper lookup. See 20-ml-paper-writing skill's references/citation-workflow.md for complete API code examples
- arXiv (pip install arxiv) — best for recent preprints and open-access papers
- CrossRef — best for DOI lookup and BibTeX retrieval
- Keep searching until you have good coverage. If one source comes up empty, try another with different keywords
Save everything to literature/: For every paper you find, save a summary to literature/ — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running literature/survey.md with all summaries. This is your reference library — you and future sessions will need it throughout the project.
Identify gaps from the literature
- What's been tried? What hasn't? Where do existing methods break?
- What do Discussion sections flag as future work?
Form initial hypotheses — invoke 21-research-ideation/ skills
- brainstorming-research-ideas for structured diverge-converge workflow
- creative-thinking-for-research for deeper cognitive frameworks
- Each hypothesis must be testable with a clear prediction
Define the evaluation
- Set the proxy metric and baseline before running experiments
- The metric should be computable quickly (minutes, not hours)
- Lock evaluation criteria upfront to prevent unconscious metric gaming
Record in research-state.yaml, log the bootstrap in research-log.md

The Inner Loop

Rapid iteration with clear measurable outcomes. Two flavors:

Optimization: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch.
Discovery: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize.

1.  Pick the highest-priority untested hypothesis
2.  Write a protocol: what change, what prediction, why
    Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
    This creates tempo

autoresearch

How to add

Drop this on your repo README

Related skills

understand-dashboard

understand-chat

understand-domain

dev-browser

Get new Pesquisa e Web skills every Monday