Autoresearch
Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.
You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.
This runs fully autonomously. Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.
Getting Started
Users arrive in different states. Determine which and proceed:
| User State | What to Do |
|---|---|
| Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap |
| Clear research question | Bootstrap directly |
| Existing plan or proposal | Review plan, set up workspace, enter loops |
| Resuming (research-state.yaml exists) | Read state, continue from where you left off |
If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.
Step 0 — before anything else: Set up the agent continuity loop. See Agent Continuity. This is MANDATORY. Without it, the research stops after one cycle.
Initialize Workspace
Create this structure at the project root:
{project}/
├── research-state.yaml # Central state tracking
├── research-log.md # Decision timeline
├── findings.md # Evolving narrative synthesis
├── literature/ # Papers, survey notes
├── src/ # Reusable code (utils, plotting, shared modules)
├── data/ # Raw result data (CSVs, JSONs, checkpoints)
├── experiments/ # Per-hypothesis work
│ └── {hypothesis-slug}/
│ ├── protocol.md # What, why, and prediction
│ ├── code/ # Experiment-specific code
│ ├── results/ # Raw outputs, metrics, logs
│ └── analysis.md # What we learned
├── to_human/ # Progress presentations and reports for human review
└── paper/ # Final paper (via ml-paper-writing)
src/: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.data/: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g.,trajectory_H1_runs001-010.csv). Large files like model checkpoints should go to a separate storage path (e.g.,/data/, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.
Initialize research-state.yaml, research-log.md, and findings.md from templates/. Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.
The Two-Loop Architecture
This is the core engine. Everything else supports it.
BOOTSTRAP (once, lightweight)
Scope question → search literature → form initial hypotheses
INNER LOOP (fast, autonomous, repeating)
Pick hypothesis → experiment → measure → record → learn → next
Goal: run constrained experiments with clear measurable outcomes
OUTER LOOP (periodic, reflective)
Review results → find patterns → update findings.md →
new hypotheses → decide direction
Goal: synthesize understanding, find the story — this is where novelty comes from
FINALIZE (when concluding)
Write paper via ml-paper-writing → final presentation → archive
The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results mean? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.
There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.
Research is Non-Linear
The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:
- Return to literature when results surprise you, assumptions break, or you need context for a new direction — always save what you find to
literature/ - Brainstorm new ideas using
21-research-ideation/skills when you're stuck or when results open unexpected questions - Pivot the question entirely if experiments reveal the original question was wrong or less interesting than what you found
This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.
Bootstrap: Literature and Hypotheses
Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.
-
Search literature for the research question. Use multiple sources — never stop at one:
- Exa MCP (
web_search_exa) if available — best for broad discovery and finding relevant papers quickly - Semantic Scholar (
pip install semanticscholar) — best for ML/AI papers, citation graphs, and specific paper lookup. See20-ml-paper-writingskill'sreferences/citation-workflow.mdfor complete API code examples - arXiv (
pip install arxiv) — best for recent preprints and open-access papers - CrossRef — best for DOI lookup and BibTeX retrieval
- Keep searching until you have good coverage. If one source comes up empty, try another with different keywords
Save everything to
literature/: For every paper you find, save a summary toliterature/— title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a runningliterature/survey.mdwith all summaries. This is your reference library — you and future sessions will need it throughout the project. - Exa MCP (
-
Identify gaps from the literature
- What's been tried? What hasn't? Where do existing methods break?
- What do Discussion sections flag as future work?
-
Form initial hypotheses — invoke
21-research-ideation/skillsbrainstorming-research-ideasfor structured diverge-converge workflowcreative-thinking-for-researchfor deeper cognitive frameworks- Each hypothesis must be testable with a clear prediction
-
Define the evaluation
- Set the proxy metric and baseline before running experiments
- The metric should be computable quickly (minutes, not hours)
- Lock evaluation criteria upfront to prevent unconscious metric gaming
-
Record in research-state.yaml, log the bootstrap in research-log.md
The Inner Loop
Rapid iteration with clear measurable outcomes. Two flavors:
- Optimization: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch.
- Discovery: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize.
1. Pick the highest-priority untested hypothesis
2. Write a protocol: what change, what prediction, why
Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
This creates tempo