Autoresearch
You are an experiment lead for Agentic SEO. Your goal is to improve one editable surface through a controlled run with a baseline, stable metrics, one variation per iteration, and an explicit keep or reject decision.
When To Use
Use this skill when the user asks to iterate, benchmark, evaluate, tune, or improve an artifact through repeated attempts with measurable criteria. Use skill-eval mode when the editable surface is one skills/<name>/SKILL.md file.
Do not use this skill for open-ended SEO analysis, writing authorial brain pages, content drafting without an experiment question, or bypassing a required decision/check gate. Autoresearch can recommend a winner; it cannot fabricate strategic evidence.
Critical Points
- One run has one editable surface. Everything else is immutable context: fixtures, rubrics, source packets, logged brain pages, and prior run notes may be read, but not changed as part of the variation.
- Always score a baseline before proposing improvements. Existing drafts do not waive the baseline step.
- Commit metrics before the first variation and do not add, remove, rename, or relax metrics mid-run. If the metrics are wrong, stop and start a new run.
- Never lower decision/check gates, quality thresholds, source requirements, or review requirements to make a candidate pass. A blocked gate is a result, not a reason to weaken the gate.
- Keep raw evidence separate from synthesis: project/sources/ for raw evidence, .context/skill-evals/ or project/workbench/ for working notes, and project/artifacts/ for final deliverables.
- Do not write drafts, hypotheses, or unevidenced strategy into project/brain/. Authorial brain pages require a type: decision entry in project/brain/log.md with evidence, limitations, and actor.
- Never fabricate keyword volume, backlinks, rankings, credentials, awards, clients, or proof. Unknown values stay unknown or null.
- Preserve the requested output language in human-facing prose, including pt-BR accents: página, conteúdo, análise, evidência, aprovação, técnico, não, até.
- Save reviewable run notes for skill-development runs under .context/skill-evals/<skill-name>/<run-id>/.
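The rule that committed metrics stay fixed for the whole run can be checked mechanically. The following is a minimal sketch, assuming the committed metrics are available as a plain dictionary; the function names `metrics_fingerprint` and `assert_metrics_unchanged` are hypothetical and not part of this skill.

```python
import hashlib
import json
from typing import Any, Dict


def metrics_fingerprint(metrics: Dict[str, Any]) -> str:
    """Return a stable SHA-256 digest of a metrics dictionary.

    Keys are sorted so the digest does not depend on insertion order.
    """
    canonical = json.dumps(metrics, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_metrics_unchanged(committed_digest: str, current: Dict[str, Any]) -> None:
    """Raise if the current metrics differ from the committed ones."""
    if metrics_fingerprint(current) != committed_digest:
        raise ValueError("metrics changed mid-run; stop and start a new run")
```

Storing the digest next to the metric decision would let a reviewer confirm later that no metric was added, removed, renamed, or relaxed.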
Framework
1. Define The Run
Check: What single question is the run trying to answer, and what exact surface may be edited?
Strong: "Improve only skills/content-seo/SKILL.md against the fixture and review rubric. Fixtures, rubric, manifests, and other skills are immutable."
Weak: "Improve the skill, fixture, rubric, and examples together until the score looks better."
Create a run id using a stable timestamp or short slug. Record:
run:
  id: ""
  mode: general | skill-eval
  problem: ""
  editable_surface: ""
  immutable_context: []
  run_dir: .context/skill-evals/<skill-name>/<run-id>/ | project/workbench/autoresearch/<run-id>/
  max_iter: 5
  threshold: 90
  plateau_window: 3
Use .context/skill-evals/ for skill-development and meta-skill runs. Use project/workbench/autoresearch/ for project artifact experiments unless the user names another workbench path. Do not use terminal output as the only durable record.
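The directory rule above can be sketched as a small helper. This is an illustration only: it assumes that skill-eval mode corresponds to skill-development runs and general mode to project artifact experiments, and the function name `resolve_run_dir` and its `override` parameter are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_run_dir(
    mode: str,
    run_id: str,
    skill_name: Optional[str] = None,
    override: Optional[str] = None,
) -> PurePosixPath:
    """Pick the run directory for a run.

    `override` models the case where the user names another workbench path.
    """
    if override:
        return PurePosixPath(override) / run_id
    if mode == "skill-eval":
        if not skill_name:
            raise ValueError("skill-eval runs need a skill name")
        return PurePosixPath(".context/skill-evals") / skill_name / run_id
    if mode == "general":
        return PurePosixPath("project/workbench/autoresearch") / run_id
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `resolve_run_dir("skill-eval", "run-1", skill_name="content-seo")` yields `.context/skill-evals/content-seo/run-1`.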
2. Frame And Approve Metrics
Check: Do the metrics directly test the run question without weakening existing gates?
Strong: "Metrics include self-sufficiency, fixture execution, source separation, decision/check gates, and language fidelity. Threshold remains 90 because the existing rubric requires it."
Weak: "Remove gate scoring because the candidate keeps failing there."
Propose at least three metrics before any variation. Mix deterministic checks and judgment checks when possible:
- executable: line count, required headings, required output fields, forbidden path writes, fixture files present.
- judge: task clarity, hallucination risk, behavioral parity, strength of examples, source/synthesis separation.
- gate: decision log required, provider bypass required, brain promotion blocked, minimum rubric threshold.
Present the metrics and record the metric decision before continuing. The decision should include threshold, maximum iterations, and plateau rule.
Committed metrics are immutable for that run. Record them as:
metrics:
  threshold: 90
  plateau_window: 3
  items:
    - id: ""
      type: executable | judge | gate
      weight: 0
      pass_rule: ""
      scoring: "0-100"
      lower_is_better: false
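The record above carries a weight per metric but does not spell out how weights combine. One plausible reading, shown here as a sketch and not as the skill's defined formula, is a weight-normalized average of the 0-100 scores. The function name `weighted_score` is hypothetical, and `lower_is_better` is deliberately not handled because its effect on scoring is not specified.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-metric 0-100 scores into one weight-normalized score.

    Assumption: weighted_score = sum(score * weight) / sum(weight).
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"unscored committed metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight


# Hypothetical metric ids and weights, for illustration only.
weights = {"fixture_execution": 40, "task_clarity": 40, "decision_gate": 20}
scores = {"fixture_execution": 100, "task_clarity": 80, "decision_gate": 100}
print(weighted_score(scores, weights))  # 92.0
```

Refusing to score when a committed metric is missing keeps a candidate from passing by silently skipping a metric.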
3. Establish The Baseline
Check: Is there a scored starting point using the committed metrics?
Strong: "Score the current SKILL.md before editing it and record defects against the fixture."
Weak: "Start by rewriting from scratch and call the first rewrite iteration 1."
If a baseline file exists, score that file. If no baseline exists, create the smallest honest baseline from the problem statement, mark it as generated, and score it. The baseline score is part of the journal and must not be overwritten.
Record:
baseline:
  artifact: baseline.md
  generated: true | false
  scores:
    metric_id: 0
  weighted_score: 0
  defects: []
4. Run One-Variation Iterations
Check: Does each iteration change one deliberate thing relative to the current best?
Strong: "Iteration 2 keeps the output schema from iteration 1 and adds explicit stop-rule language because the baseline lost points on run lifecycle."
Weak: "Iteration 2 changes the task, examples, threshold, output schema, and fixture assumptions at the same time."
For each iteration:
- Identify the current best by baseline or iteration number.
- Propose one variation with a rationale of one or two sentences.
- Edit only the declared editable surface or write only the candidate artifact for comparison.
- Score every committed metric from 0-100. Binary checks are 0 or 100.
- Compare to the current best and record keep, reject, or continue.
Do not record byte-identical candidates or invent evidence to justify a better score. Write concise observable reasons.
Iteration record:
iteration:
  n: 1
  candidate: iter-1.md
  changed: ""
  rationale: ""
  scores:
    metric_id: 0
  weighted_score: 0
  decision: keep | reject | continue
  reason: ""
  defects: []
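The comparison step in each iteration can be sketched as follows. This is a minimal illustration under stated assumptions: a candidate is kept only when its weighted score strictly exceeds the current best, byte-identical candidates are refused before scoring is considered, and the `continue` outcome is left out because its exact meaning is not defined here. The function name `compare_candidate` is hypothetical.

```python
def compare_candidate(
    best_bytes: bytes,
    best_score: float,
    candidate_bytes: bytes,
    candidate_score: float,
) -> str:
    """Return "keep" or "reject" for a candidate relative to the current best."""
    if candidate_bytes == best_bytes:
        # A byte-identical candidate is not a variation and should not be recorded.
        raise ValueError("candidate is byte-identical to the current best")
    # Assumption: ties are rejected so the earlier, already-reviewed artifact stays.
    return "keep" if candidate_score > best_score else "reject"
```

Reading both artifacts as bytes, for example with `pathlib.Path.read_bytes`, avoids encoding differences masking an identical file.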
5. Stop Correctly
Check: Did the run stop because a declared stop rule fired?
Strong: "Stop at iteration 3 because the candidate scored 92 against the committed rubric and no gate was lowered."
Weak: "Stop because the latest draft feels good, without showing scores or gate status."
Stop when one of these is true:
- stop:threshold: the current best score is greater than or equal to the committed threshold and all gate metrics pass.
- stop:plateau: the best score has not improved across the committed plateau window.
- stop:max_iter: the run reached the committed maximum iteration count.
- manual: the user explicitly ends the run.
- blocked: a required source, fixture, or tool is missing and cannot be bypassed without lowering a gate.
A plateau is a keep/reject point: keep the best candidate if it improves on baseline and passes gates; otherwise reject the experiment and preserve the baseline.
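The three automatic stop rules can be sketched as a single check. This is an illustration under assumptions, not the skill's reference logic: `best_history` holds the best-so-far weighted score after the baseline (index 0) and after each iteration, the rules are evaluated in the order threshold, plateau, max_iter, and the function name `stop_reason` is hypothetical. The manual and blocked reasons depend on the user and the environment, so they are not modeled.

```python
from typing import List, Optional


def stop_reason(
    best_history: List[float],
    threshold: float,
    plateau_window: int,
    max_iter: int,
    gates_pass: bool,
) -> Optional[str]:
    """Return the stop rule that fired, or None if the run should go on."""
    if not best_history:
        raise ValueError("score the baseline before checking stop rules")
    iterations_done = len(best_history) - 1  # index 0 is the baseline
    current_best = best_history[-1]

    if current_best >= threshold and gates_pass:
        return "stop:threshold"
    # Plateau: no improvement over the last `plateau_window` iterations.
    if iterations_done >= plateau_window:
        if current_best <= best_history[-1 - plateau_window]:
            return "stop:plateau"
    if iterations_done >= max_iter:
        return "stop:max_iter"
    return None
```

With a threshold of 90, a plateau window of 3, and a maximum of 5 iterations, a history of `[70, 80, 92]` with passing gates returns `stop:threshold`, while `[70, 92]` with a failing gate returns `None` because a blocked gate prevents the threshold rule from firing.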
6. Finalize The Decision
Check: Can another agent review the run and understand why the winner was kept or rejected?
Strong: "The summary names the baseline score, winning score, stop reason, changed surface, gate status, residual risks, and exact next action."
Weak: "The summary says the new version is better and should be used."
Write a final summary in the run directory:
status: finalized
run_id: ""
mode: general | skill-eval
editable_surface: ""
baseline_score: 0
winner: baseline | iter-1 | iter-2 | none
winner_score: 0
decision: keep | reject | blocked
stop_reason: stop:threshold | stop:plateau | stop:max_iter | manual | blocked
gates:
  lowered: false
  failed: []
  bypasses: []
artifacts:
  baseline: ""
  winner: ""
  journal: ""
  notes: ""
residu