Research Refine: Problem-Anchored, Elegant, Frontier-Aware Plan Refinement
Refine and concretize: $ARGUMENTS
Overview
Use this skill when the research problem is already visible but the technical route is still fuzzy. The goal is not to produce a bloated proposal or a benchmark shopping list. The goal is to turn a vague direction into a problem -> focused method -> minimal validation document that is concrete enough to implement, elegant enough to feel paper-worthy, and current enough to resonate in the foundation-model era.
Four principles dominate this skill:
- Do not lose the original problem. Freeze an immutable Problem Anchor and reuse it in every round.
- The smallest adequate mechanism wins. Prefer the minimal intervention that directly fixes the bottleneck.
- One paper, one dominant contribution. Prefer one sharp thesis plus at most one supporting contribution.
- Modern leverage is a prior, not a decoration. When LLM / VLM / Diffusion / RL / distillation / inference-time scaling naturally fit the bottleneck, use them concretely. Do not bolt them on as buzzwords.
User input (PROBLEM + vague APPROACH)
-> Phase 0 (Local step): Freeze Problem Anchor
-> Phase 1 (Local step): Scan grounding papers -> identify technical gap -> choose the sharpest route -> write focused proposal
-> Phase 2 (Codex, REVIEWER_MODEL): Review for fidelity, specificity, contribution quality, and frontier leverage
-> Phase 3 (Local step): Anchor check + simplicity check -> revise method -> rewrite full proposal
-> Phase 4 (Codex, same agent): Re-evaluate revised proposal
-> Repeat Phase 3-4 until OVERALL SCORE >= 9 or MAX_ROUNDS reached
-> Phase 5: Save full history to refine-logs/
-> Optional handoff: /experiment-plan for a detailed execution-ready experiment roadmap
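The review-revise loop in the flow above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `refine_loop`, `review`, and `revise` are hypothetical names, and the sketch assumes one round means one review followed by one revision, matching the round-N-review.md / round-N-refinement.md naming.

```python
from typing import Callable, List, Tuple

MAX_ROUNDS = 5
SCORE_THRESHOLD = 9.0


def refine_loop(
    proposal: str,
    review: Callable[[str], Tuple[float, str]],
    revise: Callable[[str, str], str],
    max_rounds: int = MAX_ROUNDS,
    threshold: float = SCORE_THRESHOLD,
) -> Tuple[str, List[float]]:
    """Alternate review and revision until the score clears the threshold
    or the round budget is used up. Returns the last proposal and all scores."""
    scores: List[float] = []
    for _ in range(max_rounds):
        # Reviewer step (Phase 2 on the first pass, Phase 4 afterwards).
        score, feedback = review(proposal)
        scores.append(score)
        if score >= threshold:
            break
        # Local step (Phase 3): rewrite the full proposal using the feedback.
        proposal = revise(proposal, feedback)
    return proposal, scores
```

Passing the reviewer and reviser in as callables keeps the stop condition testable with stubs before any real reviewer agent is wired in.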
Constants
- REVIEWER_MODEL = gpt-5.5 — Reviewer model used via a secondary Codex agent.
- MAX_ROUNDS = 5 — Maximum review-revise rounds.
- SCORE_THRESHOLD = 9 — Minimum overall score to stop.
- OUTPUT_DIR = refine-logs/ — Directory for round files and final report.
- MAX_LOCAL_PAPERS = 15 — Maximum local papers/notes to scan for grounding.
- MAX_CORE_EXPERIMENTS = 3 — Default cap for core validation blocks inside this skill.
- MAX_PRIMARY_CLAIMS = 2 — Soft cap for paper-level claims. Prefer one dominant claim plus one supporting claim.
- MAX_NEW_TRAINABLE_COMPONENTS = 2 — Soft cap for genuinely new trainable pieces. Exceed only if the paper breaks otherwise.
Override via argument if needed, e.g. /research-refine "problem | approach" -- max rounds: 3, threshold: 9.
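One way to hold these constants and apply overrides is sketched below. It assumes the options arrive as a comma-separated string of name: value pairs, as in the example; `RefineConfig`, `ALIASES`, and `apply_overrides` are hypothetical names, and only the two option names shown in the example are mapped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RefineConfig:
    reviewer_model: str = "gpt-5.5"
    max_rounds: int = 5
    score_threshold: float = 9.0
    output_dir: str = "refine-logs/"
    max_local_papers: int = 15
    max_core_experiments: int = 3
    max_primary_claims: int = 2
    max_new_trainable_components: int = 2


# Assumed mapping from the short option names in the example to config fields.
ALIASES = {"max rounds": "max_rounds", "threshold": "score_threshold"}


def apply_overrides(config: RefineConfig, options: str) -> RefineConfig:
    """Apply 'name: value' pairs separated by commas to a copy of the config."""
    updates = {}
    for pair in options.split(","):
        if not pair.strip():
            continue
        name, _, raw = pair.partition(":")
        field_name = ALIASES.get(name.strip().lower())
        if field_name is None:
            raise ValueError(f"unknown override: {name.strip()!r}")
        # Coerce the text to the type of the default (int or float here).
        current = getattr(config, field_name)
        updates[field_name] = type(current)(raw.strip())
    return replace(config, **updates)
```

For instance, `apply_overrides(RefineConfig(), "max rounds: 3, threshold: 9")` yields a config with `max_rounds` set to 3 and every other default unchanged.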
State Persistence (Checkpoint Recovery)
Long-running refinement sessions may fail mid-way (API timeout, context compaction, or session interruption). To avoid losing completed work, persist state to refine-logs/REFINE_STATE.json after each phase boundary:
{
"phase": "review",
"round": 1,
"agent_id": "019cd392-...",
"last_score": 6.5,
"last_verdict": "REVISE",
"status": "in_progress",
"timestamp": "2026-03-22T20:00:00"
}
Write after each completed phase. On completion, set "status": "completed".
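A checkpoint writer along these lines might look like the sketch below. The function name `write_state` and its parameters are illustrative assumptions. The temp-file-then-rename step is an added safeguard not required by the text above, so that an interrupted write cannot leave a half-written REFINE_STATE.json.

```python
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_state(
    output_dir: str,
    phase: str,
    round_no: int,
    agent_id: Optional[str] = None,
    last_score: Optional[float] = None,
    last_verdict: Optional[str] = None,
    status: str = "in_progress",
) -> Path:
    """Persist the refinement state as REFINE_STATE.json inside output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    state = {
        "phase": phase,
        "round": round_no,
        "agent_id": agent_id,
        "last_score": last_score,
        "last_verdict": last_verdict,
        "status": status,
        # Local time without timezone, matching the example timestamp format.
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    target = out / "REFINE_STATE.json"
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=out, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, target)
    return target
```

Calling `write_state("refine-logs", "review", 1, agent_id=..., last_score=6.5, last_verdict="REVISE")` after the first review would produce a file shaped like the example above.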
Output Structure
refine-logs/
├── REFINE_STATE.json
├── round-0-initial-proposal.md
├── round-1-review.md
├── round-1-refinement.md
├── round-2-review.md
├── round-2-refinement.md
├── ...
├── REVIEW_SUMMARY.md
├── FINAL_PROPOSAL.md
├── REFINEMENT_REPORT.md
└── score-history.md
Every round-N-refinement.md must contain a full anchored proposal, not just incremental fixes.
Workflow
Initialization (Checkpoint Recovery)
Before starting any phase, check whether a previous run left a checkpoint:
- Check for refine-logs/REFINE_STATE.json:
  - If it does not exist → fresh start
  - If it exists and status is "completed" → fresh start
  - If it exists and status is "in_progress" but timestamp is older than 24 hours → fresh start
  - If it exists and status is "in_progress" within 24 hours → resume
- On resume:
  - Read all existing refine-logs/round-*.md files and score-history.md
  - Recover agent_id for reviewer continuity
  - Resume from the next phase based on the saved phase
- On fresh start, ensure refine-logs/ exists and proceed to Phase 0.
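The fresh-start versus resume decision above can be sketched as a single function. This is an illustration under assumptions: `load_resumable_state` is a hypothetical name, timestamps are assumed to be naive local ISO strings as in the state example, and an unreadable checkpoint is treated like a missing one, which the rules above do not specify.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

STALE_AFTER = timedelta(hours=24)


def load_resumable_state(output_dir: str, now: Optional[datetime] = None) -> Optional[dict]:
    """Return the saved state if the run should resume, or None for a fresh start."""
    path = Path(output_dir) / "REFINE_STATE.json"
    if not path.exists():
        return None  # no checkpoint: fresh start
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
        saved_at = datetime.fromisoformat(state["timestamp"])
    except (ValueError, KeyError, TypeError):
        # Assumption: a corrupt or incomplete checkpoint means a fresh start.
        return None
    if state.get("status") != "in_progress":
        return None  # completed run: fresh start
    now = now or datetime.now()
    if now - saved_at > STALE_AFTER:
        return None  # stale checkpoint: fresh start
    return state  # resume from state["phase"], reusing state["agent_id"]
```

Accepting `now` as a parameter makes the 24-hour rule easy to test without waiting or patching the clock.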
Phase 0: Freeze the Problem Anchor
Before proposing anything, extract the user's immutable bottom-line problem. This anchor must be copied verbatim into every proposal and every refinement round.
Write:
- Bottom-line problem: What technical problem must be solved?
- Must-solve bottleneck: What specific weakness in current methods is unacceptable?
- Non-goals: What is explicitly not the goal of this project?
- Constraints: Compute, data, time, tooling, venue, deployment limits.
- Success condition: What evidence would make the user say "yes, this method addresses the actual problem"?
If later reviewer feedback would change the problem being solved, mark that as drift and push back or adapt carefully.
Checkpoint: Write refine-logs/REFINE_STATE.json with {"phase": "anchor", "round": 0, "agent_id": null, "last_score": null, "last_verdict": null, "status": "in_progress", "timestamp": "<now>"}.
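As a rough sketch, the anchor can be held in an immutable record so that later rounds copy it rather than retype it. The class `ProblemAnchor`, its rendered layout, and the helper `anchor_is_verbatim` are illustrative assumptions, not a format this skill prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProblemAnchor:
    bottom_line_problem: str
    must_solve_bottleneck: str
    non_goals: List[str]
    constraints: List[str]
    success_condition: str

    def render(self) -> str:
        """Render the anchor as the text block copied into every proposal."""
        lines = [
            "Problem Anchor",
            f"- Bottom-line problem: {self.bottom_line_problem}",
            f"- Must-solve bottleneck: {self.must_solve_bottleneck}",
            f"- Non-goals: {'; '.join(self.non_goals)}",
            f"- Constraints: {'; '.join(self.constraints)}",
            f"- Success condition: {self.success_condition}",
        ]
        return "\n".join(lines)


def anchor_is_verbatim(anchor: ProblemAnchor, proposal_text: str) -> bool:
    """True if the proposal contains the rendered anchor unchanged."""
    return anchor.render() in proposal_text
```

A check like `anchor_is_verbatim` can run before each refinement file is saved, which catches silent rewording of the problem between rounds.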
Phase 1: Build the Initial Proposal
Step 1.1: Scan Grounding Material
Check papers/ and literature/ first. Read only the relevant parts needed to answer:
- What mechanism do current methods use?
- Where exactly do they fail for this problem?
- Which recent LLM / VLM / Diffusion / RL-era techniques are actually relevant here?
- What training objectives, representations, or interfaces are reusable?
- What details distinguish a real method from a renamed high-level idea?
If local material is insufficient, search recent top-venue/arXiv work online. Focus on method sections, training setup, and failure modes, not just abstracts.
Step 1.2: Identify the Technical Gap
Do not stop at generic research questions. Make the gap operational:
- Current pipeline failure point: where does the baseline break?
- Why naive fixes are insufficient: larger context, more data, prompting, memory bank, or stacking more modules.
- Smallest adequate intervention: what is the least additional mechanism that could plausibly fix the bottleneck?
- Frontier-native alternative: is there a more current route using foundation-model-era primitives that better matches the bottleneck?
- Core technical claim: what exact mechanism claim could survive top-venue scrutiny?
- Required evidence: what minimum proof is needed to defend that claim?
Step 1.3: Choose the Sharpest Route
Before locking the method, compare two candidate routes if both are plausible:
- Route A: Elegant minimal route — the smallest mechanism that directly targets the bottleneck.
- Route B: Frontier-native route — a more modern route that uses LLM / VLM / Diffusion / RL / distillation / inference-time scaling only if it gives a cleaner or stronger story.
Then decide:
- Which route is more likely to become a strong paper under the stated constraints?
- Which route has the cleaner novelty story relative to the closest work?
- Which route avoids contribution sprawl?
If both routes are weak, rethink the framing instead of combining them into a larger system by default.
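The three decision questions and the "rethink if both are weak" rule can be made explicit with a small record per route. This is only a sketch: the names `RouteAssessment` and `choose_route`, the equal weighting of the three questions, and the `min_passes=2` cutoff for "weak" are all assumptions added for illustration. The yes/no answers still come from human judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteAssessment:
    name: str
    strong_under_constraints: bool
    clean_novelty_story: bool
    avoids_contribution_sprawl: bool

    def passes(self) -> int:
        """Number of the three decision questions answered 'yes'."""
        return sum(
            [
                self.strong_under_constraints,
                self.clean_novelty_story,
                self.avoids_contribution_sprawl,
            ]
        )


def choose_route(
    a: RouteAssessment, b: RouteAssessment, min_passes: int = 2
) -> Optional[RouteAssessment]:
    """Pick the stronger route, or None when both are weak (rethink the framing)."""
    best = max((a, b), key=lambda route: route.passes())  # ties go to `a`
    return best if best.passes() >= min_passes else None
```

Returning None instead of a merged route mirrors the rule above: two weak routes are a signal to reframe, not to combine.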
Step 1.4: Concretize the Method First
The proposal must answer "how would we actually build this?" Prefer method detail over broad experimentation and prefer reuse over invention.
Cover:
- One-sentence method thesis: the single strongest mechanism claim.
- Contribution focus: one dominant contribution and at most one supporting contribution.
- Complexity budget: what is frozen or reused, what is new, and what tempting additions are intentionally excluded.
- System graph: modules, data flow, inputs, outputs.
- Representation design: what latent, embedding, plan token, reward signa