Override for Codex users who want Gemini, not a second Codex agent, to act as the reviewer. Install this package after skills/skills-codex/*.
Research Idea Creator
Generate publishable research ideas for: $ARGUMENTS
Overview
Given a broad research direction from the user, systematically generate, validate, and rank concrete research ideas. This skill composes with /research-lit, /novelty-check, and /research-review to form a complete idea discovery pipeline.
Constants
-
PILOT_MAX_HOURS = 2 — Skip any pilot estimated to take > 2 hours per GPU. Flag as "needs manual pilot".
-
PILOT_TIMEOUT_HOURS = 3 — Hard timeout: kill pilots exceeding 3 hours. Collect partial results if available.
-
MAX_PILOT_IDEAS = 3 — Pilot at most 3 ideas in parallel. Additional ideas are validated on paper only.
-
MAX_TOTAL_GPU_HOURS = 8 — Total GPU budget for all pilots combined.
-
REVIEWER_MODEL = gemini-review — Gemini reviewer invoked through the local gemini-review MCP bridge for brainstorming and critique. Set GEMINI_REVIEW_MODEL if you need a specific Gemini model override.
-
OUTPUT_DIR = idea-stage/ — Directory for idea output files.
💡 Override via argument, e.g., /idea-creator "topic" — pilot budget: 4h per idea, 20h total.
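The way these constants interact can be sketched as a small launch gate. This is a minimal Python illustration, not part of the skill itself: the `PilotBudget` class and `plan_pilots` helper are hypothetical names, and the sketch assumes the total budget is counted in estimated GPU-hours supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotBudget:
    # Defaults mirror the constants listed above.
    pilot_max_hours: float = 2.0
    pilot_timeout_hours: float = 3.0
    max_pilot_ideas: int = 3
    max_total_gpu_hours: float = 8.0


def plan_pilots(estimates, budget=PilotBudget()):
    """Split ideas into pilots to launch, manual pilots, and paper-only ideas.

    `estimates` maps idea name -> estimated GPU-hours, in priority order.
    """
    launch, manual, paper_only = [], [], []
    spent = 0.0
    for idea, hours in estimates.items():
        if hours > budget.pilot_max_hours:
            # Too long for an automatic pilot: flag as "needs manual pilot".
            manual.append(idea)
        elif (len(launch) < budget.max_pilot_ideas
              and spent + hours <= budget.max_total_gpu_hours):
            launch.append(idea)
            spent += hours
        else:
            # Over the idea count or the total budget: validate on paper only.
            paper_only.append(idea)
    return launch, manual, paper_only
```

Under the same assumptions, the override shown above would correspond to `PilotBudget(pilot_max_hours=4, max_total_gpu_hours=20)`.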
Workflow
Phase 1: Landscape Survey (5-10 min)
Map the research area to understand what exists and where the gaps are.
-
Scan local paper library first: Check papers/ and literature/ in the project directory for existing PDFs. Read the first 3 pages of relevant papers to build a baseline understanding before searching online. This avoids re-discovering what the user already knows.
-
Search recent literature using WebSearch:
- Top venues in the last 2 years (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
- Recent arXiv preprints (last 6 months)
- Use 5+ different query formulations
- Read abstracts and introductions of the top 10-15 papers
-
Build a landscape map:
- Group papers by sub-direction / approach
- Identify what has been tried and what hasn't
- Note recurring limitations mentioned in "Future Work" sections
- Flag any open problems explicitly stated by multiple papers
-
Identify structural gaps:
- Methods that work in domain A but haven't been tried in domain B
- Contradictory findings between papers (opportunity for resolution)
- Assumptions that everyone makes but nobody has tested
- Scaling regimes that haven't been explored
- Diagnostic questions that nobody has asked
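One possible shape for the landscape map is sketched below in Python. The record keys (`title`, `direction`, `limitations`) and the `build_landscape` helper are assumptions made for illustration; the skill does not prescribe a data format.

```python
from collections import Counter, defaultdict


def build_landscape(papers, min_mentions=2):
    """Group papers by sub-direction and surface limitations named by several papers.

    Each paper is a dict with hypothetical keys:
    "title", "direction", and "limitations" (a list of short strings).
    """
    by_direction = defaultdict(list)
    mentions = Counter()
    for paper in papers:
        by_direction[paper["direction"]].append(paper["title"])
        # Count each limitation once per paper, even if it is listed twice.
        mentions.update(set(paper["limitations"]))
    # Limitations raised by at least `min_mentions` papers are candidate open problems.
    recurring = sorted(lim for lim, n in mentions.items() if n >= min_mentions)
    return dict(by_direction), recurring
```

The `recurring` list is only a starting point for the gap analysis; contradictory findings and untested assumptions still have to be identified by reading.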
Phase 2: Idea Generation (brainstorm with external LLM)
Use the local gemini-review MCP bridge for divergent thinking:
mcp__gemini-review__review_start:
  prompt: |
    You are a senior ML researcher brainstorming research ideas.
    Research direction: [user's direction]
    Here is the current landscape:
    [paste landscape map from Phase 1]
    Key gaps identified:
    [paste gaps from Phase 1]
    Generate 8-12 concrete research ideas. For each idea:
    1. One-sentence summary
    2. Core hypothesis (what you expect to find and why)
    3. Minimum viable experiment (what's the cheapest way to test this?)
    4. Expected contribution type: empirical finding / new method / theoretical result / diagnostic
    5. Risk level: LOW (likely works) / MEDIUM (50-50) / HIGH (speculative)
    6. Estimated effort: days / weeks / months
    Prioritize ideas that are:
    - Testable with moderate compute (8x RTX 3090 or less)
    - Likely to produce a clear positive OR negative result (both are publishable)
    - Not "apply X to Y" unless the application reveals genuinely surprising insights
    - Differentiated from the 10-15 papers above
    Be creative but grounded. A great idea is one where the answer matters regardless of which way it goes.
After this start call, immediately save the returned jobId and poll mcp__gemini-review__review_status with a bounded waitSeconds until done=true. Treat the completed status payload's response as the brainstorm output, and save the completed threadId for follow-up critique in Phase 4.
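The start-then-poll pattern can be sketched as a bounded loop. In the Python sketch below, `check_status` is a hypothetical stand-in for however the agent invokes mcp__gemini-review__review_status, and the payload is assumed to be a dict carrying the `done`, `response`, and `threadId` fields named above. The real bridge may return a different shape.

```python
def wait_for_review(check_status, job_id, wait_seconds=30, max_polls=40):
    """Poll a review job until it reports done, then return (response, thread_id).

    `check_status(job_id, wait_seconds)` is a hypothetical callable that wraps
    the status tool and is assumed to block for at most `wait_seconds`.
    """
    for _ in range(max_polls):
        status = check_status(job_id, wait_seconds)
        if status.get("done"):
            # Keep the thread id so Phase 4 can reply on the same thread.
            return status.get("response"), status.get("threadId")
    raise TimeoutError(f"review job {job_id} not done after {max_polls} polls")
```

Capping the number of polls keeps a stalled job from blocking the pipeline indefinitely; the values of `wait_seconds` and `max_polls` here are illustrative, not defaults of the bridge.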
Phase 3: Mechanical consolidation + objective feasibility gate
This phase does NOT judge idea quality, novelty, or impact — those are the job of the Phase-4 cross-model reviewer (a different model family). Dropping ideas here on a same-family novelty or impact call would pre-filter the reviewer's input with same-family judgment — the opposite of why ARIS uses a cross-model reviewer at all. Phase 3 only (a) clusters near-duplicate ideas and (b) drops ideas that are OBJECTIVELY out of budget; everything else passes through ANNOTATED, not eliminated.
-
Objective feasibility gate (safe to gate here): drop an idea ONLY on a mechanical, budget-based fact — estimated compute > 1 week of available GPU time, OR a dataset that is provably unavailable. Do NOT drop on "implementation looks complex" — annotate complexity instead.
-
Novelty signal — ANNOTATE, do not eliminate: do 2-3 targeted searches and attach a prior_work note (what looks related, with links). This is input for the Phase-4 reviewer, not a filter; the full /novelty-check runs in Phase 4. Do NOT drop an idea here because it "might already be done."
-
Impact signal — ANNOTATE, do not eliminate: attach a one-line so_what note (why the result would matter either way). Do NOT drop on a same-family "a reviewer wouldn't care" call — that is exactly what the Phase-4 cross-model reviewer is for.
Every feasible, non-duplicate idea — with its prior_work and so_what
annotations — proceeds to Phase 4, where the cross-model reviewer does the
quality/novelty narrowing.
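The gate-versus-annotate split can be made concrete with a short Python sketch. The `Idea` record and `phase3_gate` helper are hypothetical, near-duplicate clustering is omitted, and "1 week of available GPU time" is assumed here to mean one GPU for seven days. Adjust the threshold to the actual hardware.

```python
from dataclasses import dataclass, field

WEEK_GPU_HOURS = 7 * 24  # assumption: one GPU running for one week


@dataclass
class Idea:
    summary: str
    est_gpu_hours: float
    dataset_available: bool = True
    notes: dict = field(default_factory=dict)


def phase3_gate(ideas, available_gpu_hours=WEEK_GPU_HOURS):
    """Drop ideas only on objective budget facts; annotate everything else."""
    kept, dropped = [], []
    for idea in ideas:
        if idea.est_gpu_hours > available_gpu_hours:
            dropped.append((idea, "compute exceeds budget"))
        elif not idea.dataset_available:
            dropped.append((idea, "dataset unavailable"))
        else:
            # Placeholders only: the notes are filled in by searching, not judged here.
            idea.notes.setdefault("prior_work", "TODO: 2-3 targeted searches")
            idea.notes.setdefault("so_what", "TODO: why the result matters either way")
            kept.append(idea)
    return kept, dropped
```

Note that the function has no branch for novelty or impact: those fields travel with the idea as annotations and are never used as drop conditions.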
Phase 4: Deep Validation (for top ideas)
For each surviving idea, run a deeper evaluation:
-
Novelty check: Use the /novelty-check workflow (multi-source search + Gemini cross-verification) for each idea.
-
Critical review: Use mcp__gemini-review__review_reply_start with the saved completed threadId:
mcp__gemini-review__review_reply_start:
  threadId: [saved completed threadId from Phase 2]
  prompt: |
    Here are our top ideas after filtering:
    [paste surviving ideas with novelty check results]
    For each, play devil's advocate:
    - What's the strongest objection a reviewer would raise?
    - What's the most likely failure mode?
    - How would you rank these for a top venue submission?
    - Which 2-3 would you actually work on?
After this start call, immediately save the returned jobId and poll mcp__gemini-review__review_status with a bounded waitSeconds until done=true. Treat the completed status payload's response as the follow-up critique.
-
Combine rankings: Merge your assessment with Gemini's ranking. Select top 2-3 ideas for pilot experiments.
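The skill does not specify how the two rankings are merged. One simple option, shown here purely as an assumption, is to sum rank positions and break ties in favor of the reviewer's order; `combine_rankings` is a hypothetical helper.

```python
def combine_rankings(own, reviewer, top_k=3):
    """Merge two ranked lists (best first) by summed rank position.

    Ties are broken by the reviewer's rank, then alphabetically.
    Ideas missing from one list receive the worst possible rank in that list.
    """
    ideas = set(own) | set(reviewer)
    worst = len(ideas)

    def rank(ranking, idea):
        return ranking.index(idea) if idea in ranking else worst

    ordered = sorted(
        ideas,
        key=lambda i: (rank(own, i) + rank(reviewer, i), rank(reviewer, i), i),
    )
    return ordered[:top_k]
```

For example, `combine_rankings(["A", "B", "C", "D"], ["B", "A", "D", "C"])` returns `["B", "A", "D"]`: A and B tie on summed rank and the reviewer prefers B, while D beats C on the same tie-break.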
Phase 5: Parallel Pilot Experiments (for top 2-3 ideas)
Before committing to a full research effort, run cheap pilot experiments to get empirical signal. This is the key differentiator from paper-only validation.
-
Design pilots: For each top idea, define the minimal experiment that would give a positive or negative signal:
- Single seed, small scale (e.g., small dataset subset, fewer epochs)
- Target: 30 min to PILOT_MAX_HOURS per pilot on 1 GPU
- Estimate GPU-hours BEFORE launching. If estimated time > PILOT_MAX_HOURS, reduce scale (fewer epochs, smaller subset) or flag as "needs manual pilot"
- Clear success metric defined upfront (e.g., "if metric improves by > 1%, signal is positive")
-
Deploy in parallel: Use /run-experiment to launch pilots on different GPUs simultaneously:
GPU 0: Pilot for Idea 1
GPU 1: Pilot for Idea 2
GPU 2: Pilot for Idea 3
Use run_in_background: true to launch all at once.
-
Collect results: Use /monitor-experiment to check progress. If any pilot exceeds PILOT_TIMEOUT_HOURS, ki