Workflow 1.5: Experiment Bridge

Implement and deploy experiments from plan: $ARGUMENTS

Overview

This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.

Workflow 1 output:                    This skill:                                    Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md   →   implement → GPT-5.5 review → deploy → collect → initial results ready
refine-logs/EXPERIMENT_TRACKER.md     code        (cross-model)    /run-experiment     for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md

Constants

CODE_REVIEW = true — GPT-5.5 xhigh reviews experiment code before deployment. Catches logic bugs before wasting GPU hours. Set false to skip.
AUTO_DEPLOY = true — Automatically deploy experiments after implementation + review. Set false to manually inspect code before deploying.
SANITY_FIRST = true — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
MAX_PARALLEL_RUNS = 4 — Maximum number of experiments to deploy in parallel (limited by available GPUs).
BASE_REPO = false — GitHub repo URL to use as base codebase. When set, clone the repo first and implement experiments on top of it. When false (default), write code from scratch or reuse existing project files.
COMPACT = false — When true, (1) read idea-stage/IDEA_CANDIDATES.md instead of full idea-stage/IDEA_REPORT.md if available, (2) append experiment results to EXPERIMENT_LOG.md after collection.

Override: /experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project
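
As a rough illustration of how the override string above could map onto the constants, here is a minimal sketch. It assumes overrides arrive as comma-separated `key: value` pairs, as in the example. `DEFAULTS` and `parse_overrides` are hypothetical names for illustration and are not part of the skill.

```python
# Defaults copied from the Constants list above.
DEFAULTS = {
    "CODE_REVIEW": True,
    "AUTO_DEPLOY": True,
    "SANITY_FIRST": True,
    "MAX_PARALLEL_RUNS": 4,
    "BASE_REPO": False,
    "COMPACT": False,
}


def parse_overrides(text):
    """Hypothetical helper: apply 'key: value, key: value' overrides to the defaults."""
    config = dict(DEFAULTS)
    for part in text.split(","):
        # Split on the first colon only, so URLs such as https://... stay intact.
        key, sep, value = part.partition(":")
        if not sep:
            continue
        name = key.strip().upper().replace(" ", "_")  # "base repo" -> "BASE_REPO"
        if name not in config:
            continue  # ignore keys that are not known constants
        value = value.strip()
        if value.lower() in ("true", "false"):
            config[name] = value.lower() == "true"
        elif value.isdigit():
            config[name] = int(value)
        else:
            config[name] = value
    return config
```

Calling `parse_overrides("compact: true, base repo: https://github.com/org/project")` would leave every other constant at its default.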

Inputs

This skill expects one or more of:

refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
refine-logs/FINAL_PROPOSAL.md — method description for implementation context
idea-stage/IDEA_CANDIDATES.md — compact idea summary (preferred when COMPACT: true) (fall back to ./IDEA_CANDIDATES.md if not found)
idea-stage/IDEA_REPORT.md — full brainstorm output (fall back to ./IDEA_REPORT.md if not found)

If none exist, ask the user what experiments to implement.
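
The lookup and fallback rules above can be sketched as follows. This is only an illustration: `CANDIDATES` and `resolve_inputs` are hypothetical names, and the handling of `compact` reflects one reading of the COMPACT constant (prefer IDEA_CANDIDATES.md over IDEA_REPORT.md when both exist).

```python
from pathlib import Path

# Inputs in the order listed above: (primary path, fallback path or None).
CANDIDATES = [
    ("refine-logs/EXPERIMENT_PLAN.md", None),
    ("refine-logs/EXPERIMENT_TRACKER.md", None),
    ("refine-logs/FINAL_PROPOSAL.md", None),
    ("idea-stage/IDEA_CANDIDATES.md", "IDEA_CANDIDATES.md"),
    ("idea-stage/IDEA_REPORT.md", "IDEA_REPORT.md"),
]


def resolve_inputs(root=".", compact=False):
    """Hypothetical helper: return the input files that exist under root."""
    root = Path(root)
    found = []
    for primary, fallback in CANDIDATES:
        for rel in (primary, fallback):
            if rel and (root / rel).is_file():
                found.append(root / rel)
                break  # primary wins; the fallback is only used if primary is missing
    # Assumption: in compact mode the full report is dropped when the summary exists.
    if compact and any(p.name == "IDEA_CANDIDATES.md" for p in found):
        found = [p for p in found if p.name != "IDEA_REPORT.md"]
    return found
```

An empty return value corresponds to the "ask the user" case.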

Workflow

Phase 1: Parse the Experiment Plan

Read EXPERIMENT_PLAN.md and extract:

Run order and milestones — which experiments run first (sanity → baseline → main → ablation → polish)
For each experiment block:
- Dataset / split / task
- Compared systems and variants
- Metrics to compute
- Setup details (backbone, hyperparameters, seeds)
- Success criterion
- Priority (MUST-RUN vs NICE-TO-HAVE)
Compute budget — total estimated GPU-hours
Method details from FINAL_PROPOSAL.md — what exactly to implement

Present a brief summary:

📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
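
The on-disk layout of EXPERIMENT_PLAN.md is not specified here, so the sketch below skips parsing and only shows one way the extracted fields might be held and turned into the summary above. `ExperimentBlock`, its fields, and `summarize` are hypothetical names, and the emoji is omitted.

```python
from dataclasses import dataclass


@dataclass
class ExperimentBlock:
    """Hypothetical container for one parsed experiment block."""
    name: str
    milestone: str   # e.g. "sanity", "baseline", "main", "ablation"
    priority: str    # "MUST-RUN" or "NICE-TO-HAVE"
    gpu_hours: float


def summarize(blocks):
    """Render the brief plan summary from a list of ExperimentBlock objects."""
    # dict.fromkeys de-duplicates while preserving run order.
    milestones = list(dict.fromkeys(b.milestone for b in blocks))
    must = sum(b.priority == "MUST-RUN" for b in blocks)
    nice = sum(b.priority == "NICE-TO-HAVE" for b in blocks)
    hours = sum(b.gpu_hours for b in blocks)
    return "\n".join([
        "Experiment plan loaded:",
        f"- Milestones: {len(milestones)} ({' → '.join(milestones)})",
        f"- Must-run experiments: {must}",
        f"- Nice-to-have: {nice}",
        f"- Estimated GPU-hours: {hours:g}",
    ])
```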

Phase 2: Implement Experiment Code

If BASE_REPO is set — clone the repo first:

git clone <BASE_REPO> base_repo/
# Read the repo's README, understand its structure, find entry points
# Implement experiments by modifying/extending this codebase

For each milestone (in order), write the experiment scripts:

Check existing code — scan the project (or cloned base_repo/) for existing experiment scripts, model code, data loaders. Reuse as much as possible.
Implement missing pieces:
- Training scripts with proper argparse (all hyperparameters configurable)
- Evaluation scripts computing the specified metrics
- Data loading / preprocessing if needed
- Baseline implementations if not already present
- Fixed random seeds for reproducibility
- Results saved to JSON/CSV for later analysis
- Proper logging (wandb if configured in CLAUDE.md)
Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
Self-review before deploying:
- Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
- Is the random seed fixed and controllable?
- Are results saved in a parseable format (JSON/CSV)?
- Does the code match FINAL_PROPOSAL.md's method description?
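
A minimal skeleton that satisfies the self-review checklist might look like the sketch below. The hyperparameter names, the default output path, and `train_and_eval` are placeholders chosen for illustration; the real flags should come from EXPERIMENT_PLAN.md. Only the standard library is used so that the sketch stays runnable.

```python
import argparse
import json
import random
from pathlib import Path


def parse_args(argv=None):
    # Every hyperparameter is exposed as a flag (names here are illustrative).
    p = argparse.ArgumentParser(description="Experiment entry point (sketch)")
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--batch-size", type=int, default=32)
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--out", type=Path, default=Path("results/run.json"))
    return p.parse_args(argv)


def set_seed(seed):
    random.seed(seed)
    # If NumPy or PyTorch are used, seed them here as well, e.g.
    # np.random.seed(seed) and torch.manual_seed(seed).


def train_and_eval(args):
    # Placeholder for the real training and evaluation loop.
    return {"dummy_metric": random.random()}


def main(argv=None):
    args = parse_args(argv)
    set_seed(args.seed)
    metrics = train_and_eval(args)
    # Store the full config next to the metrics so each run is self-describing.
    config = {k: str(v) if isinstance(v, Path) else v for k, v in vars(args).items()}
    record = {"config": config, "metrics": metrics}
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps(record, indent=2))
    return record
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard.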

Phase 2.5: Cross-Model Code Review (when CODE_REVIEW = true)

Skip this step if CODE_REVIEW is false.

Before deploying, send the experiment code to GPT-5.5 xhigh for review:

mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    Review the following experiment implementation for correctness.

    ## Experiment Plan:
    [paste key sections from EXPERIMENT_PLAN.md]

    ## Method Description:
    [paste from FINAL_PROPOSAL.md]

    ## Implementation:
    [paste the experiment scripts]

    Check for:
    1. Does the code correctly implement the method described in the proposal?
    2. Are all hyperparameters from the plan reflected in the code?
    3. Are there any logic bugs (wrong loss function, incorrect data split, missing eval)?
    4. Is the evaluation metric computed correctly?
    5. **CRITICAL: Does evaluation use the dataset's actual ground truth labels — NOT another model's output as ground truth?** This is a common and severe bug.
    6. Any potential issues (OOM risk, numerical instability, missing seeds)?

    For each issue found, specify: CRITICAL / MAJOR / MINOR and the exact fix.

On review results:

No CRITICAL issues → proceed to Phase 3
CRITICAL issues found → fix them, then re-submit for review (max 2 rounds)
Codex MCP unavailable → skip silently, proceed to Phase 3 (graceful degradation)
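
The control flow of the review rules above can be sketched as follows. `review_fn` and `fix_fn` are hypothetical callables standing in for the Codex call and the fix step. Signalling an unavailable reviewer with `ConnectionError` is an assumption, as is returning an "unresolved" status when CRITICAL issues remain after the last round, since the text does not say what happens then.

```python
def review_loop(code, review_fn, fix_fn, max_rounds=2):
    """Run up to max_rounds of review and fixing.

    review_fn(code) returns a list of (severity, message) tuples and is assumed
    to raise ConnectionError when the reviewer is unavailable.
    fix_fn(code, critical_messages) returns the revised code.
    Returns (code, status), where status is "passed", "skipped", or "unresolved".
    """
    for _ in range(max_rounds):
        try:
            issues = review_fn(code)
        except ConnectionError:
            return code, "skipped"  # graceful degradation: proceed without review
        critical = [msg for sev, msg in issues if sev == "CRITICAL"]
        if not critical:
            return code, "passed"
        code = fix_fn(code, critical)
    # Assumption: after the final round the caller decides how to proceed.
    return code, "unresolved"
```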

Phase 3: Sanity Check (if SANITY_FIRST = true)

Before deploying the full experiment suite, run the sanity-stage experiment:

/run-experiment [sanity experiment command]

Wait for completion. Verify:

Training loop runs without errors
Metrics are computed and saved correctly
GPU memory usage is within bounds
Output format matches expectations

If sanity fails → auto-debug before giving up (max 3 attempts):

Read the error — parse traceback, stderr, and log files
Diagnose — classify the failure:
- OOM → reduce batch size or enable gradient checkpointing
- ImportError → install missing package
- FileNotFoundError → fix path or download data
- CUDA error → check GPU availability, reduce model size
- NaN/divergence → reduce learning rate, check data preprocessing
Fix and re-run — apply the fix, re-run sanity
Attempt 2+ still failing? → Call in Codex rescue (if the Codex plugin is installed):
- Before the next retry, invoke /codex:rescue to get a second opinion on the root cause.
- Codex reads the code and error logs independently, so it may spot issues Claude missed (wrong tensor shapes, subtle import shadowing, config mismatches, etc.).
- Apply its suggested fix, then re-run.
- If /codex:rescue is not available (plugin not installed), continue with Claude's own diagnosis.
Still failing after 3 attempts? → stop, report the failure with all attempted fixes and error logs. Do not proceed with broken code.

Never give up on the first failure. Most experiment crashes are fixable without human intervention.
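
The diagnose-and-retry procedure above can be sketched as a small classifier plus a bounded loop. The regular expressions are assumptions about what typical error text looks like, and `run_fn` and `apply_fix` are hypothetical callables. The Codex rescue step is left out of this sketch.

```python
import re

# (pattern, failure class, suggested fix) in priority order. The OOM rule comes
# first because out-of-memory messages often also mention CUDA.
RULES = [
    (r"out of memory", "OOM", "reduce batch size or enable gradient checkpointing"),
    (r"ImportError|ModuleNotFoundError", "ImportError", "install missing package"),
    (r"FileNotFoundError", "FileNotFoundError", "fix path or download data"),
    (r"CUDA error", "CUDA error", "check GPU availability, reduce model size"),
    (r"\bnan\b|diverg", "NaN/divergence", "reduce learning rate, check data preprocessing"),
]


def classify_failure(log_text):
    """Return (failure class, suggested fix) for a traceback or log excerpt."""
    for pattern, label, fix in RULES:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return label, fix
    return "unknown", "inspect the traceback manually"


def run_sanity_with_retries(run_fn, apply_fix, max_attempts=3):
    """run_fn() returns (ok, log_text); apply_fix(label, fix) applies a fix.

    Returns (ok, history), where history lists every failed attempt.
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        ok, log_text = run_fn()
        if ok:
            return True, history
        label, fix = classify_failure(log_text)
        history.append((attempt, label, fix))
        if attempt < max_attempts:
            apply_fix(label, fix)
    return False, history  # stop and report; do not proceed with broken code
```

The returned `history` is what a failure report would contain: each attempt, its diagnosis, and the fix that was tried.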

Phase 4: Deploy Full Experiments

Deploy experiments following the plan's milestone order. Route by job count:

Small batch (≤5 jobs per milestone) → use /run-experiment directly:

/run-experiment [experiment commands]

Large batch (≥10 jobs, multi-seed sweeps, or phase dependencies) → use /experiment-queue for proper orchestration:

/experiment-queue [grid spec or manifest]
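
The two routes above can be expressed as one decision function. This is a sketch under an explicit assumption: the text does not say where milestones with 6 to 9 jobs go, so here they fall through to /run-experiment. `route` and its parameters are hypothetical names.

```python
def route(job_count, multi_seed=False, has_phase_dependencies=False):
    """Pick the deployment command for one milestone."""
    if job_count >= 10 or multi_seed or has_phase_dependencies:
        return "/experiment-queue"
    # Assumption: anything below the large-batch threshold is run directly,
    # including the 6 to 9 job range that the rules above leave unspecified.
    return "/run-experiment"
```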

Auto-routing rule: if any milestone in EXPERIMENT_PLAN.md declares ≥10 jobs (e.g., `seeds: [42,
