Workflow 1.5: Experiment Bridge

Implement and deploy experiments from plan: $ARGUMENTS

Overview

This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.

Workflow 1 output:                    This skill:                        Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md   →   implement → deploy → collect   →   initial results ready
refine-logs/EXPERIMENT_TRACKER.md     code        /run-experiment        for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md

Constants

AUTO_DEPLOY = true — Automatically deploy experiments after implementation. Set false to review code before deploying.
CODE_REVIEW = true — Secondary Codex reviewer with xhigh reasoning reviews experiment code before deployment. Catches logic bugs before wasting GPU hours. Set false to skip.
SANITY_FIRST = true — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
MAX_PARALLEL_RUNS = 4 — Maximum number of experiments to deploy in parallel (limited by available GPUs).
BASE_REPO = false — Optional GitHub repo URL to use as a base codebase; false means no base repo. When set, clone it first and implement experiments on top of it.
COMPACT = false — When true, prefer idea-stage/IDEA_CANDIDATES.md over the full idea-stage/IDEA_REPORT.md, and append completed runs to EXPERIMENT_LOG.md.
BACKENDS = local | ssh | vast | modal — Preserve the Claude mainline backend lifecycle. Vast.ai and Modal routes are first-class when configured; do not silently fall back to local execution if the user requested either backend.
RESCUE_ON_FAILURE = true — If sanity or deployment fails, run a Codex-native rescue / second opinion review before abandoning the experiment plan.

Override: /experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project

Inputs

This skill expects one or more of:

refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
refine-logs/FINAL_PROPOSAL.md — method description for implementation context
idea-stage/IDEA_CANDIDATES.md — compact idea summary, preferred when COMPACT = true; falls back to ./IDEA_CANDIDATES.md if not found
idea-stage/IDEA_REPORT.md — used when the refine-logs files don't exist; falls back to ./IDEA_REPORT.md if not found

If none exist, ask the user what experiments to implement.
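The lookup order above can be sketched as a small resolver. This is only an illustration: the function names and the `INPUT_CANDIDATES` table are hypothetical, and the behaviour with COMPACT = false (prefer the full report) is an assumption, since the text only states the preference for COMPACT = true.

```python
from pathlib import Path
from typing import Optional

# Candidate paths per input, in lookup order (taken from the list above).
INPUT_CANDIDATES = {
    "plan": ["refine-logs/EXPERIMENT_PLAN.md"],
    "tracker": ["refine-logs/EXPERIMENT_TRACKER.md"],
    "proposal": ["refine-logs/FINAL_PROPOSAL.md"],
    "idea_candidates": ["idea-stage/IDEA_CANDIDATES.md", "IDEA_CANDIDATES.md"],
    "idea_report": ["idea-stage/IDEA_REPORT.md", "IDEA_REPORT.md"],
}


def first_existing(root: Path, candidates: list[str]) -> Optional[Path]:
    """Return the first candidate that exists under root, else None."""
    for rel in candidates:
        path = root / rel
        if path.is_file():
            return path
    return None


def resolve_inputs(root: Path, compact: bool = False) -> dict[str, Path]:
    """Map each input name to the file found for it (hypothetical helper).

    An empty dict means nothing was found and the user should be asked.
    """
    found = {}
    for name, candidates in INPUT_CANDIDATES.items():
        path = first_existing(root, candidates)
        if path is not None:
            found[name] = path
    # COMPACT = true prefers the compact summary over the full report.
    # Assumption: with COMPACT = false the full report is preferred instead.
    if compact:
        order = ["idea_candidates", "idea_report"]
    else:
        order = ["idea_report", "idea_candidates"]
    for name in order:
        if name in found:
            found["idea"] = found[name]
            break
    return found
```

Calling `resolve_inputs(Path("."))` from the project root returns whichever inputs are present; an empty result corresponds to the "ask the user" case.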

Workflow

Phase 1: Parse the Experiment Plan

Read EXPERIMENT_PLAN.md and extract:

1. Run order and milestones — which experiments run first (sanity → baseline → main → ablation → polish)
2. For each experiment block:
   - Dataset / split / task
   - Compared systems and variants
   - Metrics to compute
   - Setup details (backbone, hyperparameters, seeds)
   - Success criterion
   - Priority (MUST-RUN vs NICE-TO-HAVE)
3. Compute budget — total estimated GPU-hours
4. Method details from FINAL_PROPOSAL.md — what exactly to implement

Present a brief summary:

📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
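One possible in-memory shape for the extracted fields, with a function that renders the summary above, is sketched below. The `ExperimentBlock` class and `summarize` function are hypothetical names, and the per-block `gpu_hours` field is an assumption (the plan is only said to give a total estimate).

```python
from dataclasses import dataclass

# Milestone order as listed in Phase 1.
MILESTONE_ORDER = ["sanity", "baseline", "main", "ablation", "polish"]


@dataclass
class ExperimentBlock:
    """Hypothetical container for one experiment block from the plan."""
    name: str
    milestone: str            # one of MILESTONE_ORDER
    dataset: str              # dataset / split / task
    systems: list[str]        # compared systems and variants
    metrics: list[str]        # metrics to compute
    setup: dict[str, object]  # backbone, hyperparameters, seeds
    success_criterion: str
    must_run: bool            # True = MUST-RUN, False = NICE-TO-HAVE
    gpu_hours: float          # assumed per-block share of the compute budget


def summarize(blocks: list[ExperimentBlock]) -> str:
    """Render the brief plan summary shown above."""
    present = [m for m in MILESTONE_ORDER if any(b.milestone == m for b in blocks)]
    must_run = sum(1 for b in blocks if b.must_run)
    total_hours = sum(b.gpu_hours for b in blocks)
    lines = [
        "📋 Experiment plan loaded:",
        f"- Milestones: {len(present)} ({' → '.join(present)})",
        f"- Must-run experiments: {must_run}",
        f"- Nice-to-have: {len(blocks) - must_run}",
        f"- Estimated GPU-hours: {total_hours:g}",
    ]
    return "\n".join(lines)
```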

Phase 2: Implement Experiment Code

If BASE_REPO is set — clone the repo first:

git clone <BASE_REPO> base_repo/

For each milestone (in order), write the experiment scripts:

1. Check existing code — scan the project (or cloned base_repo/) for existing experiment scripts, model code, and data loaders. Reuse as much as possible.
2. Implement missing pieces:
   - Training scripts with proper argparse (all hyperparameters configurable)
   - Evaluation scripts computing the specified metrics
   - Data loading / preprocessing if needed
   - Baseline implementations if not already present
   - Fixed random seeds for reproducibility
   - Results saved to JSON/CSV for later analysis
   - Proper logging (wandb if configured in AGENTS.md)
3. Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
4. Self-review before deploying:
   - Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
   - Is the random seed fixed and controllable?
   - Are results saved in a parseable format (JSON/CSV)?
   - Does the code match FINAL_PROPOSAL.md's method description?
   - CRITICAL: does evaluation compare predictions against dataset ground truth, never another model's output?
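A minimal skeleton covering three of the items above (argparse-configurable hyperparameters, a controllable seed, JSON results) might look like the following. It is a sketch under stated assumptions: the flag names, the default output path, and the placeholder training loop are all hypothetical, and a real script would also seed whatever numeric libraries the project uses.

```python
import argparse
import json
import random
from pathlib import Path


def parse_args(argv=None):
    """Expose every hyperparameter as a flag (names here are illustrative)."""
    parser = argparse.ArgumentParser(description="Sanity-stage training sketch")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--out", type=Path, default=Path("results/sanity.json"))
    return parser.parse_args(argv)


def set_seed(seed):
    """Seed the stdlib RNG; seed numpy/torch here too if the project uses them."""
    random.seed(seed)


def train(args):
    """Placeholder loop standing in for real optimisation steps."""
    set_seed(args.seed)
    loss = 1.0
    for _ in range(args.epochs):
        loss *= 1.0 - args.lr * random.random()
    return {"final_loss": loss}


def main(argv=None):
    args = parse_args(argv)
    metrics = train(args)
    # Store the config next to the metrics so each result file is self-describing.
    record = {
        "config": {"lr": args.lr, "epochs": args.epochs, "seed": args.seed},
        "metrics": metrics,
    }
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps(record, indent=2))
    return record

# In a real script, call main() under an `if __name__ == "__main__":` guard.
```

Saved as, say, a hypothetical `train.py`, it would be invoked as `python train.py --seed 1 --out results/seed1.json`; running it twice with the same seed should produce identical result files.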

Phase 2.5: Cross-Model Code Review (when CODE_REVIEW = true)

Skip this step if CODE_REVIEW is false.

Before deploying, send the experiment code to a secondary Codex reviewer with xhigh reasoning:

spawn_agent:
  reasoning_effort: xhigh
  message: |
    Review the following experiment implementation for correctness.

    ## Experiment Plan
    [paste key sections from EXPERIMENT_PLAN.md]

    ## Method Description
    [paste from FINAL_PROPOSAL.md]

    ## Implementation
    [paste the experiment scripts or exact file paths plus relevant snippets]

    Check for:
    1. Does the code correctly implement the method described in the proposal?
    2. Are all hyperparameters from the plan reflected in the code?
    3. Are there logic bugs: wrong loss, wrong data split, missing eval, leakage, metric mismatch?
    4. Is the evaluation metric computed against ground truth, not another model's output?
    5. Are seeds, result paths, logging, and failure handling sufficient for reproducible experiments?

    Output:
    - BLOCKING issues that must be fixed before deployment
    - NON-BLOCKING issues that can wait
    - Suggested patches or checks

If BLOCKING issues are found, fix them and re-run this review once before Phase 3. Save the reviewer response and any fixes in refine-logs/EXPERIMENT_CODE_REVIEW.md. If reviewer delegation is unavailable, run the same checklist locally and mark the review [local-only].

Phase 3: Sanity Check (if SANITY_FIRST = true)

Before deploying the full experiment suite, run the sanity-stage experiment:

/run-experiment [sanity experiment command]

Wait for completion. Verify:

- Training loop runs without errors
- Metrics are computed and saved correctly
- GPU memory usage is within bounds
- Output format matches expectations

If sanity fails → fix the code and re-run. Do not proceed to full deployment with broken code.

If the same sanity failure repeats, trigger a second opinion: summarize the plan, code diff, command, logs, backend, and failure, then ask a fresh Codex reviewer agent for a rescue diagnosis. Apply only concrete fixes grounded in the logs.
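The "metrics are computed and saved correctly" check can be partly automated. The sketch below assumes the sanity run writes a JSON file with a top-level `metrics` object; that layout and the function name `check_sanity_output` are hypothetical, not something the plan prescribes.

```python
import json
import math
from pathlib import Path


def check_sanity_output(path: Path, required_metrics: list[str]) -> list[str]:
    """Return a list of problems found in a results file (empty list = OK)."""
    if not path.is_file():
        return [f"missing results file: {path}"]
    try:
        record = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"unparseable JSON: {exc}"]
    if not isinstance(record, dict) or not isinstance(record.get("metrics"), dict):
        return ["results file has no 'metrics' object"]

    problems = []
    metrics = record["metrics"]
    for name in required_metrics:
        value = metrics.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"metric '{name}' is missing or non-numeric")
        elif not math.isfinite(value):
            problems.append(f"metric '{name}' is not finite: {value}")
    return problems
```

A non-empty return value is a reason to stop and fix the code before Phase 4; the same list can be pasted into the rescue summary if the failure repeats.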

Phase 4: Deploy Full Experiments

Deploy experiments following the plan's milestone order. Route by job count and dependencies:

/run-experiment [experiment commands]

For large batches (≥10 jobs), multi-seed sweeps, or teacher→student phase dependencies, use the queue scheduler:

/experiment-queue [grid spec or manifest]

Auto-routing rule: if any milestone in EXPERIMENT_PLAN.md declares ≥10 jobs or declares phase dependencies, route that milestone to /experiment-queue; otherwise use /run-experiment. /experiment-queue adds OOM-aware retry with backoff, stale-screen cleanup, wave-transition race prevention, phase dependency enforcement, and crash-safe state persistence in queue_state.json.
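The auto-routing rule reduces to a single predicate per milestone. In this sketch the `Milestone` class and its field names are hypothetical; only the threshold of 10 jobs and the two command names come from the rule above.

```python
from dataclasses import dataclass

QUEUE_JOB_THRESHOLD = 10  # "≥10 jobs" from the auto-routing rule


@dataclass
class Milestone:
    """Hypothetical summary of one milestone from EXPERIMENT_PLAN.md."""
    name: str
    num_jobs: int
    has_phase_dependencies: bool = False


def route(milestone: Milestone) -> str:
    """Pick the deployment command for a milestone."""
    if milestone.num_jobs >= QUEUE_JOB_THRESHOLD or milestone.has_phase_dependencies:
        return "/experiment-queue"
    return "/run-experiment"
```

For example, a 3-job baseline milestone routes to `/run-experiment`, while a 2-job teacher→student milestone routes to `/experiment-queue` because of its phase dependency.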

For each milestone:

- Deploy experiments in parallel (up to MAX_PARALLEL_RUNS for /run-experiment, or max_parallel from the queue manifest for /experiment-queue)
- Use /monitor-experiment to track progress; if /experiment-queue is active, monitor queue_state.json
- Collect results as experiments complete
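The parallelism cap and "collect as they complete" behaviour for the /run-experiment path can be illustrated with a bounded thread pool. This is a sketch only: `run_milestone` and the `launch` callable are hypothetical stand-ins for whatever actually starts and waits on an experiment.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

MAX_PARALLEL_RUNS = 4  # default from the Constants section


def run_milestone(
    commands: list[str],
    launch: Callable[[str], object],
    max_parallel: int = MAX_PARALLEL_RUNS,
) -> dict[str, object]:
    """Run launch(cmd) for each command, at most max_parallel at a time.

    Results are gathered in completion order. An exception raised by
    launch propagates to the caller when its result is collected.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(launch, cmd): cmd for cmd in commands}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Milestones are still processed one after another; only the experiments inside a milestone share the pool, so a later milestone never starts before the earlier one has finished.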

Backend lifecycle rules:

Vast.ai: record instance id,
