Experiment Queue

Orchestrate large batches of ML experiments on remote GPU servers over SSH, with state tracking, OOM retry, stale-screen cleanup, and wave transitions.

When to Use This Skill

Use when /run-experiment is insufficient:

≥10 jobs that need batching across GPUs
Multi-seed sweeps (e.g., 21 seeds × 12 cells)
Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
Teacher+student chains (train the teacher, then distill; the student launches automatically once the teacher finishes)
OOM-prone configs that need a retry on a different GPU or after a delay
Mixed seed grids where failed cells need re-running

Do NOT use for:

Single ad-hoc experiment (use /run-experiment)
Modal/Vast.ai deployments (those have their own orchestration)
Experiments that need manual inspection between runs

Why This Exists

Based on a session audit (2026-04-16), the main wall-clock sinks in multi-seed grid experiments are:

Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
OOM on shared GPU — previous job's memory not yet released
Wave race — new wave launches before previous wave fully settles
Missing checkpoints — student launches before teacher saved
Parser duplication — rewriting multi-seed analysis python every batch

All of these are engineering friction that a scheduler can handle automatically.

Core Concepts

Job Manifest

A manifest lists jobs with explicit state:

project: my_grid_experiment
cwd: /home/user/your_project
conda: my_env
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: gpu-server
default_cmd: >
  python run_distill.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4

preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/teacher_L96_K500_N{N}.pt

gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3

jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
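
How default_cmd and a job's args combine is not spelled out above. A minimal sketch, assuming each entry in args is appended to default_cmd as a `--key value` flag (the helper name render_command is hypothetical, not part of queue_manager.py):

```python
import shlex


def render_command(default_cmd: str, args: dict) -> str:
    """Append each job arg as `--key value` to the shared default command.

    Assumption: this is how per-job args are merged; check queue_manager.py
    for the actual behavior.
    """
    parts = shlex.split(default_cmd)
    for key, value in args.items():
        parts += [f"--{key}", str(value)]
    # shlex.join quotes any token that needs it, so the result is shell-safe.
    return shlex.join(parts)


default_cmd = "python run_distill.py --backbone softmax --lam 0.5"
job = {"id": "s200_N64_n50K", "args": {"seed": 200, "n_hidden": 64}}
print(render_command(default_cmd, job["args"]))
```

Under this assumption the shared hyperparameters live in one place and each job only records what differs.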

Job State Machine

pending → running → completed
                 ↘ failed_oom → pending (after delay) [retry up to N]
                 ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
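
The diagram above can be written as a small transition table. This is a sketch, not the code in queue_manager.py; in particular, moving a job to stuck once OOM retries are exhausted is an assumption, since the diagram only says "retry up to N". The stale-screen cleanup path is omitted for brevity.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED_OOM = "failed_oom"
    FAILED_OTHER = "failed_other"
    STUCK = "stuck"


# Legal moves, read off the state diagram.
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED_OOM, Status.FAILED_OTHER},
    Status.FAILED_OOM: {Status.PENDING, Status.STUCK},  # STUCK is an assumption
    Status.FAILED_OTHER: {Status.STUCK},
    Status.COMPLETED: set(),
    Status.STUCK: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


def after_oom(attempts: int, max_attempts: int) -> Status:
    """Requeue an OOM'd job until its attempts are used up (assumed policy)."""
    return Status.PENDING if attempts < max_attempts else Status.STUCK
```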

Wave Orchestration

A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:

All current-wave python processes have exited
No stale screens remain for current-wave tags
GPU memory in use has dropped to the threshold or below (gpu_free_threshold_mib, default 500 MiB)
Precondition checks pass for next-wave jobs
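
The four conditions can be treated as one gate that reports why a wave is still blocked. A minimal sketch; the WaveSnapshot fields are hypothetical names for data the scheduler would collect, not the real state schema:

```python
from dataclasses import dataclass, field


@dataclass
class WaveSnapshot:
    """Hypothetical point-in-time view of the current wave."""
    live_python_pids: list = field(default_factory=list)
    stale_screens: list = field(default_factory=list)
    gpu_used_mib: dict = field(default_factory=dict)  # GPU index -> MiB in use
    blocked_next_jobs: list = field(default_factory=list)


def blocking_reasons(snap: WaveSnapshot, threshold_mib: int = 500) -> list:
    """Return the reasons the next wave cannot start; empty means go."""
    reasons = []
    if snap.live_python_pids:
        reasons.append(f"python still running: {snap.live_python_pids}")
    if snap.stale_screens:
        reasons.append(f"stale screens remain: {snap.stale_screens}")
    busy = sorted(g for g, used in snap.gpu_used_mib.items() if used > threshold_mib)
    if busy:
        reasons.append(f"GPU memory above {threshold_mib} MiB on GPUs {busy}")
    if snap.blocked_next_jobs:
        reasons.append(f"preconditions failing for: {snap.blocked_next_jobs}")
    return reasons
```

Returning reasons rather than a bare boolean makes a stalled queue easier to diagnose from the log.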

Workflow

Step 1: Parse Manifest / Build from Grid

Input can be:

YAML manifest (explicit job list, recommended for complex cases)
Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
Natural language description (Claude parses into manifest)
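
A grid spec expands to an explicit job list via a Cartesian product. The sketch below is illustrative only: build_jobs and its ID scheme are hypothetical, and the bundled build_manifest.py may expand grids differently.

```python
from itertools import product


def build_jobs(grid: dict, seeds: list) -> list:
    """Expand {param: [values]} x seeds into manifest-style job entries."""
    keys = list(grid)
    jobs = []
    for seed in seeds:
        for values in product(*(grid[k] for k in keys)):
            cell = dict(zip(keys, values))
            # Hypothetical ID scheme: seed first, then each param and value.
            job_id = "_".join([f"s{seed}"] + [f"{k}{v}" for k, v in cell.items()])
            jobs.append({"id": job_id, "args": {"seed": seed, **cell}})
    return jobs


grid = {"n_hidden": [64, 128, 256], "n_train_subset": [50000, 150000, 500000]}
jobs = build_jobs(grid, seeds=[200, 201])
print(len(jobs))      # 3 x 3 cells x 2 seeds = 18
print(jobs[0]["id"])  # s200_n_hidden64_n_train_subset50000
```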

Bind run identifiers once so every later step refers to the same paths:

# Export PROJECT_DIR first; the :? expansion aborts with this message if it is unset or empty.
PROJECT_DIR="${PROJECT_DIR:?set PROJECT_DIR to the local project root}"
RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
LOCAL_RUN_DIR="$PROJECT_DIR/experiment_queue/$RUN_TS"
mkdir -p "$LOCAL_RUN_DIR"

Save the built manifest to $LOCAL_RUN_DIR/manifest.json for reproducibility.

Step 2: Pre-flight

Check SSH connection works
Check conda env exists on remote
Check cwd exists on remote
Check all preconditions (checkpoints, input files)
Check GPU availability (at least max_parallel free GPUs)

If any precondition fails, show the user which jobs are blocked and why.
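
For the GPU availability check, one option is to parse the CSV output of `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`, run over SSH. A sketch of the parsing step only; free_gpus is a hypothetical helper and may not match how queue_manager.py probes GPUs:

```python
def free_gpus(csv_text: str, threshold_mib: int = 500) -> list:
    """Return indices of GPUs whose used memory is at or below the threshold.

    Expects one "index, memory_used_mib" line per GPU, as printed by
    nvidia-smi with --format=csv,noheader,nounits.
    """
    free = []
    for line in csv_text.strip().splitlines():
        index, used = (part.strip() for part in line.split(","))
        if int(used) <= threshold_mib:
            free.append(int(index))
    return free


sample = "0, 12\n1, 40213\n2, 500\n3, 7\n"
available = free_gpus(sample)
max_parallel = 4
if len(available) < max_parallel:
    print(f"only {len(available)} of {max_parallel} GPUs free: {available}")
```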

Step 3: Launch Scheduler

Resolve the bundled helper directory ($PROJECT_DIR / $RUN_TS / $LOCAL_RUN_DIR already set in Step 1). Phase 3.3 (Arch C) moved the canonical scripts to skills/experiment-queue/scripts/; tools/experiment_queue/ retains os.execv shims for legacy resolver layers:

if [ -z "${ARIS_REPO:-}" ] && [ -f .aris/installed-skills-codex.txt ]; then
    ARIS_REPO=$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null) || true
fi
[ -n "${ARIS_REPO:-}" ] || { echo "ERROR: ARIS_REPO not set. Use install_aris_codex.sh managed install or export ARIS_REPO=/path/to/ARIS."; exit 1; }
# Prefer the new canonical location; fall back to legacy tools/ shim path.
QUEUE_TOOLS="$ARIS_REPO/skills/experiment-queue/scripts"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || QUEUE_TOOLS="$ARIS_REPO/tools/experiment_queue"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || { echo "ERROR: queue_manager.py not found at $ARIS_REPO/skills/experiment-queue/scripts/ or $ARIS_REPO/tools/experiment_queue/"; exit 1; }

Compute remote paths (note: modern scp runs in SFTP mode and does NOT reliably expand $HOME in destination paths — use remote-relative for scp, $HOME-prefixed for ssh command strings):

REMOTE_RUN_REL=".aris_queue/runs/$RUN_TS"
REMOTE_RUN_DIR="\$HOME/$REMOTE_RUN_REL"

Bootstrap remote run dir + copy helpers + copy manifest. Per-invocation, idempotent:

ssh <server> "mkdir -p \"$REMOTE_RUN_DIR/logs\" \"\$HOME/.aris_queue\""
scp "$QUEUE_TOOLS/queue_manager.py" "$QUEUE_TOOLS/build_manifest.py" <server>:.aris_queue/
scp "$LOCAL_RUN_DIR/manifest.json" <server>:"$REMOTE_RUN_REL/manifest.json"

Launch the scheduler as a detached nohup process:

ssh <server> "nohup python3 \"\$HOME/.aris_queue/queue_manager.py\" \\
  --manifest \"$REMOTE_RUN_DIR/manifest.json\" \\
  --state    \"$REMOTE_RUN_DIR/queue_state.json\" \\
  --log-dir  \"$REMOTE_RUN_DIR/logs\" \\
  > \"$REMOTE_RUN_DIR/queue_mgr.log\" 2>&1 &"

Notes: --log-dir is what queue_manager.py actually consumes (per-job log files for OOM detection). Do NOT pass --log <path> — that flag is declared but unused.

Persist run identifiers for monitoring + resume (sourceable later):

{
  printf 'PROJECT_DIR=%q\n'    "$PROJECT_DIR"
  printf 'RUN_TS=%q\n'         "$RUN_TS"
  printf 'LOCAL_RUN_DIR=%q\n'  "$LOCAL_RUN_DIR"
  printf 'REMOTE_RUN_REL=%q\n' "$REMOTE_RUN_REL"
  printf 'REMOTE_RUN_DIR=%q\n' "$REMOTE_RUN_DIR"
} > "$LOCAL_RUN_DIR/run_meta.txt"

%q shell-escapes values; REMOTE_RUN_DIR keeps a literal $HOME (correct for later reuse inside ssh "...").

Resume an existing queue. Do NOT regenerate RUN_TS. Reload from run_meta.txt and re-run only the launch command above (not the bootstrap):

LOCAL_RUN_DIR="/abs/path/to/project/experiment_queue/<existing-run-ts>"
. "$LOCAL_RUN_DIR/run_meta.txt"
# Then re-run the launch command verbatim; do NOT re-run mkdir/scp.

The scheduler:

Reads manifest
Loops: for each pending job, assign to free GPU, launch via screen
Polls job status (every 60s)
Detects stale screens (python has exited but the screen session is still alive → kill it)
Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
Detects completion (expected output JSON/file exists) → mark completed
Launches next wave when current wave settles
Writes state to queue_state.json continuously
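
The completion and OOM checks amount to classifying each finished job from its log and its expected output. A sketch under stated assumptions: the patterns below are common PyTorch/CUDA out-of-memory messages, and the exact strings queue_manager.py looks for may differ.

```python
import re

# Assumed OOM signatures; adjust to match what your training code prints.
OOM_PATTERN = re.compile(
    r"CUDA out of memory|OutOfMemoryError|CUDA error: out of memory",
    re.IGNORECASE,
)


def classify_finished_job(log_text: str, output_exists: bool) -> str:
    """Map a job whose python process has exited to its next status."""
    if output_exists:
        return "completed"
    if OOM_PATTERN.search(log_text):
        return "failed_oom"    # eligible for retry after oom_retry.delay
    return "failed_other"      # goes to stuck; needs manual inspection


log = "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"
print(classify_finished_job(log, output_exists=False))  # failed_oom
```

Checking the output file first means a run that logged a transient OOM warning but still produced results is not retried.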

Step 4: Monitoring

The user can check state at any time, using $REMOTE_RUN_DIR from Step 3 (or by reloading it from $LOCAL_RUN_DIR/run_meta.txt):

ssh <server> "cat \"$REMOTE_RUN_DIR/queue_state.json\"" \
  | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
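
Where jq is not installed, the same per-status count can be computed in Python. This sketch assumes only what the jq filter already does: a top-level jobs array whose entries each carry a status field.

```python
import json
from collections import Counter


def status_counts(state_json: str) -> dict:
    """Count jobs per status, mirroring the jq group_by filter."""
    state = json.loads(state_json)
    return dict(Counter(job["status"] for job in state["jobs"]))


example = json.dumps({"jobs": [
    {"id": "a", "status": "completed"},
    {"id": "b", "status": "running"},
    {"id": "c", "status": "completed"},
]})
print(status_counts(example))  # {'completed': 2, 'running': 1}
```

Feed it the output of the ssh cat command above, for example by piping into a short script that reads sys.stdin.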

Note: /monitor-experiment is currently focused on screen sessions, result JSONs, and W&B; it does not yet read queue_state.json directly.
