Experiment Queue
⏱ External cadence: visibility only. This skill already runs its own detached server-side scheduler (60s poll + depends_on + wave transitions). Use its status output for overnight visibility (N done / N running / N pending); do not wrap it in a second /loop or CronCreate poll. A second poller duplicates the scheduler on an uncoordinated clock and reintroduces the wave-transition races the scheduler was built to prevent. See shared-references/external-cadence.md ("don't duplicate an existing scheduler").
Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.
When to Use This Skill
Use when /run-experiment is insufficient:
- ≥10 jobs that need batching across GPUs
- Multi-seed sweeps (e.g., 21 seeds × 12 cells)
- Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
- Teacher+student chains (train the teacher, then distill; the student is auto-triggered once the teacher finishes)
- OOM-prone configs where you need to retry on a different GPU or wait
- Mixed seed grids where failed cells need re-running
Do NOT use for:
- Single ad-hoc experiment (use /run-experiment)
- Modal/Vast.ai deployments (those have their own orchestration)
- Experiments that need manual inspection between runs
Why This Exists
Based on a session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
- Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
- OOM on shared GPU — previous job's memory not yet released
- Wave race — new wave launches before previous wave fully settles
- Missing checkpoints — student launches before teacher saved
- Parser duplication — rewriting multi-seed analysis python every batch
All of these are pure engineering friction, and all of them can be handled by an orchestrator instead of by hand.
Core Concepts
Job Manifest
A manifest lists jobs with explicit state:
project: my_grid_experiment
cwd: /home/user/your_project
conda: my_env
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: gpu-server
default_cmd: >
  python run_distill.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/teacher_L96_K500_N{N}.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500 # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
Job State Machine
pending → running → completed
                  ↘ failed_oom → pending (after delay) [retry up to N]
                  ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
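The transitions above can be sketched as a small pure function. This is an illustrative sketch only, not the logic inside queue_manager.py: the `Job` class, `next_state`, and the event names (`launched`, `exit_ok`, `exit_oom`, `exit_error`, `delay_elapsed`) are hypothetical, and `max_attempts` mirrors the manifest's `oom_retry.max_attempts`. It also assumes that a job whose OOM retries are exhausted is parked as `stuck`, which the diagram does not state. The stale-screen path is left out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: str
    state: str = "pending"
    oom_attempts: int = 0


def next_state(job: Job, event: str, max_attempts: int = 3) -> str:
    """Apply one (hypothetical) event to a job and return its new state."""
    if job.state == "pending" and event == "launched":
        job.state = "running"
    elif job.state == "running" and event == "exit_ok":
        job.state = "completed"
    elif job.state == "running" and event == "exit_oom":
        job.oom_attempts += 1
        # Assumption: after max_attempts OOM failures the job stops retrying.
        job.state = "failed_oom" if job.oom_attempts < max_attempts else "stuck"
    elif job.state == "running" and event == "exit_error":
        job.state = "stuck"  # failed_other: needs manual inspection
    elif job.state == "failed_oom" and event == "delay_elapsed":
        job.state = "pending"  # back in the queue after the retry delay
    return job.state


# Usage: one OOM failure followed by a successful retry.
job = Job("s200_N64_n50K")
for event in ["launched", "exit_oom", "delay_elapsed", "launched", "exit_ok"]:
    next_state(job, event)
```

Keeping the transition logic free of SSH and screen calls makes it easy to unit-test separately from the remote polling code.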
Wave Orchestration
A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
- All current-wave python processes have exited
- No stale screens remain for current-wave tags
- GPU memory in use has dropped to the free threshold or below (gpu_free_threshold_mib, default 500 MiB)
- Precondition checks pass for next-wave jobs
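The four gating conditions above can be expressed as one check that returns the reasons a wave is blocked. This is a minimal sketch under stated assumptions: `WaveSnapshot` and `wave_blockers` are hypothetical names, and the snapshot fields stand in for whatever the scheduler gathers from the remote host on each poll.

```python
from dataclasses import dataclass, field


@dataclass
class WaveSnapshot:
    """Hypothetical snapshot of remote state collected by one poll."""
    live_python_pids: list = field(default_factory=list)
    stale_screens: list = field(default_factory=list)
    gpu_used_mib: dict = field(default_factory=dict)  # GPU index -> MiB in use
    failed_preconditions: list = field(default_factory=list)


def wave_blockers(snap, gpus, threshold_mib=500):
    """Return human-readable reasons the next wave cannot start (empty means go)."""
    reasons = []
    if snap.live_python_pids:
        reasons.append(f"{len(snap.live_python_pids)} python process(es) still running")
    if snap.stale_screens:
        reasons.append("stale screens: " + ", ".join(snap.stale_screens))
    busy = [g for g in gpus if snap.gpu_used_mib.get(g, 0) > threshold_mib]
    if busy:
        reasons.append(f"GPUs above {threshold_mib} MiB: {busy}")
    if snap.failed_preconditions:
        reasons.append("preconditions failed: " + ", ".join(snap.failed_preconditions))
    return reasons


# Usage: GPU 1 still holds memory from the previous wave, so the wave waits.
snapshot = WaveSnapshot(gpu_used_mib={0: 12, 1: 9000})
blockers = wave_blockers(snapshot, gpus=[0, 1])
```

Returning reasons rather than a bare boolean lets the status output say why a wave is waiting.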
Workflow
Step 1: Parse Manifest / Build from Grid
Input can be:
- YAML manifest (explicit job list, recommended for complex cases)
- Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
- Natural language description (Claude parses into manifest)
Bind the run identifiers once so every later step (manifest save, scp, launch, monitor, resume) refers to the same paths. Set these as local shell variables before generating the manifest:
# REPLACE the placeholder path before running, or pre-export PROJECT_DIR:
PROJECT_DIR="${PROJECT_DIR:?set PROJECT_DIR to the local project root}"
RUN_TS=$(date -u +%Y%m%dT%H%M%SZ) # one timestamp per run, reused everywhere
LOCAL_RUN_DIR="$PROJECT_DIR/experiment_queue/$RUN_TS"
mkdir -p "$LOCAL_RUN_DIR"
Save the built manifest to $LOCAL_RUN_DIR/manifest.json for reproducibility.
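For the grid-spec input, the expansion into explicit jobs is a Cartesian product. The sketch below is illustrative and is not the interface of build_manifest.py, which is not shown in this document: `expand_grid` and its id scheme are hypothetical, and only the `id`/`args` job shape and the top-level keys are taken from the manifest example.

```python
import itertools
import json


def expand_grid(grid, fixed=None):
    """Expand {param: [values]} into a list of {"id", "args"} job dicts."""
    fixed = fixed or {}
    names = list(grid)
    jobs = []
    for combo in itertools.product(*(grid[name] for name in names)):
        args = dict(zip(names, combo), **fixed)
        # Hypothetical id scheme: concatenate name+value so ids stay unique.
        job_id = "_".join(f"{name}{value}" for name, value in zip(names, combo))
        jobs.append({"id": job_id, "args": args})
    return jobs


jobs = expand_grid(
    {"seed": [200, 201], "n_hidden": [64, 128, 256]},
    fixed={"subset_seed": 2024},
)
manifest = {"project": "my_grid_experiment", "max_parallel": 8, "jobs": jobs}
# Serialize once; this string is what would be written to $LOCAL_RUN_DIR/manifest.json.
manifest_json = json.dumps(manifest, indent=2)
```

Two seeds times three hidden sizes gives six jobs, each carrying the shared `subset_seed`.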
Step 2: Pre-flight
- Check SSH connection works
- Check conda env exists on remote
- Check cwd exists on remote
- Check all preconditions (checkpoints, input files)
- Check GPU availability (at least max_parallel free GPUs)
If any precondition fails, show user which jobs are blocked and why.
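The GPU availability check can be sketched as a parser over `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` output, collected over SSH. This is an assumption about how the check could be done, not a description of queue_manager.py: `free_gpus` and `preflight_gpu_check` are hypothetical names, and "free" is taken to mean used memory at or below `gpu_free_threshold_mib`.

```python
def free_gpus(nvidia_smi_csv, candidates, threshold_mib=500):
    """Return candidate GPU indices whose used memory is at or below the threshold."""
    used = {}
    for line in nvidia_smi_csv.strip().splitlines():
        index, mem = (part.strip() for part in line.split(","))
        used[int(index)] = int(mem)
    # GPUs missing from the output are treated as unavailable.
    return [g for g in candidates if g in used and used[g] <= threshold_mib]


def preflight_gpu_check(nvidia_smi_csv, candidates, max_parallel, threshold_mib=500):
    """Return (ok, free): ok is True when at least max_parallel GPUs are free."""
    free = free_gpus(nvidia_smi_csv, candidates, threshold_mib)
    return len(free) >= max_parallel, free


# Usage with a captured sample: GPU 1 is occupied by another user's job.
sample = "0, 12\n1, 40213\n2, 480\n3, 7\n"
ok, free = preflight_gpu_check(sample, candidates=[0, 1, 2, 3], max_parallel=3)
```

With this sample, three of the four GPUs are under the default 500 MiB threshold, so a manifest with `max_parallel: 3` passes while `max_parallel: 4` would be reported as blocked.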
Step 3: Launch Scheduler
The canonical scheduler implementation lives in skills/experiment-queue/scripts/queue_manager.py (Phase 3.3 move, Arch C). tools/experiment_queue/queue_manager.py is now a Python os.execv shim retained for legacy resolver-chain compatibility. Three preliminaries before launch.
3a. Resolve the local helper directory. The two helpers (queue_manager.py, build_manifest.py) now sit under skills/experiment-queue/scripts/ in the ARIS repo, with shims at tools/experiment_queue/ for legacy resolver layers. Use this hybrid chain so the skill works from any project layout:
# Layer 0: self-contained (CC 1.0+ exposes $CLAUDE_SKILL_DIR).
QUEUE_TOOLS=""
if [ -n "${CLAUDE_SKILL_DIR:-}" ] && [ -f "$CLAUDE_SKILL_DIR/scripts/queue_manager.py" ]; then
  QUEUE_TOOLS="$CLAUDE_SKILL_DIR/scripts"
fi
# Layers 1-3: legacy chain via tools/experiment_queue/ shims.
if [ -z "$QUEUE_TOOLS" ]; then
  cd "$(git rev-parse --show-toplevel 2>/dev/null || pwd)" || exit 1
  if [ -z "${ARIS_REPO:-}" ] && [ -f .aris/installed-skills.txt ]; then
    ARIS_REPO=$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills.txt 2>/dev/null) || true
  fi
  QUEUE_TOOLS=".aris/tools/experiment_queue"
  [ -f "$QUEUE_TOOLS/queue_manager.py" ] || QUEUE_TOOLS="tools/experiment_queue"
  [ -f "$QUEUE_TOOLS/queue_manager.py" ] || { [ -n "${ARIS_REPO:-}" ] && QUEUE_TOOLS="$ARIS_REPO/tools/experiment_queue"; }
  [ -f "$QUEUE_TOOLS/queue_manager.py" ] || QUEUE_TOOLS=""
fi
[ -z "$QUEUE_TOOLS" ] && { echo "ERROR: experiment_queue helpers not found (layer 0: \$CLAUDE_SKILL_DIR/scripts/; layers 1-3: .aris/tools/, tools/, \$ARIS_REPO/tools/). Rerun install_aris.sh, set ARIS_REPO, or copy the canonical scripts from \$ARIS_REPO/skills/experiment-queue/scripts/." >&2; exit 1; }
The .aris/tools symlink is set up by install_aris.sh (#174). Older installs without that symlink fall through to tools/experiment_queue (works if invoked from inside the ARIS repo) or $ARIS_REPO/tools/experiment_queue. After Phase 3.3, each of those legacy paths contains a Python os.execv shim that forwards to the canonical skills/experiment-queue/scripts/ location, so existing users do not need to re-run anything.
3b. Compute remote paths. Use both a remote-relative form (for scp destinations — modern scp runs in SFTP mode and does NOT reliably expand $HOME in destination paths) and a $HOME-prefixed form (for ssh ... command strings, where remote bash WILL expand $HOME):
REMOTE_RUN_REL=".aris_queue/runs/$RUN_TS" # for scp destinations (relative to remote home)
REMOTE_RUN_DIR="\$HOME/$REMOTE_RUN_REL" # for ssh command strings (literal $HOME, expanded on remote)
3c. Bootstrap the remote run directory and copy helpers + manifest. Per-invocation and idempotent. Use a unique run directory rather than /tmp so concurrent queues do not collide and so resume-after-crash is reproducible.
ssh <server> "mkdir -p \"$REMOTE_RUN_DIR/logs\" \"\$HOME/.aris_queue\""
scp "$QUEUE_TOOLS/queue_manager.py" "$QUEUE_TOOLS/build_man