Run Experiment
Deploy and run ML experiment: $ARGUMENTS
Workflow
Step 1: Detect Environment
Read the project's CLAUDE.md to determine the experiment environment:
- Local GPU (
gpu: local): Look for local CUDA/MPS setup info - Remote server (
gpu: remote): Look for SSH alias, conda env, code directory - Vast.ai (
gpu: vast): Check forvast-instances.jsonat project root — if a running instance exists, use it. Also checkCLAUDE.mdfor a## Vast.aisection. - Modal (
gpu: modal): Serverless GPU via Modal. No SSH, no Docker, auto scale-to-zero. Delegate to/serverless-modal.
Modal detection: If CLAUDE.md has gpu: modal or a ## Modal section, the entire deployment is handled by /serverless-modal. Jump to Step 4: Deploy (Modal) — Steps 2-3 are not needed (Modal handles code sync and GPU allocation automatically).
Vast.ai detection priority:
- If
CLAUDE.mdhasgpu: vastor a## Vast.aisection:- If
vast-instances.jsonexists and has a running instance → use that instance - If no running instance → call
/vast-gpu provisionwhich analyzes the task, presents cost-optimized GPU options, and rents the user's choice
- If
- If no server info is found in
CLAUDE.md, ask the user.
Step 2: Pre-flight Check
Check GPU availability on the target machine:
Remote (SSH):
ssh <server> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
Remote (Vast.ai):
ssh -p <PORT> root@<HOST> nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
(Read ssh_host and ssh_port from vast-instances.json, or run vastai ssh-url <INSTANCE_ID> which returns ssh://root@HOST:PORT)
Local:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
# or for Mac MPS:
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
Free GPU = memory.used < 500 MiB.
Step 3: Sync Code (Remote Only)
Check the project's CLAUDE.md for a code_sync setting. If not specified, default to rsync.
Option A: rsync (default)
Only sync necessary files — NOT data, checkpoints, or large files:
rsync -avz --include='*.py' --exclude='*' <local_src>/ <server>:<remote_dst>/
Option B: git (when code_sync: git is set in CLAUDE.md)
Push local changes to remote repo, then pull on the server:
# 1. Push from local
git add -A && git commit -m "sync: experiment deployment" && git push
# 2. Pull on server
ssh <server> "cd <remote_dst> && git pull"
Benefits: version-tracked, multi-server sync with one push, no rsync include/exclude rules needed.
Option C: Vast.ai instance
Sync code to the vast.ai instance (always rsync, code dir is /workspace/project/):
rsync -avz -e "ssh -p <PORT>" \
--include='*.py' --include='*.yaml' --include='*.yml' --include='*.json' \
--include='*.txt' --include='*.sh' --include='*/' \
--exclude='*.pt' --exclude='*.pth' --exclude='*.ckpt' \
--exclude='__pycache__' --exclude='.git' --exclude='data/' \
--exclude='wandb/' --exclude='outputs/' \
./ root@<HOST>:/workspace/project/
If requirements.txt exists, install dependencies:
scp -P <PORT> requirements.txt root@<HOST>:/workspace/
ssh -p <PORT> root@<HOST> "pip install -q -r /workspace/requirements.txt"
Step 3.5: W&B Integration (when wandb: true in CLAUDE.md)
Skip this step entirely if wandb is not set or is false in CLAUDE.md.
Before deploying, ensure the experiment scripts have W&B logging:
-
Check if wandb is already in the script — look for
import wandborwandb.init. If present, skip to Step 4. -
If not present, add W&B logging to the training script:
import wandb wandb.init(project=WANDB_PROJECT, name=EXP_NAME, config={...hyperparams...}) # Inside training loop: wandb.log({"train/loss": loss, "train/lr": lr, "step": step}) # After eval: wandb.log({"eval/loss": eval_loss, "eval/ppl": ppl, "eval/accuracy": acc}) # At end: wandb.finish() -
Metrics to log (add whichever apply to the experiment):
train/loss— training loss per steptrain/lr— learning rateeval/loss,eval/ppl,eval/accuracy— eval metrics per epochgpu/memory_used— GPU memory (viatorch.cuda.max_memory_allocated())speed/samples_per_sec— throughput- Any custom metrics the experiment already computes
-
Verify wandb login on the target machine:
ssh <server> "wandb status" # should show logged in # If not logged in: ssh <server> "wandb login <WANDB_API_KEY>"
The W&B project name and API key come from
CLAUDE.md(see example below). The experiment name is auto-generated from the script name + timestamp.
Step 4: Deploy
Remote (via SSH + screen)
For each experiment, create a dedicated screen session with GPU binding:
ssh <server> "screen -dmS <exp_name> bash -c '\
eval \"\$(<conda_path>/conda shell.bash hook)\" && \
conda activate <env> && \
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>'"
Vast.ai instance
No conda needed — the Docker image has the environment. Use /workspace/project/ as working dir:
ssh -p <PORT> root@<HOST> "screen -dmS <exp_name> bash -c '\
cd /workspace/project && \
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee /workspace/<log_file>'"
After launching, update the experiment field in vast-instances.json for this instance.
Modal (serverless)
When gpu: modal is detected, delegate to /serverless-modal:
- Analyze task — determine VRAM needs, choose GPU, estimate cost
- Generate launcher — create a
modal_launcher.pythat wraps the training script usingmodal.Mount.from_local_dirfor code andmodal.Volumefor results - Run —
modal run modal_launcher.py(runs locally, GPU executes remotely) - Collect results — results return via Volume or stdout, no manual download needed
Key Modal settings from CLAUDE.md:
modal_gpu: GPU override (default: auto-select based on VRAM analysis)modal_timeout: Max seconds (default: 21600 = 6 hours)modal_volume: Named volume for persistent results
No SSH, no code sync, no screen sessions needed. Modal handles everything.
Local
# Linux with CUDA
CUDA_VISIBLE_DEVICES=<gpu_id> python <script> <args> 2>&1 | tee <log_file>
# Mac with MPS (PyTorch uses MPS automatically)
python <script> <args> 2>&1 | tee <log_file>
For local long-running jobs, use run_in_background: true to keep the conversation responsive.
Step 5: Verify Launch
Remote (SSH):
ssh <server> "screen -ls"
Remote (Vast.ai):
ssh -p <PORT> root@<HOST> "screen -ls"
Modal:
modal app list # Check app is running
modal app logs <app> # Stream logs
Local: Check process is running and GPU is allocated.
Step 6: Feishu Notification (if configured)
After deployment is verified, check ~/.claude/feishu.json:
- Send
experiment_donenotification: which experiments launched, which GPUs, estimated time - If config absent or mode
"off": skip entirely (no-op)
Step 7: Auto-Destroy Vast.ai Instance (when gpu: vast and auto_destroy: true)
Skip this step if not using vast.ai or auto_destroy is false.
After the experiment completes (detected via /monitor-experiment or screen session ending):
-
Download results from the instance:
rsync -avz -e "ssh -p <PORT>" root@<HOST>:/workspace/project/results/ ./results/ -
Download logs:
scp -P <PORT> root@<HOST>:/workspace/*.log ./logs/ -
Destroy the instance to stop billing:
vastai destroy instance <INSTANCE_ID> -
Update
vast-instances.json— mark status asdestroyed. -
Report cost:
Vast.ai instance <ID