Autoresearch ML: Autonomous LLM Training Optimization
An autonomous experiment loop for single-GPU LLM pretraining. Edit train.py → commit → run 5-minute training → measure val_bpb → keep improvement or revert → repeat forever.
This skill is self-contained — it includes everything needed to set up and run the loop.
Setup Phase
1. Copy Template Assets
Copy the bundled training template to the project directory:
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
2. Install and Prepare
uv sync # Install dependencies
uv run prepare.py # Download data shards, train tokenizer (~2 min)
3. Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
4. Initialize the Experiment Session
- Create a branch:
  git checkout -b autoresearch/<tag>-<date>
- Ensure session files are gitignored (critical: git revert will fail if they are tracked):
  echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
  git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
- Read prepare.py and train.py thoroughly to understand the codebase
- Write autoresearch.md, a living session document recording goal, metrics, files in scope, constraints, and learnings
- Write autoresearch.sh, the benchmark script (see Benchmark Script section below)
- Commit session files
- Run baseline:
  bash autoresearch.sh
- Parse metrics from output (lines matching METRIC name=value)
- Record baseline in autoresearch.jsonl:
  - First write a config header:
    {"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
  - Then record the baseline result
- Begin the experiment loop
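The metric-parsing step can be sketched in Python as follows. This is a minimal illustration, not the bundled parse-metrics.sh script: the function name parse_metrics is hypothetical, and the only thing taken from the text is the METRIC name=value line format.

```python
import re

# Assumed line format, taken from the text: "METRIC name=value".
METRIC_RE = re.compile(r"^METRIC\s+([A-Za-z_][\w.]*)=(\S+)\s*$")


def parse_metrics(output: str) -> dict[str, float]:
    """Collect METRIC name=value lines into a dict; later lines overwrite earlier ones."""
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match is None:
            continue  # ordinary training output, not a metric line
        name, raw = match.groups()
        try:
            metrics[name] = float(raw)
        except ValueError:
            # Skip non-numeric values rather than guessing what they mean.
            continue
    return metrics
```

An empty result means no METRIC lines were found, which the loop treats as a crash.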
The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — just keep going.
The user might be asleep, away from the computer, or expecting you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder: re-read train.py for new angles, try combining previous near-misses, try more radical architectural changes.
Each iteration:
1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat
Decision Rules
- val_bpb improved (lower) → keep (commit stays, branch advances)
- val_bpb equal or worse → discard (run git revert $(git rev-parse HEAD) --no-edit)
- Crash (OOM, CUDA error, NaN loss) → discard (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- Simpler code for equal val_bpb → keep (removing complexity is a win)
- Catastrophic VRAM increase → consider discard even if val_bpb improved slightly
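The mechanical part of these rules can be written as a small function. This is a sketch under assumptions: the name decide and its parameters are hypothetical, the simpler flag stands in for the agent's own judgment about code complexity, and the VRAM rule is left out because it is a judgment call rather than a threshold.

```python
import math
from typing import Optional


def decide(best_bpb: float, new_bpb: Optional[float], simpler: bool = False) -> str:
    """Return 'keep', 'discard', or 'crash' for one experiment.

    new_bpb is None when the run produced no metric (crash, OOM, timeout).
    """
    if new_bpb is None or math.isnan(new_bpb):
        return "crash"  # no usable metric, including NaN loss
    if new_bpb < best_bpb:
        return "keep"  # lower val_bpb is better
    if new_bpb == best_bpb and simpler:
        return "keep"  # equal metric with simpler code is a win
    return "discard"
```

A crash result still requires the revert described above; the function only picks the status to log.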
Simplicity Criterion
All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.
Constraints
- Fixed 5-minute time budget. All experiments are directly comparable — the wall clock is the equalizer.
- Single file modification. Only train.py changes; prepare.py is immutable. This ensures fair comparison (same data, same evaluation).
- VRAM is a soft constraint. Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- No new packages. You can only use what's already in pyproject.toml.
- Timeout: If a run exceeds 10 minutes, kill it and treat as a crash.
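One way to enforce the 10-minute timeout is to launch the benchmark from Python with a hard limit. This is a minimal sketch: the helper name run_with_timeout and its return convention are assumptions, and only the 600-second limit and the run.log destination come from the text.

```python
import subprocess


def run_with_timeout(cmd, log_path, timeout_s=600):
    """Run cmd with stdout and stderr written to log_path.

    Returns (status, returncode): status is "ok" or "crash", and
    returncode is None when the run was killed for exceeding timeout_s.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=timeout_s,  # subprocess.run kills the child on expiry
            )
        except subprocess.TimeoutExpired:
            return "crash", None
    return ("ok" if proc.returncode == 0 else "crash"), proc.returncode
```

A call such as run_with_timeout(["bash", "autoresearch.sh"], "run.log") would mirror step 5 of the loop. Note that subprocess.run only kills the direct child, so processes started by the shell script may need separate cleanup.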
Don't Thrash
If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read train.py for new angles. Try a fundamentally different approach.
Handling User Messages
If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.
Logging to autoresearch.jsonl
Each experiment appends one JSON line:
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
Use the shared logging script:
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
--run 2 \
--commit "$(git rev-parse --short HEAD)" \
--metric 0.993 \
--status keep \
--description "increase LR to 0.04" \
--metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
--segment 0 \
--asi '{"hypothesis":"higher LR converges faster"}'
Parse metrics from benchmark output:
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
Valid statuses: keep, discard, crash, checks_failed
ASI (Actionable Side Information)
ASI is a structured annotation recorded for each experiment, and it survives reverts. When code changes are discarded, only the description and ASI remain; they are the only structured memory of what happened.
Record ASI for every experiment:
{
"hypothesis": "Deeper model with fewer steps should compress better",
"arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
"result": "val_bpb improved 0.998→0.992, but 2x VRAM",
"next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
Resuming After Context Reset
If autoresearch.jsonl and autoresearch.md exist in the working directory:
- Read autoresearch.md for full context (goal, metrics, files, constraints, learnings)
- Read autoresearch.jsonl to see all past experiments, current best, and ASI annotations
- Check git log to verify current branch state matches expected state
- If git state is dirty (unclean shutdown), revert uncommitted changes
- Resume the loop from where it left off — no re-setup needed
- Resume immediately — do not ask "should I continue?"
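Recovering the current best from the log might look like the sketch below. The function names load_session and current_best are hypothetical, the record fields follow the example JSON line shown earlier, and the sketch assumes the baseline was logged with status keep.

```python
import json


def load_session(path="autoresearch.jsonl"):
    """Split the log into its config header and the list of run records."""
    config, runs = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("type") == "config":
                config = record
            else:
                runs.append(record)
    return config, runs


def current_best(config, runs):
    """Return the best kept run, honoring bestDirection; None if nothing was kept."""
    kept = [r for r in runs if r.get("status") == "keep" and r.get("metric") is not None]
    if not kept:
        return None
    lower_is_better = (config or {}).get("bestDirection", "lower") == "lower"
    pick = min if lower_is_better else max
    return pick(kept, key=lambda r: r["metric"])
```

Comparing the commit field of the returned record against git log is one way to confirm the branch state matches the log.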
Confidence Scoring
After 3+ experiments, assess whether improvements are real or noise:
- Compute the Median Absolute Deviation (MAD) of all metric values as a noise floor
- Confidence = |best improvement| / MAD
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor
ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
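The confidence score can be sketched with the standard library. Assumptions are labeled here and in the comments: "best improvement" is taken to mean the gap between the baseline metric and the best metric, the function names are hypothetical, and a zero MAD is reported as undefined rather than as infinite confidence.

```python
from statistics import median


def mad(values):
    """Median Absolute Deviation: median of |x - median(values)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def confidence(baseline, best, values):
    """|baseline - best| / MAD over all metric values; None if MAD is zero."""
    noise = mad(values)
    if noise == 0:
        return None  # assumption: treat a zero noise floor as undefined
    return abs(baseline - best) / noise


def label(score):
    """Map a confidence score to the bands listed above."""
    if score is None:
        return "undefined"
    if score >= 2.0:
        return "likely real"
    if score >= 1.0:
        return "marginal"
    return "within noise"
```

For example, confidence(1.0, 0.5, [1.0, 0.5, 0.75]) has a median of 0.75 and a MAD of 0.25, so the score is 0.5 / 0.25 = 2.0.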
Template Architecture
prepare.py (FIXED — never modify)
- Data download: Fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- Tokenizer training: BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- Dataloader: Best-fit document packing with 100% token utilization, BOS-aligned
- Evaluation: evaluate_bpb() computes bits-per-byte (vocab-size-independent metric)
Key constants: MAX_SEQ_LEN = 2048, TIME_BUDGET = 300, EVAL_TOKENS = 40 * 524288, VOCAB_SIZE = 8192
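For intuition about why bits-per-byte does not depend on vocabulary size, here is the standard definition as a sketch. It is not the code in prepare.py: the function name bits_per_byte is hypothetical, and it assumes the loss is summed in nats over the evaluated tokens and divided by the UTF-8 byte count of the same text.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) into bits per byte of raw text.

    Normalizing by bytes rather than tokens makes the value comparable
    across tokenizers with different vocabulary sizes.
    """
    return total_nats / (math.log(2) * total_bytes)
```

A model that spends exactly one bit on every byte would score 1.0 under this definition.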
train.py (MODIFIED BY AGENT — the only editable file)
- Model: GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- Optimizer: Hybrid MuonAdamW (Muon for matr