Autoresearch ML: Autonomous LLM Training Optimization

An autonomous experiment loop for single-GPU LLM pretraining. Edit train.py → commit → run 5-minute training → measure val_bpb → keep improvement or revert → repeat forever.

This skill is self-contained — it includes everything needed to set up and run the loop.

Setup Phase

1. Copy Template Assets

Copy the bundled training template to the project directory:

cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .

2. Install and Prepare

uv sync                    # Install dependencies
uv run prepare.py          # Download data shards, train tokenizer (~2 min)

3. Verify GPU

nvidia-smi
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"

4. Initialize the Experiment Session

Create a branch: git checkout -b autoresearch/<tag>-<date>

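Ensure session files are gitignored. This is critical: git revert needs a clean working tree, so it fails if autoresearch.jsonl or run.log are tracked and have changed since the last commit.

printf 'autoresearch.jsonl\nrun.log\n' >> .gitignore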
git add .gitignore && git commit -m "autoresearch: add session files to gitignore"

Read prepare.py and train.py thoroughly to understand the codebase
Write autoresearch.md — a living session document recording goal, metrics, files in scope, constraints, and learnings
Write autoresearch.sh — the benchmark script (see Benchmark Script section below)
Commit session files
Run baseline: bash autoresearch.sh
Parse metrics from output (lines matching METRIC name=value)
Record baseline in autoresearch.jsonl:
- First write a config header: {"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
- Then record the baseline result
Begin the experiment loop
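
A minimal sketch of the metric parsing step, assuming each metric sits on its own line in exactly the form METRIC name=value with a numeric value. The function name parse_metrics is hypothetical; this is not the bundled parse-metrics.sh, only an illustration of the same idea.

```python
import re

# Assumed line format, taken from the text above: "METRIC name=value".
METRIC_RE = re.compile(r"^METRIC\s+([A-Za-z_][\w.]*)=(\S+)$")


def parse_metrics(output: str) -> dict[str, float]:
    """Return {name: value} for every well-formed METRIC line in `output`."""
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match is None:
            continue  # ordinary training log line
        name, raw = match.groups()
        try:
            metrics[name] = float(raw)
        except ValueError:
            continue  # non-numeric value: skip rather than guess
    return metrics


sample = "step 100 loss 3.2\nMETRIC val_bpb=0.993\nMETRIC peak_memory_mb=44200\n"
print(parse_metrics(sample))
```

An empty result means the run produced no metrics, which the loop treats as a crash.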

The Experiment Loop

LOOP FOREVER. Never ask "should I continue?" — just keep going.

The user might be asleep, away from the computer, or expect you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder: re-read train.py for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:

1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from run.log (step 5 redirects all output there)
7. If run.log contains no METRIC lines (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat

Decision Rules

val_bpb improved (lower) → keep (commit stays, branch advances)
val_bpb equal or worse → discard (run git revert $(git rev-parse HEAD) --no-edit)
Crash (OOM, CUDA error, NaN loss) → discard (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
Simpler code for equal val_bpb → keep (removing complexity is a win)
Catastrophic VRAM increase → consider discard even if val_bpb improved slightly
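
The rules above can be sketched as a small function. This is an illustration under assumptions, not part of the template: the name decide and its parameters are hypothetical, and the VRAM rule is left out because it is a judgment call rather than a threshold.

```python
from typing import Optional


def decide(best_bpb: float, new_bpb: Optional[float],
           crashed: bool = False, simpler: bool = False) -> str:
    """Map one experiment's outcome to a status string.

    best_bpb: best val_bpb kept so far (lower is better).
    new_bpb:  val_bpb of this run, or None if no metric was produced.
    simpler:  True if the change removed complexity (hypothetical flag
              that the agent sets by its own judgment).
    """
    if crashed or new_bpb is None:
        return "crash"      # revert, log as crash
    if new_bpb < best_bpb:
        return "keep"       # strict improvement
    if new_bpb == best_bpb and simpler:
        return "keep"       # equal metric, less code
    return "discard"        # equal or worse: revert


print(decide(0.998, 0.993))
```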

Simplicity Criterion

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.

Constraints

Fixed 5-minute time budget. All experiments are directly comparable — the wall clock is the equalizer.
Single file modification. Only train.py changes; prepare.py is immutable. This ensures fair comparison (same data, same evaluation).
VRAM is a soft constraint. Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
No new packages. You can only use what's already in pyproject.toml.
Timeout: If a run exceeds 10 minutes, kill it and treat as a crash.
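
One way to enforce the 10-minute limit is to launch the benchmark through subprocess.run with a timeout. A sketch, assuming the loop is driven from Python; run_benchmark is a hypothetical helper. Note that subprocess.run kills only the direct child on timeout, so a training process spawned by a shell wrapper may need separate cleanup.

```python
import subprocess
from typing import Optional, Sequence


def run_benchmark(cmd: Sequence[str], log_path: str = "run.log",
                  timeout_s: float = 600) -> Optional[int]:
    """Run `cmd`, sending stdout and stderr to `log_path`.

    Returns the exit code, or None if the run exceeded `timeout_s`
    (the direct child is killed; treat the run as a crash).
    """
    with open(log_path, "w") as log:
        try:
            completed = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return None
    return completed.returncode
```

Usage would look like run_benchmark(["bash", "autoresearch.sh"]); a None result is logged with status crash.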

Don't Thrash

If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read train.py for new angles. Try a fundamentally different approach.
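
A sketch of how the streak could be detected from the logged statuses. It assumes the statuses are available as a list in run order; consecutive_failures is a hypothetical helper name.

```python
def consecutive_failures(statuses: list[str]) -> int:
    """Count trailing runs that were not kept (discard, crash, checks_failed)."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


history = ["keep", "discard", "crash", "discard"]
if consecutive_failures(history) >= 3:
    print("3 failures in a row: stop and rethink the approach")
```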

Handling User Messages

If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.

Logging to autoresearch.jsonl

Each experiment appends one JSON line:

{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}

Use the shared logging script:

bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'

Parse metrics from benchmark output:

bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh

Valid statuses: keep, discard, crash, checks_failed
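
For reference, a sketch of what appending one record involves, with fields mirroring the example line above. This is an assumption-laden illustration, not the bundled log-experiment.sh: the function name append_result and its defaults are hypothetical.

```python
import json
import time

VALID_STATUSES = {"keep", "discard", "crash", "checks_failed"}


def append_result(path, run, commit, metric, status, description,
                  metrics=None, asi=None, segment=0):
    """Append one experiment record to the JSONL file at `path`."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    record = {
        "run": run,
        "commit": commit,
        "metric": metric,
        "metrics": metrics or {},
        "status": status,
        "description": description,
        "timestamp": int(time.time()),
        "segment": segment,
        "confidence": None,  # filled in once enough runs exist
        "asi": asi or {},
    }
    # One JSON object per line; append mode keeps earlier records intact.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A call such as append_result("autoresearch.jsonl", 2, "def5678", 0.993, "keep", "increase LR to 0.04") writes a line of the same shape as the example.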

ASI (Actionable Side Information)

ASI is a structured annotation attached to each experiment, and it survives reverts. When code changes are discarded, only the description and the ASI remain; together they are the only structured memory of what happened.

Record ASI for every experiment:

{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}

Resuming After Context Reset

If autoresearch.jsonl and autoresearch.md exist in the working directory:

Read autoresearch.md for full context (goal, metrics, files, constraints, learnings)
Read autoresearch.jsonl to see all past experiments, current best, and ASI annotations
Check git log to verify current branch state matches expected state
If git state is dirty (unclean shutdown), revert uncommitted changes
Resume the loop from where it left off — no re-setup needed
Resume immediately — do not ask "should I continue?"
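
A sketch of rebuilding session state from the log on resume. It assumes the file layout described earlier (one config header with "type":"config", then one record per run), that the baseline is logged with status keep, and that lower is better; load_session is a hypothetical helper.

```python
import json


def load_session(path):
    """Return (config, results, best) parsed from a JSONL session log."""
    config, results = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("type") == "config":
                config = record
            else:
                results.append(record)
    # Only kept runs with a metric can be the current best.
    kept = [r for r in results
            if r.get("status") == "keep" and r.get("metric") is not None]
    best = min(kept, key=lambda r: r["metric"]) if kept else None
    return config, results, best
```

The commit of the best record should match the tip of the branch; if it does not, check git log before resuming.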

Confidence Scoring

After 3+ experiments, assess whether improvements are real or noise:

Compute the Median Absolute Deviation (MAD) of all metric values as a noise floor
Confidence = |best improvement| / MAD
≥2.0× → likely real improvement
1.0–2.0× → marginal, could be noise
<1.0× → within noise floor

ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
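
A sketch of the confidence computation. The text does not say what "best improvement" is measured against, so this assumes it is the baseline (the first logged value) minus the lowest value seen; the function names are hypothetical.

```python
from statistics import median


def mad(values):
    """Median Absolute Deviation: median of |v - median(values)|."""
    center = median(values)
    return median([abs(v - center) for v in values])


def confidence(values):
    """Return |best improvement| / MAD, or None if it cannot be computed.

    values: metric values in run order, baseline first (lower is better).
    """
    if len(values) < 3:
        return None  # too few runs to estimate a noise floor
    noise = mad(values)
    if noise == 0:
        return None  # all deviations vanish; ratio is undefined
    improvement = values[0] - min(values)
    return abs(improvement) / noise


score = confidence([1.000, 0.998, 0.999, 0.993])
print(f"confidence: {score:.1f}x")
```

A result of 2.0 or more falls in the "likely real" band; below 1.0 the improvement is within the noise floor.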

Template Architecture

prepare.py (FIXED — never modify)

Data download: Fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
Tokenizer training: BPE tokenizer (8192 vocab) using rustbpe/tiktoken
Dataloader: Best-fit document packing with 100% token utilization, BOS-aligned
Evaluation: evaluate_bpb() computes bits-per-byte (vocab-size-independent metric)

Key constants: MAX_SEQ_LEN = 2048, TIME_BUDGET = 300, EVAL_TOKENS = 40 * 524288, VOCAB_SIZE = 8192
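
For orientation, bits-per-byte is usually defined as the summed token-level negative log-likelihood, converted from nats to bits and divided by the byte length of the evaluated text. The formula below is that common definition, not a transcription of evaluate_bpb(), which may differ in details such as how special tokens are handled. Here N is the number of evaluated tokens and B is the number of UTF-8 bytes they decode to (both symbols are introduced for this illustration). With the constants above, EVAL_TOKENS = 40 × 524288 = 20,971,520 tokens.

```latex
\mathrm{bpb} \;=\; \frac{\sum_{i=1}^{N} -\ln p\left(x_i \mid x_{<i}\right)}{B \,\ln 2}
```

Because the denominator counts bytes rather than tokens, the value does not depend on how the tokenizer splits the text, which is why the metric is vocab-size-independent.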

train.py (MODIFIED BY AGENT — the only editable file)

Model: GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
Optimizer: Hybrid MuonAdamW (Muon for matr
