OpenMythos
OpenMythos is an open-source PyTorch reconstruction of a hypothesized Claude Mythos architecture, written by Kye Gomez (github.com/kyegomez/OpenMythos, MIT license). It implements a Recurrent-Depth Transformer (RDT) with three stages — Prelude (standard transformer blocks, run once), a Recurrent Block (one TransformerBlock looped up to max_loop_iters times with input injection at every step), and a Coda (standard transformer blocks, run once). Attention is switchable between GQA and MLA; the FFN inside the recurrent block is a fine-grained MoE with always-on shared experts.
The project is an independent, theoretical reconstruction. It is not affiliated with Anthropic. The README is careful with language like "suspected", "likely", and "most probable class of solution", and so is this skill — don't claim this is what Anthropic actually does internally. If the user conflates OpenMythos with real Claude internals, gently correct them.
This skill turns Claude into a careful senior engineer who knows this specific repo. That's the entire job. (There is an optional experimental appendix at the bottom for users who want Claude to roleplay reasoning in the RDT's Prelude → Loop → Coda shape, but it is off by default.)
The repo at a glance
open_mythos/
├── main.py — MythosConfig, OpenMythos, all nn.Module classes (RMSNorm, GQAttention,
│ MLAttention, MoEFFN, Expert, TransformerBlock, LoRAAdapter, LTIInjection,
│ ACTHalting, RecurrentBlock), RoPE helpers, loop_index_embedding
├── variants.py — mythos_1b / 3b / 10b / 50b / 100b / 500b / 1t preset configs
├── tokenizer.py — MythosTokenizer wrapper (defaults to openai/gpt-oss-20b via HF)
└── __init__.py — public re-exports
training/3b_fine_web_edu.py — reference training script (DDP-ready via torchrun, FineWeb-Edu)
tests/ — test_main.py, test_tokenizer.py, bench_vs_transformer.py,
small_benchmark.py, test_rope_debug.py
docs/ — open_mythos.md (full class reference), datasets.md
examples/ — moda_example.py, variants_example.py
example.py — minimal end-to-end sanity script at repo root
The forward pass — hold this in your head
input_ids
↓ embed
↓ Prelude: prelude_layers × TransformerBlock (dense SwiGLU FFN, no MoE)
e = x ← encoded input is frozen here, re-injected every loop
↓
RecurrentBlock (one block, looped up to n_loops times; uses MoE FFN):
  for t in range(n_loops):
    h_loop = loop_index_embedding(h, t, dim//8) # RoPE-like signal on a slice of channels
    combined = RMSNorm(h_loop + e) # input injection into normed stream
    trans_out = TransformerBlock(combined)
    trans_out = trans_out + LoRAAdapter(trans_out, t) # per-depth LoRA delta
    h = A · h + B · e + trans_out # LTI-stable update (see below)
    p = sigmoid(halt(h)) # ACT per-position halting probability
    # ACT remainder trick: if cumulative_p + p ≥ threshold, emit (1 - cumulative_p) as weight
    # gate by still_running so each position contributes exactly once on its halting step
    h_out += weight · h
↓
Coda: coda_layers × TransformerBlock (dense SwiGLU FFN, no MoE)
↓ RMSNorm → LM head (weight-tied with embedding) → logits
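The ACT weighting in the loop above can be sketched for a single position in plain Python. This is a scalar stand-in under stated assumptions, not the repo's tensor code: the function name act_weights and the per-step probabilities are hypothetical, and only the 0.99 threshold and the still_running gate come from this document.

```python
def act_weights(halt_probs, threshold=0.99):
    """Per-step output weights for ONE position (illustrative scalar version)."""
    cumulative_p = 0.0
    halted = False
    weights = []
    for p in halt_probs:
        still_running = not halted
        # Remainder trick: on the step that crosses the threshold, emit what is
        # left of the probability mass instead of the raw p.
        weight = (1.0 - cumulative_p) if cumulative_p + p >= threshold else p
        # Gate: a position that has already halted contributes nothing further.
        if not still_running:
            weight = 0.0
        weights.append(weight)
        if still_running:
            cumulative_p += p
            halted = cumulative_p >= threshold
    return weights


# Hypothetical halting probabilities for four loop iterations.
weights = act_weights([0.25, 0.5, 0.5, 0.75])
print(weights, sum(weights))
```

In this example the position halts on the third iteration, so the fourth weight is zero and the weights sum to one. Without the gate, every later step would keep satisfying the threshold test and emit another remainder.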
Autoregressive generation uses KV caching with a separate cache key per loop depth (recurrent_loop_{t}) so every loop at every decode step finds populated keys.
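A toy sketch of the per-depth cache keys, assuming (as the invariants in this document state) that early exit is allowed only when no cache is passed. The function run_recurrent_loops, the halted_after argument, and the counter used in place of real K/V tensors are all hypothetical; only the recurrent_loop_{t} key format comes from the repo description.

```python
def run_recurrent_loops(n_loops, halted_after, kv_cache=None):
    """Toy stand-in for the recurrent loop: returns the depths that executed.

    halted_after is the hypothetical depth at which every position has halted.
    """
    executed = []
    for t in range(n_loops):
        cache_key = f"recurrent_loop_{t}"
        if kv_cache is not None:
            # A real block would append this step's K/V tensors under cache_key;
            # a simple counter stands in for them here.
            kv_cache[cache_key] = kv_cache.get(cache_key, 0) + 1
        executed.append(t)
        all_halted = t >= halted_after
        if all_halted and kv_cache is None:
            break  # early exit is only safe when nothing will read the cache later
    return executed


no_cache = run_recurrent_loops(4, halted_after=1)
cache = {}
with_cache = run_recurrent_loops(4, halted_after=1, kv_cache=cache)
print(no_cache, with_cache, sorted(cache))
```

Without a cache the loop stops once everything has halted. With a cache, all four depths run, so every key from recurrent_loop_0 to recurrent_loop_3 is populated for the next decode step.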
Non-negotiable invariants — if you break these, the model breaks
- ρ(A) < 1 always. The entire reason LTIInjection exists is to guarantee this by construction. A = exp(-exp(log_dt + log_A)) sits element-wise in (0, 1). Never replace this with a free parameter, never initialize A as a raw nn.Parameter of shape (dim,), never remove the clamp(-20, 20); that clamp exists so log_dt → -∞, log_A → +∞ doesn't produce 0 · inf = NaN. If the user sees spectral-radius drift or residual explosion, this is the first thing to check.
- e is frozen across loops. e is set once after the Prelude and re-injected at every loop iteration. This is what prevents drift across arbitrary recurrence depth. If someone accidentally recomputes e inside the loop, they have silently changed the architecture.
- MoE lives only in the Recurrent Block. Prelude and Coda use dense SwiGLU FFNs (use_moe=False). The recurrent block uses use_moe=True. This separation is intentional: MoE provides breadth across domains inside the looped core; the Prelude/Coda are thin encode/decode shells.
- Weight-tying on the LM head. self.head.weight = self.embed.weight. Don't break this by reinitializing head after construction.
- Causal mask dtype matches activation dtype. The _causal_mask static method explicitly takes dtype because a bf16 activation stream with an fp32 additive mask silently upcasts attention logits to fp32, then the attn-vs-V matmul breaks. If you see a dtype error in the attention kernel, this is the usual suspect.
- Loop-index embedding occupies a slice of channels, not all of them. self.loop_dim = cfg.dim // 8. The idea is that only a fraction of the residual stream carries the loop-index signal, leaving the rest undisturbed. Don't promote this to full-dim.
- ACT remainder trick with still_running gating. When act_threshold < 1.0 (it's 0.99 by default), a naive cumulative-probability update leaks a non-zero remainder on every subsequent step. The still_running = ~halted gate ensures each position contributes its halting weight exactly once. Don't remove it "to simplify".
- Don't break the loop when a KV cache is present. If kv_cache is None and all positions have halted, breaking is fine. With a cache, every loop depth must execute on every prefill/decode step so that later decode steps find populated keys at every cache_key. This is explicit in RecurrentBlock.forward.
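The first invariant can be checked with a scalar sketch of the decay parameterization. It uses math.exp on plain floats rather than torch tensors. The helper names lti_decay and lti_step and the three-channel numbers are hypothetical, and applying the clamp to the sum log_dt + log_A before exponentiating is an assumption about where the repo places it.

```python
import math


def lti_decay(log_dt: float, log_A: float) -> float:
    """A = exp(-exp(clamp(log_dt + log_A, -20, 20))), one channel at a time."""
    z = max(-20.0, min(20.0, log_dt + log_A))  # assumed clamp location (log space)
    return math.exp(-math.exp(z))


def lti_step(h, e, trans_out, A, B):
    """Element-wise update h <- A*h + B*e + trans_out over plain lists."""
    return [a * hi + b * ei + ti for a, b, hi, ei, ti in zip(A, B, h, e, trans_out)]


# Hypothetical 3-channel example with the transformer delta held at zero.
A = [lti_decay(0.0, x) for x in (-2.0, 0.0, 2.0)]
B = [0.1, 0.1, 0.1]
e = [0.5, -0.5, 0.0]
h = [1.0, 1.0, 1.0]
for _ in range(200):
    h = lti_step(h, e, [0.0, 0.0, 0.0], A, B)

# With |A| < 1 the state settles at the fixed point B*e / (1 - A).
fixed_point = [b * ei / (1.0 - a) for a, b, ei in zip(A, B, e)]
print(A, h, fixed_point)
```

Because every entry of A stays below one, the state contracts toward a fixed point set by the frozen e instead of growing with loop count. A free parameter for A would carry no such guarantee.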
Conventions used throughout main.py
- nn.Module subclasses have full docstrings with Args/Returns. Match the style when adding new modules; don't regress to terse or missing docstrings.
- RMSNorm, never LayerNorm.
- RoPE is applied to Q and K before KV caching, so cached keys don't need to be re-rotated on retrieval. Keep this ordering.
- GQA uses the full per-head dim for RoPE; MLA uses only qk_rope_head_dim (the decoupled/split-RoPE scheme). The model registers two separate freqs_cis buffers and selects the right one based on cfg.attn_type. If you add a third attention type, register its own freqs buffer.
- Flash Attention 2 is optional. GQAttention probes _HAS_FLASH_ATTN and falls back transparently to manual SDPA. Keep the fallback path: CPU tests run without flash-attn.
- Weight init: N(0, 0.02) for every nn.Linear and nn.Embedding. Don't add per-layer init schemes without explicit reason.
- Dropout defaults to 0.0 (research default for pretraining sanity runs); 0.1 is standard when the user actually trains.
Variant-scaling discipline
When asked to add or tune a scale variant in variants.py, stay consistent with the existing table: dim, n_heads roughly dim // 128, n_kv_heads roughly n_heads // 4 (GQA) or 8–16 for large MLA, expert_dim solved from the residual parameter budget after all other terms. The header comment in variants.py is authoritative:
total ≈ embed + prelude/coda dense blocks + recurrent MLA + MoE
MoE = 3 * dim * expert_dim * (n_experts + n_shared * n_experts_per_tok)
Don't blindly copy a smaller config up — larger scales intentionally bump n_shared_experts, n_experts_per_tok, and lora_rank, and the 100B+ tier raises rope_theta and enables max_output_tokens=131072.
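The MoE term of the budget formula can be turned into a small calculator for sizing expert_dim. This is a sketch under stated assumptions: the function names, the example budget, and the rounding to a multiple of 64 are hypothetical, and only the formula itself is taken from the header comment quoted above.

```python
def moe_params(dim, expert_dim, n_experts, n_shared, n_experts_per_tok):
    """MoE parameter estimate, following the formula quoted from variants.py."""
    return 3 * dim * expert_dim * (n_experts + n_shared * n_experts_per_tok)


def solve_expert_dim(moe_budget, dim, n_experts, n_shared, n_experts_per_tok,
                     multiple=64):
    """Largest expert_dim that fits moe_budget, rounded down to `multiple`.

    Rounding to a multiple of 64 is an illustrative choice, not a repo rule.
    """
    per_unit = 3 * dim * (n_experts + n_shared * n_experts_per_tok)
    raw = moe_budget // per_unit
    return max(multiple, (raw // multiple) * multiple)


# Hypothetical numbers: a 500M-parameter MoE budget at dim=2048.
expert_dim = solve_expert_dim(500_000_000, dim=2048, n_experts=64,
                              n_shared=2, n_experts_per_tok=4)
used = moe_params(2048, expert_dim, 64, 2, 4)
print(expert_dim, used)
```

The residual budget passed in here is whatever remains after the embedding, Prelude/Coda, and recurrent attention terms have been subtracted from the target total.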
Training script conventions (training/3b_fine_web_edu.py)
- AdamW, linear warmup (2000 steps) →