
openmythos

Development

Use this skill when the user is working with the OpenMythos codebase — a PyTorch implementation of a hypothesized Recurrent-Depth Transformer (RDT) architecture by Kye Gomez. Trigger on any of these signals: files named `main.py` in an `open_mythos/` directory; imports like `from open_mythos.main import OpenMythos, MythosConfig`; variant helpers `mythos_1b`/`mythos_3b`/`mythos_10b`/`mythos_50b`/`m

Author: SarthakDz · License: MIT

OpenMythos

OpenMythos is an open-source PyTorch reconstruction of a hypothesized Claude Mythos architecture, written by Kye Gomez (github.com/kyegomez/OpenMythos, MIT license). It implements a Recurrent-Depth Transformer (RDT) with three stages — Prelude (standard transformer blocks, run once), a Recurrent Block (one TransformerBlock looped up to max_loop_iters times with input injection at every step), and a Coda (standard transformer blocks, run once). Attention is switchable between GQA and MLA; the FFN inside the recurrent block is a fine-grained MoE with always-on shared experts.

The project is an independent, theoretical reconstruction. It is not affiliated with Anthropic. The README is careful with language like "suspected", "likely", and "most probable class of solution", and so is this skill — don't claim this is what Anthropic actually does internally. If the user conflates OpenMythos with real Claude internals, gently correct them.

This skill turns Claude into a careful senior engineer who knows this specific repo. That's the entire job. (There is an optional experimental appendix at the bottom for users who want Claude to roleplay reasoning in the RDT's Prelude → Loop → Coda shape, but it is off by default.)

The repo at a glance

open_mythos/
├── main.py         — MythosConfig, OpenMythos, all nn.Module classes (RMSNorm, GQAttention,
│                     MLAttention, MoEFFN, Expert, TransformerBlock, LoRAAdapter, LTIInjection,
│                     ACTHalting, RecurrentBlock), RoPE helpers, loop_index_embedding
├── variants.py     — mythos_1b / 3b / 10b / 50b / 100b / 500b / 1t preset configs
├── tokenizer.py    — MythosTokenizer wrapper (defaults to openai/gpt-oss-20b via HF)
└── __init__.py     — public re-exports

training/3b_fine_web_edu.py  — reference training script (DDP-ready via torchrun, FineWeb-Edu)
tests/                        — test_main.py, test_tokenizer.py, bench_vs_transformer.py,
                                small_benchmark.py, test_rope_debug.py
docs/                         — open_mythos.md (full class reference), datasets.md
examples/                     — moda_example.py, variants_example.py
example.py                    — minimal end-to-end sanity script at repo root

The forward pass — hold this in your head

input_ids
  ↓ embed
  ↓ Prelude: prelude_layers × TransformerBlock (dense SwiGLU FFN, no MoE)
  e = x  ← encoded input is frozen here, re-injected every loop
  ↓
  RecurrentBlock (one block, looped up to n_loops times; uses MoE FFN):
    for t in range(n_loops):
        h_loop = loop_index_embedding(h, t, dim//8)   # RoPE-like signal on a slice of channels
        combined = RMSNorm(h_loop + e)                 # input injection into normed stream
        trans_out = TransformerBlock(combined)
        trans_out = trans_out + LoRAAdapter(trans_out, t)   # per-depth LoRA delta
        h = A · h + B · e + trans_out                 # LTI-stable update (see below)
        p = sigmoid(halt(h))                          # ACT per-position halting probability
        # ACT remainder trick: if cumulative_p + p ≥ threshold, emit (1 - cumulative_p) as weight
        # gate by still_running so each position contributes exactly once on its halting step
        h_out += weight · h
  ↓
  Coda: coda_layers × TransformerBlock (dense SwiGLU FFN, no MoE)
  ↓ RMSNorm → LM head (weight-tied with embedding) → logits
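
The ACT weighting in the loop above can be sketched for a single position in plain Python. This is a minimal illustration, not the repo's tensor code: the function name `act_weights` and the `use_gate` flag are hypothetical, and the halting probabilities are made-up numbers chosen so that the cumulative probability lands just above the 0.99 threshold.

```python
def act_weights(halt_probs, threshold=0.99, use_gate=True):
    """Per-step ACT weights for ONE position (scalar sketch, hypothetical helper).

    halt_probs: the halting probability p emitted at each loop step.
    use_gate:   when False, skips the still_running gate to show the leak.
    """
    cumulative_p = 0.0
    halted = False
    weights = []
    for p in halt_probs:
        still_running = (not halted) or (not use_gate)
        if cumulative_p + p >= threshold:
            # Remainder trick: emit whatever probability mass is left.
            weight = max(1.0 - cumulative_p, 0.0)
        else:
            weight = p
        if not still_running:
            weight = 0.0  # a halted position contributes nothing further
        weights.append(weight)
        if still_running:
            cumulative_p += p
        halted = halted or cumulative_p >= threshold
    return weights


probs = [0.5, 0.495, 0.7]  # illustrative values only
gated = act_weights(probs)
ungated = act_weights(probs, use_gate=False)
```

With the gate, the position halts on step 2 and contributes nothing on step 3. Without it, the step-3 remainder `1 - 0.995` is still emitted, which is the leak the gate is meant to prevent.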

Autoregressive generation uses KV caching with a separate cache key per loop depth (recurrent_loop_{t}) so every loop at every decode step finds populated keys.
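
A toy sketch of the per-depth cache keys, assuming a plain dict stands in for the KV cache and a string stands in for the cached tensors. Only the `recurrent_loop_{t}` key format comes from the description above; the function `recurrent_loop` and its `halt_step` parameter are hypothetical.

```python
def recurrent_loop(n_loops, halt_step, kv_cache=None):
    """Count executed loop iterations; populate one cache entry per depth.

    halt_step: hypothetical stand-in for "all positions have halted by step t".
    kv_cache:  dict used as a toy KV cache, or None when caching is off.
    """
    executed = 0
    for t in range(n_loops):
        if kv_cache is not None:
            # Stand-in for attention appending this step's K/V at depth t.
            kv_cache.setdefault(f"recurrent_loop_{t}", []).append("kv")
        executed += 1
        all_halted = t >= halt_step
        # Early exit is only safe when no cache has to stay fully populated.
        if all_halted and kv_cache is None:
            break
    return executed


steps_without_cache = recurrent_loop(n_loops=4, halt_step=1)
cache = {}
steps_with_cache = recurrent_loop(n_loops=4, halt_step=1, kv_cache=cache)
```

Without a cache the loop stops as soon as everything has halted. With a cache all four depths run, so a later decode step finds an entry under every key.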

Non-negotiable invariants — if you break these, the model breaks

  1. ρ(A) < 1 always. The entire reason LTIInjection exists is to guarantee this by construction. A = exp(-exp(log_dt + log_A)) sits element-wise in (0, 1). Never replace this with a free parameter, never initialize A as a raw nn.Parameter of shape (dim,), never remove the clamp(-20, 20) — that clamp exists so log_dt → -∞, log_A → +∞ doesn't produce 0 · inf = NaN. If the user sees spectral-radius drift or residual explosion, this is the first thing to check.
  2. e is frozen across loops. e is set once after the Prelude and re-injected at every loop iteration. This is what prevents drift across arbitrary recurrence depth. If someone accidentally recomputes e inside the loop, they have silently changed the architecture.
  3. MoE lives only in the Recurrent Block. Prelude and Coda use dense SwiGLU FFNs (use_moe=False). The recurrent block uses use_moe=True. This separation is intentional: MoE provides breadth across domains inside the looped core; the Prelude/Coda are thin encode/decode shells.
  4. Weight-tying on the LM head. self.head.weight = self.embed.weight. Don't break this by reinitializing head after construction.
  5. Causal mask dtype matches activation dtype. The _causal_mask static method explicitly takes dtype because adding an fp32 mask to a bf16 activation stream silently upcasts the attention logits to fp32, and the following attention-weights × V matmul then fails on mismatched dtypes. If you see a dtype error in the attention kernel, this is the usual suspect.
  6. Loop-index embedding occupies a slice of channels, not all of them. self.loop_dim = cfg.dim // 8. The idea is that only a fraction of the residual stream carries the loop-index signal, leaving the rest undisturbed. Don't promote this to full-dim.
  7. ACT remainder trick with still_running gating. When act_threshold < 1.0 (it's 0.99 by default), a naive cumulative-probability update leaks a non-zero remainder on every subsequent step. The still_running = ~halted gate ensures each position contributes its halting weight exactly once. Don't remove it "to simplify".
  8. Don't break the loop when a KV cache is present. If kv_cache is None and all positions have halted, breaking is fine. With a cache, every loop depth must execute on every prefill/decode step so that later decode steps find populated keys at every cache_key. This is explicit in RecurrentBlock.forward.
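
Invariant 1 can be illustrated with a scalar sketch: one channel of A, computed with Python's `math` module instead of torch. It assumes the clamp is applied to the sum `log_dt + log_A` before the inner exponential. The helper names `lti_decay` and `iterate` are hypothetical, and only the A · h + B · e part of the update is iterated (the transformer output is left out).

```python
import math


def lti_decay(log_dt: float, log_A: float) -> float:
    """One channel of A = exp(-exp(clamp(log_dt + log_A, -20, 20))).

    Assumption: the clamp wraps the summed log-parameters.
    """
    s = min(max(log_dt + log_A, -20.0), 20.0)
    return math.exp(-math.exp(s))


def iterate(a: float, b: float, e: float, steps: int) -> float:
    """Run h = a*h + b*e from h = 0; contracts to b*e / (1 - a) when a < 1."""
    h = 0.0
    for _ in range(steps):
        h = a * h + b * e
    return h


a_typical = lti_decay(0.0, 0.0)   # exp(-1)
a_slow = lti_decay(-1e6, 5.0)     # sum clamped at -20, A just below 1
a_fast = lti_decay(50.0, 50.0)    # sum clamped at +20, A underflows to 0.0
h_final = iterate(a_typical, b=1.0, e=1.0, steps=200)
```

Whatever the raw parameters are, the decay stays below 1 in this sketch, so the injected state settles at a finite fixed point instead of growing with loop depth.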

Conventions used throughout main.py

  • nn.Module subclasses have full docstrings with Args/Returns. Match the style when adding new modules; don't regress to terse or missing docstrings.
  • RMSNorm, never LayerNorm.
  • RoPE is applied to Q and K before KV caching, so cached values don't need to be re-rotated on retrieval. Keep this ordering.
  • GQA uses the full per-head dim for RoPE; MLA uses only qk_rope_head_dim (the decoupled/split-RoPE scheme). The model registers two separate freqs_cis buffers and selects the right one based on cfg.attn_type. If you add a third attention type, register its own freqs buffer.
  • Flash Attention 2 is optional. GQAttention probes _HAS_FLASH_ATTN and falls back transparently to manual SDPA. Keep the fallback path — CPU tests run without flash-attn.
  • Weight init: N(0, 0.02) for every nn.Linear and nn.Embedding. Don't add per-layer init schemes without explicit reason.
  • Dropout defaults to 0.0 (research default for pretraining sanity runs); 0.1 is standard when the user actually trains.

Variant-scaling discipline

When asked to add or tune a scale variant in variants.py, stay consistent with the existing table: dim, n_heads roughly dim // 128, n_kv_heads roughly n_heads // 4 (GQA) or 8–16 for large MLA, expert_dim solved from the residual parameter budget after all other terms. The header comment in variants.py is authoritative:

total ≈ embed + prelude/coda dense blocks + recurrent MLA + MoE
MoE   = 3 * dim * expert_dim * (n_experts + n_shared * n_experts_per_tok)

Don't blindly copy a smaller config up — larger scales intentionally bump n_shared_experts, n_experts_per_tok, and lora_rank, and the 100B+ tier raises rope_theta and enables max_output_tokens=131072.
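
Solving expert_dim from the residual budget is simple arithmetic on the MoE formula above. The sketch below assumes expert_dim is rounded down to a multiple of 64, which is an illustrative choice rather than something the source states. The helper names and the example numbers (dim=2048, 64 routed experts, 2 shared, 4 per token, a 1e9-parameter MoE budget) are hypothetical and do not correspond to any preset in variants.py.

```python
def moe_params(dim, expert_dim, n_experts, n_shared, n_experts_per_tok):
    """MoE parameter count, per the formula in the variants.py header comment."""
    return 3 * dim * expert_dim * (n_experts + n_shared * n_experts_per_tok)


def solve_expert_dim(moe_budget, dim, n_experts, n_shared, n_experts_per_tok,
                     multiple_of=64):
    """Largest expert_dim (rounded down to `multiple_of`) that fits the budget.

    The rounding granularity is an assumption made for this sketch.
    """
    per_unit = 3 * dim * (n_experts + n_shared * n_experts_per_tok)
    raw = moe_budget / per_unit
    return max(multiple_of, int(raw // multiple_of) * multiple_of)


budget = 1_000_000_000  # hypothetical residual budget left for the MoE
expert_dim = solve_expert_dim(budget, dim=2048, n_experts=64,
                              n_shared=2, n_experts_per_tok=4)
used = moe_params(2048, expert_dim, 64, 2, 4)
```

Checking `used` against the budget after rounding is a cheap way to confirm the chosen expert_dim does not overshoot the target size.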

Training script conventions (training/3b_fine_web_edu.py)

  • AdamW, linear warmup (2000 steps) →

How to add

/plugin marketplace add SarthakDz/OpenMythos-Skill

The exact command may vary by repository. Check the README on GitHub.
