What Is OpenMythos?
OpenMythos is an open-source LLM research framework, written in Python and PyTorch by Kye Gomez and the community, that models a theoretical Claude Mythos-style Recurrent-Depth Transformer. It is aimed at ML researchers, AI engineers, and CTOs. It simulates a three-stage pipeline with a Prelude, a looped Recurrent Block, and a Coda, and ships preconfigured scales from 1B to 1T parameters so teams can study depth-variable reasoning without writing a custom model stack. The repository is explicit that it is an independent reconstruction, not an Anthropic release.
What makes OpenMythos worth reading is that it translates a research hypothesis into code instead of treating the architecture as a slide deck. The implementation exposes attention modes, MoE routing, recurrence depth, and stability checks as first-class knobs, which makes it useful for ablation work, architecture comparisons, and model behavior experiments.
Quick Overview
| Attribute | Details |
|---|---|
| Type | LLM Research Frameworks |
| Best For | ML researchers, AI engineers, and CTOs |
| Language/Stack | Python, PyTorch, CUDA, FlashAttention 2 |
| License | N/A (not stated in the scraped page) |
| GitHub Stars | N/A as of Feb 2026 |
| Pricing | Open-Source |
| Last Release | N/A |
Who Should Use OpenMythos?
- Research engineers comparing standard decoder-only transformers against recurrent-depth designs, because OpenMythos exposes the loop count, injection path, and attention backend directly in config.
- ML platform teams that need a PyTorch-native reference implementation for MoE and attention experiments, especially if they already run `torchrun`, DDP, and sharded datasets.
- Indie AI founders who want to prototype adaptive-compute reasoning ideas without waiting for a managed vendor API or a closed model release.
- Architecture tinkerers validating whether MLA, GQA, and sparse expert routing change memory use or output quality under controlled settings.
Not ideal for:
- Teams that want a drop-in production model with pretrained weights, support SLAs, and turnkey hosting.
- Apps that only need a simple inference wrapper and have no use for architecture research or training code.
- GPU-constrained users who cannot satisfy CUDA and build-tool requirements for the optional Flash Attention path.
Key Features of OpenMythos
- Three-stage Recurrent-Depth Transformer layout: OpenMythos splits computation into a Prelude, a looped Recurrent Block, and a Coda. That structure is the core experiment: the same hidden state can be refined across multiple passes instead of stacking a large number of unique layers.
- Configurable loop depth: The recurrent block runs for `n_loops` up to `max_loop_iters`, which lets you test shallow and deep reasoning trajectories with the same parameter set. That is useful when you want to correlate loop count with output quality, latency, and memory use.
- Switchable attention backends: `attn_type` toggles between `mla` and `gqa`, so the same model family can be evaluated under different attention layouts. The `gqa` path is paired with Flash Attention 2 when installed, and the code falls back to manual scaled dot-product attention if the package is absent.
- Sparse MoE feed-forward path: The feed-forward stack uses routed experts plus shared experts through `n_experts`, `n_shared_experts`, and `n_experts_per_tok`. That gives the architecture compute-adaptive behavior without requiring every token to activate every expert.
- Explicit recurrence stability checks: The model exposes injection parameters such as `A` and `B`, and the example code checks the spectral radius of `A` with `torch.linalg.eigvals(A)`. That matters because recurrence can drift if the hidden-state update is unstable.
- Scale presets from 1B to 1T: The repo includes named configs like `mythos_1b`, `mythos_3b`, `mythos_10b`, all the way to `mythos_1t`. Larger presets expand context to 1M tokens and output capacity to 128k, which makes the family interesting for long-context experiments.
- Training script included: OpenMythos ships a dedicated training entrypoint for the 3B model on FineWeb-Edu. The script uses AdamW, linear warmup, cosine decay, PyTorch DDP, and bf16 on H100/A100, so you are not starting from a blank repo.
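To make the attention-backend feature more concrete, here is a minimal sketch of grouped-query attention in plain PyTorch. It is an illustration rather than the repository's implementation: the function name `gqa_attention` and the tensor layout are assumptions, and it uses the manual scaled dot-product form, not Flash Attention 2.

```python
import math
import torch


def gqa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal grouped-query attention (hypothetical helper, not the OpenMythos API).

    q:    (batch, n_heads,    seq_len, head_dim)
    k, v: (batch, n_kv_heads, seq_len, head_dim), with n_heads % n_kv_heads == 0
    """
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    group = n_heads // n_kv_heads

    # Each KV head is shared by `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])

    # Causal mask: position i may only attend to positions <= i.
    seq_len = q.shape[-2]
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device).tril()
    scores = scores.masked_fill(~causal, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q = torch.randn(2, 8, 16, 32)   # 8 query heads
k = torch.randn(2, 2, 16, 32)   # 2 KV heads, so 4 query heads per group
v = torch.randn(2, 2, 16, 32)
out = gqa_attention(q, k, v)
```

Only `n_kv_heads` key/value heads are stored, so the K and V tensors here are a quarter of the size that full multi-head attention with 8 heads would need.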
OpenMythos vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| OpenMythos | Recurrent-depth transformer research | Explicit Prelude / Recurrent Block / Coda design with MLA or GQA and MoE routing | Open-Source |
| Open R1 | Reasoning-focused model training and experiments | Better suited to reasoning and reinforcement research than architecture reconstruction | Open-Source |
| OpenSwarm | Multi-agent orchestration | Useful when the workflow is about coordinating agents, not inspecting model internals | Open-Source |
| Transformers | General-purpose model training and inference | Mature ecosystem, broad model support, and production-ready integration surface | Open-Source |
Pick Open R1 when your goal is to optimize reasoning behavior, training recipes, or post-training workflows rather than inspect a new model topology. Pick OpenSwarm when the problem is orchestration across multiple agents, tools, or tasks, because OpenMythos stays focused on the internals of a single recurrent model.
Use standard Transformers when you want a well-known baseline, a huge model zoo, and minimal friction for deployment or benchmarking. OpenMythos is the better fit when the question is architectural: does looping the same block multiple times change compute, memory, or reasoning quality?
How OpenMythos Works
OpenMythos turns the recurrent-depth hypothesis into a concrete data path. Input tokens go through a Prelude that builds the base hidden representation once, then the Recurrent Block reprocesses that state for multiple iterations, and finally the Coda resolves the result into logits; the key idea is that the model reuses weights instead of growing depth with unique layers.
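The weight-reuse idea can be shown without OpenMythos itself. The sketch below is a toy in plain PyTorch under stated assumptions: `LoopedToy` and `StackedToy` are hypothetical names, and `nn.TransformerEncoderLayer` stands in for the real recurrent block. It compares parameter counts when one block is applied four times versus four distinct layers applied once each.

```python
import torch
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


def make_layer(dim: int, n_heads: int) -> nn.Module:
    # Stand-in block; the real OpenMythos block is different.
    return nn.TransformerEncoderLayer(
        d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
        dropout=0.0, batch_first=True,
    )


class LoopedToy(nn.Module):
    """One block whose weights are reused on every iteration."""

    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.block = make_layer(dim, n_heads)

    def forward(self, x: torch.Tensor, n_loops: int = 4) -> torch.Tensor:
        for _ in range(n_loops):
            x = self.block(x)
        return x


class StackedToy(nn.Module):
    """A conventional stack: one set of weights per layer."""

    def __init__(self, dim: int = 64, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(make_layer(dim, n_heads) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


looped, stacked = LoopedToy(), StackedToy(n_layers=4)
x = torch.randn(2, 16, 64)
looped_out = looped(x, n_loops=4)
stacked_out = stacked(x)
print(count_params(looped), count_params(stacked))
```

Both models perform four block applications per forward pass, but the stacked version holds four times the parameters. Depth in the looped version becomes a runtime argument instead of a property of the checkpoint.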
The recurrent update is written as h_{t+1} = A·h_t + B·e + Transformer(h_t, e), where h_t is the current hidden state and e is the encoded input injected from the Prelude. That injection is the reason the recurrence does not lose the original prompt signal, and it is also why the example checks the spectral radius of A before trusting the run.
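The update rule above can be sketched in a few lines of PyTorch. This is a toy under loud assumptions, not the repository's code: `A` and `B` are treated as dense square matrices, `block` is a hypothetical stand-in for the Transformer term, and `A` is rescaled by hand so its spectral radius is 0.9. Only the `torch.linalg.eigvals` check mirrors what the source describes.

```python
import torch

torch.manual_seed(0)
dim = 8


def spectral_radius(matrix: torch.Tensor) -> float:
    # Largest eigenvalue magnitude; values below 1 keep the linear part contractive.
    return torch.linalg.eigvals(matrix).abs().max().item()


# Assumption: A and B are plain (dim x dim) matrices for illustration.
raw = torch.randn(dim, dim)
A = raw * (0.9 / spectral_radius(raw))   # rescale so the spectral radius is 0.9
B = 0.1 * torch.randn(dim, dim)


def block(h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for Transformer(h_t, e): a small bounded nonlinearity.
    return 0.05 * torch.tanh(h + e)


e = torch.randn(2, 16, dim)   # encoded input from the Prelude (random here)
h = e.clone()                 # initial hidden state

for _ in range(4):
    # h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
    h = h @ A.T + e @ B.T + block(h, e)

rho = spectral_radius(A)
print(f"spectral radius of A: {rho:.3f}")
```

Because `e` is added on every iteration, the prompt signal is re-injected at each step rather than only at the start. A spectral radius below 1 bounds the linear term only; it says nothing by itself about the nonlinear block.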
Attention and expert routing are the two other design decisions that matter. mla and gqa let you compare memory and compute trade-offs, while the sparse MoE feed-forward path activates only a subset of experts per token, which is the main reason the family can scale in a compute-adaptive way.
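```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

# Small configuration for a quick forward-pass check
cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    max_seq_len=128,
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type='gqa',
    n_kv_heads=2,
)

model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (2, 16))  # (batch, seq_len) token ids
logits = model(ids, n_loops=4)                   # four recurrent iterations

print(logits.shape)                                # output shape
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```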
This example instantiates the smallest practical research configuration, runs a forward pass, and exercises the loop counter so you can inspect output shape and parameter count. If you switch to mla, the config uses the extra KV and RoPE head dimensions described in the repo, and if you install the optional Flash Attention package, gqa can use the CUDA-accelerated path.
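The sparse routing mentioned earlier, with routed experts plus always-on shared experts, can be sketched as follows. This is a generic top-k mixture-of-experts toy, not the OpenMythos implementation: `ToyMoE` and its internals are assumptions, and only the argument names `n_experts`, `n_shared_experts`, `n_experts_per_tok`, and `expert_dim` are borrowed from the config above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_expert(dim: int, expert_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(dim, expert_dim), nn.GELU(), nn.Linear(expert_dim, dim))


class ToyMoE(nn.Module):
    """Top-k routed experts plus shared experts (illustrative only)."""

    def __init__(self, dim=32, expert_dim=64, n_experts=8,
                 n_shared_experts=1, n_experts_per_tok=2):
        super().__init__()
        self.k = n_experts_per_tok
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(make_expert(dim, expert_dim) for _ in range(n_experts))
        self.shared = nn.ModuleList(make_expert(dim, expert_dim) for _ in range(n_shared_experts))

    def forward(self, x: torch.Tensor):
        shape = x.shape
        tokens = x.reshape(-1, shape[-1])                # (n_tokens, dim)

        # Router scores, then keep the k best experts per token.
        scores = F.softmax(self.router(tokens), dim=-1)
        weights, idx = scores.topk(self.k, dim=-1)       # both (n_tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            token_pos, slot = (idx == expert_id).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue                                 # no token chose this expert
            gate = weights[token_pos, slot].unsqueeze(-1)
            out[token_pos] += gate * expert(tokens[token_pos])

        # Shared experts process every token regardless of routing.
        for expert in self.shared:
            out = out + expert(tokens)

        return out.reshape(shape), idx


torch.manual_seed(0)
moe = ToyMoE()
x = torch.randn(2, 16, 32)
y, chosen = moe(x)
```

Each token runs through only `n_experts_per_tok` of the routed experts, so per-token compute stays roughly constant as `n_experts` grows while total parameter count increases.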
The training side follows the same philosophy: make the architecture observable. The included training/3b_fine_web_edu.py script uses PyTorch DDP with torchrun, FineWeb-Edu streaming data, MythosTokenizer, and a warmup-plus-cosine schedule, so the codebase is aimed at controlled experiments rather than opaque model serving.
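A warmup-plus-cosine schedule of the kind described above can be written as a pure function of the step. The sketch below is generic; the function name `lr_at` and every numeric value (peak rate, floor, step counts) are placeholders chosen for illustration, not settings taken from the training script.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup_steps: int = 100, total_steps: int = 1000) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (placeholder values)."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step))
```

A function of this shape can be adapted to `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / max_lr`, since that scheduler expects a multiplier on the optimizer's base learning rate.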
Pros and Cons of OpenMythos
Pros:
- Transparent architecture: the Prelude/Recurrent/Coda split is readable in code, which makes ablations and papers easier to defend.
- PyTorch-native implementation: you can use standard debugging, profiling, and distributed training tools without a custom runtime.
- Attention flexibility: `mla` and `gqa` support side-by-side comparisons, and the Flash Attention 2 path is available for CUDA environments.
- Sparse MoE routing: token-level expert selection makes it possible to explore compute-adaptive inference without activating every parameter.
- Large preset coverage: the family spans 1B to 1T configurations, which is unusually broad for a public research repo.
- Training assets included: the repo is not just model code; it includes dataset guidance and a concrete fine-tuning/training entrypoint.
Cons:
- Theoretical reconstruction: OpenMythos is explicitly not affiliated with Anthropic, so it should not be treated as an official Claude implementation.
- No pretrained checkpoint listed: the repository page offers the architecture and scripts, but mentions no ready-to-serve model artifact.
- High hardware demand at scale: the 100B+ presets and 1M-context variants are not for casual local runs.
- CUDA dependency for the fast path: Flash Attention 2 needs CUDA and build tools, which increases setup friction.
- Research-first docs: the documentation points to API reference and dataset notes, not a product guide or deployment handbook.
Getting Started with OpenMythos
```shell
# Base install
pip install open-mythos

# Optional: Flash Attention 2 extra (needs CUDA and build tools)
pip install "open-mythos[flash]"

# Smoke test: build a small model and run one forward pass
python - <<'PY'
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    max_seq_len=128,
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type='gqa',
    n_kv_heads=2,
)

model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (2, 16))
print(model(ids, n_loops=4).shape)
PY
```
After those commands finish, you should have a working local install and a forward pass that confirms the model and its dependencies are wired correctly. The `flash` extra is only needed for the accelerated attention path; to experiment with `mla`, change the config fields to the MLA-specific values from the repo's example.
Verdict
OpenMythos is a strong option for researchers validating recurrent-depth transformer ideas who want a PyTorch reference implementation rather than a production model. Its strength is the explicit looped architecture with MoE and dual attention backends; the caveat is that it is a theoretical reconstruction, not an official vendor release. Use it for experiments, benchmarks, and architecture work.



