What Is nanowhale?
nanowhale is a 110M-parameter language model repository that Hugging Face contributor cpatino built from scratch on the DeepSeek-V4 architecture. It targets ML engineers, LLM researchers, and model tinkerers who need a runnable codebase for pretraining, supervised fine-tuning, tokenizer work, chat inference, and architecture debugging. The repository includes the model code, configs, tokenizer, and two published checkpoints: nanowhale-100m-base and nanowhale-100m.
The key value is that nanowhale reproduces a modern sparse-routed LLM stack at miniature scale, so you can inspect every moving part without the hardware bill of a multi-billion-parameter system. It is not a polished product layer; it is a compact research implementation that makes the training and inference path visible.
Quick Overview
| Attribute | Details |
|---|---|
| Type | Open-Source Language Models |
| Best For | ML engineers, LLM researchers, and model tinkerers |
| Language/Stack | PyTorch, Hugging Face TRL/SFTTrainer, YAML configs, custom DeepSeek-V4 modules |
| License | MIT |
| GitHub Stars | N/A as of Feb 2026 |
| Pricing | Open-Source |
| Last Release | N/A |
Who Should Use nanowhale?
nanowhale is the right fit when you need a small but faithful reference for a modern LLM architecture. It is especially useful for people who care about implementation details, not just checkpoint downloads.
- Researchers validating architecture ideas who want to test MLA, MoE, and hyper-connections without scaling to a giant training run.
- ML engineers building custom training loops who need a concrete reference for SFTTrainer, config-driven runs, tokenizer loading, and checkpoint upload flows.
- Indie hackers exploring local LLM experiments who want a model they can actually inspect, modify, and run on a single H100-class GPU.
- Open-model maintainers who need a minimal reproducible baseline for comparing routing, attention, and token prediction choices.
Not ideal for:
- Teams that need a production chat model with broad instruction-following quality and low hallucination rates.
- Users who want a plug-and-play from_pretrained experience with no custom loading logic.
- Inference stacks that must stay on bf16 only, since the repo documents NaN issues at this scale.
Key Features of nanowhale
- DeepSeek-V4 at miniature scale: nanowhale implements the full DeepSeek-V4 feature set in a 110M-parameter model, which makes it a rare end-to-end reference for modern sparse LLM design. The repo is useful when you want to study the interaction between architecture choices instead of reading about them in a paper.
- Multi-Head Latent Attention: the model uses MLA with 8 heads, 1 KV head, a 96-dim head size, and a 32 RoPE + 64 NoPE split. That structure reduces the usual attention footprint while still exposing the same design trade-offs you would see in a larger DeepSeek-style model.
- Mixture-of-Experts routing: nanowhale includes 4 routed experts plus 1 shared expert with top-2 routing and a SwiGLU feed-forward block. This is the part you inspect when you want to understand expert sparsity, router behavior, and where capacity is actually spent.
- Hyper-Connections: the repo uses hc_mult=4 and Sinkhorn routing with 2 iterations. The authors explicitly note that this layer can overflow bf16 at small scale, which is a useful reminder that architectural tricks can have precision costs.
- Multi-Token Prediction: nanowhale includes one MTP prediction layer, giving you a concrete implementation of speculative-style auxiliary prediction instead of a toy placeholder. This matters if you are comparing loss shaping and sequence modeling behavior.
- Tokenizer and vocab details: the model ships with a 129,280-token DeepSeek-V4 tokenizer, which is large relative to the model size. That choice is instructive because it shows the trade-off between lexical coverage and parameter budget, and it explains why the embedding table consumes 37% of parameters.
- Training and evaluation scripts: the repo includes pretraining, SFT, smoke eval, chat, upload, data prep, and parameter-counting utilities. That makes nanowhale more than a checkpoint dump; it is a full training loop you can fork into an internal research stack or pair with Open R1 for adjacent open-model experimentation.
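To make the routing bullet above concrete, here is a minimal pure-Python sketch of top-2 routing over 4 routed experts plus 1 always-on shared expert. It is not taken from the repository: the softmax scoring, the renormalization of the two selected weights, and the toy scalar "experts" are all assumptions chosen for illustration, and the names `top2_route` and `moe_forward` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Assumption: scores are softmax probabilities, renormalized over the top 2.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert always runs; only the two selected routed experts run."""
    out = shared_expert(x)
    for idx, weight in top2_route(router_logits):
        out += weight * routed_experts[idx](x)
    return out


# Toy scalar experts standing in for SwiGLU feed-forward blocks (hypothetical).
routed = [lambda x, k=k: (k + 1) * x for k in range(4)]
shared = lambda x: 0.5 * x

logits = [0.1, 2.0, -1.0, 1.0]  # one token's router logits (made up)
selection = top2_route(logits)
y = moe_forward(1.0, logits, routed, shared)
print(selection, y)
```

Only two of the four routed experts execute per token, which is where the sparsity saving comes from; the shared expert contributes to every token regardless of the router's choice.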
nanowhale vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| nanowhale | Architecture-level LLM experimentation | Full DeepSeek-V4 feature set in a tiny 110M model | Open-Source |
| Open R1 | Open reasoning research workflows | Research-oriented open model stack with reasoning focus | Open-Source |
| Llama 3 8B | General-purpose baselines | Much larger, stronger default quality, less inspectable internals | Open-Source / Open-Weights |
| Mistral 7B | Efficient general chat baselines | Strong general-purpose performance with simpler deployment path | Open-Weights |
Pick nanowhale when you care about how the model is built, not just how it answers prompts. If you want an open research workflow around reasoning systems rather than a model implementation, Open R1 is the more relevant adjacent choice.
Choose Llama 3 8B when you need a practical baseline for app integration, because it is far more capable as a general assistant and less likely to expose edge-case training instability. Choose Mistral 7B when you want a stronger deployment target with a simpler serving story and do not need the architectural micro-details.
If your work starts with data curation and corpus management rather than model code, pair nanowhale with DataHaven and build the dataset pipeline first. That split keeps nanowhale focused on model behavior while a separate system handles file ingestion, filtering, and training-set hygiene.
How nanowhale Works
nanowhale works by wiring a compact DeepSeek-V4-style transformer around a custom configuration, tokenizer, and training pipeline. The model keeps the architecture faithful while shrinking the size to about 110M parameters, which is small enough to inspect layer-by-layer but large enough to show real routing, attention, and optimization behavior.
The core abstraction is a standard causal language model with several non-standard blocks layered in: MLA for attention efficiency, MoE for sparse expert capacity, Hyper-Connections for routed feature mixing, and MTP for auxiliary next-token prediction. The repo stores those choices in configuration_deepseek_v4.py and modeling_deepseek_v4.py, which makes the architecture readable instead of hiding it behind a monolith.
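As an illustration of the Sinkhorn step mentioned for Hyper-Connections, the sketch below alternately normalizes the rows and columns of a positive 4x4 matrix for 2 iterations. It assumes the normalization acts on an hc_mult x hc_mult mixing matrix, which the text does not state; the function name `sinkhorn` and the input values are hypothetical, and the repository's actual implementation may differ.

```python
def sinkhorn(matrix, n_iters=2):
    """Alternate row and column normalization of a square positive matrix.

    With enough iterations the result approaches a doubly stochastic matrix
    (every row and every column sums to 1).
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make each column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Made-up positive weights for a 4 x 4 mixing matrix (hc_mult = 4 assumed).
weights = [
    [1.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 1.0],
    [2.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 4.0],
]
balanced = sinkhorn(weights, n_iters=2)
for row in balanced:
    print([round(v, 3) for v in row])
```

Because the loop ends on a column step, columns sum to 1 up to rounding while rows are only approximately normalized after 2 iterations.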
The training scripts use config-driven entry points, so the same code path can run a full pretrain loop, a short debug run, or SFT on a chat corpus. This design favors reproducibility and fast experiments, but it also expects you to understand your runtime: the authors warn that from_pretrained can reinitialize some weights and that bf16 can produce NaNs at this scale.
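Given the NaN warning, one cheap safeguard is to scan logged losses for the first non-finite value so a run can be stopped early. The helper below is a hypothetical sketch, not part of the repository, and the loss values are invented for the example.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Invented loss trace in which step 3 has gone to NaN.
logged = [11.8, 9.4, 7.9, float("nan"), 7.5]
bad_step = first_non_finite(logged)
print(bad_step)
```

In a real loop the same check would run on each step's loss before the optimizer update, which makes it easier to tell a precision problem from a data or configuration problem.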
python scripts/train_pretrain.py --config configs/debug.yaml
That command launches a short pretraining run using the debug config, which is useful for verifying that your environment, tokenizer, and model wiring are correct before you spend compute. Expect logs, loss values, and checkpoint activity rather than a usable assistant after one run; the point is validation, not quality.
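A quick sanity check for the first logged loss: a randomly initialized model that spreads probability roughly uniformly over the 129,280-token vocabulary should start near ln(129,280) in cross-entropy. The sketch below assumes the logged value is the mean next-token cross-entropy in nats with no auxiliary terms added; if the scripts fold MTP or routing losses into the reported number, the starting value will differ. The helper name and tolerance are hypothetical.

```python
import math

VOCAB_SIZE = 129_280  # tokenizer size stated for nanowhale

# Cross-entropy of a uniform prediction over the vocabulary, in nats.
uniform_loss = math.log(VOCAB_SIZE)


def looks_untrained(first_loss, tolerance=1.0):
    """True if the first logged loss is close to the uniform-prediction baseline."""
    return abs(first_loss - uniform_loss) <= tolerance


print(round(uniform_loss, 2))
print(looks_untrained(11.9), looks_untrained(3.0))
```

A first-step loss far below this baseline usually points to a data leak or a resumed checkpoint, while a value far above it suggests an initialization or precision problem.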
Pros and Cons of nanowhale
Pros:
- Faithful architecture reference: nanowhale exposes MLA, MoE, hyper-connections, and MTP in one repo, which is rare for open model code.
- Small enough to study: at ~110M parameters, it is dramatically easier to reason about than multi-billion-parameter systems.
- Training pipeline included: pretraining, SFT, evaluation, chat, and upload scripts are all present, so the repo is end-to-end.
- Clear performance telemetry: the repo reports 72 ms/step on a single H100 80GB with torch.compile, which gives you a concrete benchmark target.
- Useful for precision debugging: the documented bf16 NaN problem makes nanowhale a practical case study for numerical stability issues.
- MIT license: the code can be reused, forked, and embedded into internal research workflows without license friction.
Cons:
- bf16 instability is real: the repo explicitly says fp32 is safer for inference and training because hyper-connections can overflow bf16.
- from_pretrained is awkward: custom architecture behavior means you may need manual load_state_dict handling instead of a clean Hugging Face one-liner.
- Large vocab for a small model: the 129K embedding table consumes 37% of parameters, which reduces capacity available for core language modeling.
- Short context window: 2,048 tokens is fine for experiments, but it is small for long-document workflows.
- Not a production assistant: the model is a research artifact with documented limitations, not a polished general chat system.
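The vocabulary cost can be turned into rough numbers using only the figures quoted above (about 110M parameters, a 129,280-token vocabulary, a 37% embedding share). This is a back-of-envelope sketch: the inputs are rounded, the implied width is an estimate and not a value read from the config, and it assumes the 37% refers to a single vocab-by-width table.

```python
TOTAL_PARAMS = 110_000_000   # approximate model size from the text
VOCAB_SIZE = 129_280         # tokenizer size from the text
EMBEDDING_SHARE = 0.37       # share of parameters in the embedding table

# Parameters spent on the embedding table.
embedding_params = EMBEDDING_SHARE * TOTAL_PARAMS

# Width implied by a single vocab x width table (an estimate, not a config value).
implied_width = embedding_params / VOCAB_SIZE

# Parameters left for attention, experts, and everything else.
remaining_params = TOTAL_PARAMS - embedding_params

print(f"embedding params ~ {embedding_params / 1e6:.1f}M")
print(f"implied width    ~ {implied_width:.0f}")
print(f"remaining params ~ {remaining_params / 1e6:.1f}M")
```

Under these assumptions roughly 41M parameters sit in the embedding table, leaving about 69M for the rest of the network.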
Getting Started with nanowhale
Start by installing dependencies, then run the debug training path before touching the full config. That sequence catches CUDA, tokenizer, and configuration issues early, which matters more here than on a plain transformer repo.
git clone https://github.com/huggingface/nanowhale.git
cd nanowhale
pip install -r requirements.txt
python scripts/train_pretrain.py --config configs/debug.yaml
python scripts/chat.py
After the debug run, you should have a validated environment and enough signal to decide whether to run the full configs/main_100m.yaml pretraining recipe or the SFT path in scripts/train_sft.py. If you load checkpoints manually, prefer load_state_dict over from_pretrained because the repo documents a custom-architecture quirk that can otherwise reset weights.
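Before trusting a manually loaded checkpoint, it helps to compare parameter names on both sides, since any key the checkpoint lacks would be left at its initial value. The sketch below is a hypothetical helper operating on plain lists of names; the key names are invented and are not the repository's. In PyTorch the inputs would come from `model.state_dict().keys()` and the loaded checkpoint's keys, and `load_state_dict` with `strict=True` (the default) raises an error on the same mismatches.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report parameter names present on only one side."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return {
        # In the model but absent from the checkpoint: would stay randomly initialized.
        "missing": sorted(model_set - ckpt_set),
        # In the checkpoint but unknown to the model: would be ignored or rejected.
        "unexpected": sorted(ckpt_set - model_set),
    }


# Invented key names for illustration only.
model_keys = ["embed.weight", "layers.0.attn.weight", "layers.0.hc.weight"]
ckpt_keys = ["embed.weight", "layers.0.attn.weight", "layers.0.hc_old.weight"]

report = diff_state_dict_keys(model_keys, ckpt_keys)
print(report)
```

An empty report on both sides is a reasonable precondition before running chat or evaluation against the loaded weights.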
Verdict
nanowhale is a strong option for architecture-level LLM experimentation when you want a faithful DeepSeek-V4-style stack on a tiny budget. Its main strength is that it exposes the full training and inference path in a compact repo, and its main caveats are bf16 instability and custom loading friction. Use it for research, inspection, and prototyping, not as your production assistant.



