Frequently Asked Questions

Is nanowhale free to use?

Yes. The nanowhale repository is released under the MIT license, so you can clone, modify, and reuse it in private or public projects without paying a license fee. The main cost is compute, not software access.

How does nanowhale compare to Llama 3 8B?

nanowhale is much smaller than Llama 3 8B and is built for architecture inspection rather than strong general-purpose chat quality. nanowhale is better when you want to study DeepSeek-V4-style components such as MLA and MoE, while Llama 3 8B is better when you want a more capable baseline model.

Does nanowhale support fp32 inference?

Yes. nanowhale supports fp32 inference, and the repo explicitly recommends it because bf16 can produce NaNs at this scale. With Hyper-Connections active, fp32 is the safer choice for both inference and training. If you see instability, switching to fp32 is the first fix to try.
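One way to build intuition for the precision gap is to emulate bf16 rounding in plain Python. The sketch below is an illustration of the number format only: `to_bf16` is a hypothetical helper written for this article, not part of the nanowhale repo, and it does not reproduce the NaN issue the repo documents. It shows that bf16 keeps far fewer significant bits than fp32, so a small term added to a larger one can vanish.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    # Reinterpret the value as its 32-bit float bit pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bf16 discards.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

small = 2.0 ** -9

# In emulated bf16 the small term is rounded away entirely.
print(to_bf16(1.0 + small) == 1.0)   # True
# Python's native floats keep it.
print((1.0 + small) == 1.0)          # False
```

Any computation that relies on small differences surviving, such as repeated normalization, is a plausible place for this kind of rounding to cause trouble, which is consistent with the advice to try fp32 first.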

Can nanowhale be loaded with from_pretrained?

nanowhale can be integrated with Hugging Face tooling, but the repo warns that `from_pretrained` may reinitialize some weights because of the custom architecture. Loading weights manually with `load_state_dict` is more reliable, and it is the safer path if you care about exact checkpoint fidelity.

What data was nanowhale trained on?

nanowhale was pretrained on FineWeb-Edu and then fine-tuned on SmolTalk with supervised fine-tuning (SFT). That gives you both a base model checkpoint and a chat-oriented checkpoint. The repository also includes the tokenizer used during training.

Why does nanowhale use a 129,280-token vocabulary?

nanowhale uses the DeepSeek-V4 tokenizer, which has a 129,280-token vocabulary. The cost of that choice is a large embedding table, which accounts for 37% of the model's parameters. The benefit is broader lexical coverage and tokenization behavior that matches the architecture the model is trying to reproduce.
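The figures quoted in this article are enough for a back-of-the-envelope check of that trade-off. The sketch below uses only the approximate 110M total, the 129,280-token vocabulary, and the 37% embedding share. The implied width is an estimate that assumes a single tied embedding matrix, which this article does not confirm, so treat it as a rough estimate and not as the model's actual hidden size.

```python
# Figures quoted in the article (the total is approximate).
TOTAL_PARAMS = 110_000_000
VOCAB_SIZE = 129_280
EMBED_SHARE = 0.37

embed_params = TOTAL_PARAMS * EMBED_SHARE

# ASSUMPTION: one tied embedding matrix of shape [vocab, width].
implied_width = embed_params / VOCAB_SIZE

# Parameters left for attention, experts, and everything else.
remaining = TOTAL_PARAMS - embed_params

print(f"embedding params ~ {embed_params / 1e6:.1f}M")
print(f"implied width    ~ {implied_width:.0f}")
print(f"non-embedding    ~ {remaining / 1e6:.1f}M")
```

Roughly 40M parameters go to the vocabulary and about 69M are left for the rest of the network, which is why the large tokenizer is a visible cost at this scale.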

nanowhale: Best Open-Source LLM for ML Engineers in 2026

nanowhale is a 110M-parameter DeepSeek-V4 reference build that lets ML engineers test MLA, MoE, hyper-connections, and SFT plumbing without a frontier-scale budget.

What Is nanowhale?

nanowhale is a 110M-parameter language model repo built from scratch on the DeepSeek-V4 architecture by Hugging Face contributor cpatino. It is a strong open-source choice for ML engineers, LLM researchers, and model tinkerers who need a runnable codebase for pretraining, supervised fine-tuning, tokenizer work, chat inference, and architecture debugging. The repository includes the model code, configs, tokenizer, and two published checkpoints: nanowhale-100m-base and nanowhale-100m.

The key value is that nanowhale reproduces a modern sparse-routed LLM stack at miniature scale, so you can inspect every moving part without the hardware bill of a multi-billion-parameter system. It is not a polished product layer; it is a compact research implementation that makes the training and inference path visible.

Quick Overview

Attribute	Details
Type	Open-Source Language Models
Best For	ML engineers, LLM researchers, and model tinkerers
Language/Stack	PyTorch, Hugging Face TRL/SFTTrainer, YAML configs, custom DeepSeek-V4 modules
License	MIT
GitHub Stars	N/A as of Feb 2026
Pricing	Open-Source
Last Release	N/A

Who Should Use nanowhale?

nanowhale is the right fit when you need a small but faithful reference for a modern LLM architecture. It is especially useful for people who care about implementation details, not just checkpoint downloads.

Researchers validating architecture ideas who want to test MLA, MoE, and hyper-connections without scaling to a giant training run.
ML engineers building custom training loops who need a concrete reference for SFTTrainer, config-driven runs, tokenizer loading, and checkpoint upload flows.
Indie hackers exploring local LLM experiments who want a model they can actually inspect, modify, and run on a single H100-class GPU.
Open-model maintainers who need a minimal reproducible baseline for comparing routing, attention, and token prediction choices.

Not ideal for:

Teams that need a production chat model with broad instruction-following quality and low hallucination rates.
Users who want a plug-and-play from_pretrained experience with no custom loading logic.
Inference stacks that must stay on bf16 only, since the repo documents NaN issues at this scale.

Key Features of nanowhale

DeepSeek-V4 at miniature scale — nanowhale implements the full DeepSeek-V4 feature set in a 110M-parameter model, which makes it a rare end-to-end reference for modern sparse LLM design. The repo is useful when you want to study the interaction between architecture choices instead of reading about them in a paper.
Multi-Head Latent Attention — the model uses MLA with 8 heads, 1 KV head, 96-dim head size, and a 32 RoPE + 64 NoPE split. That structure reduces the usual attention footprint while still exposing the same design trade-offs you would see in a larger DeepSeek-style model.
Mixture-of-Experts routing — nanowhale includes 4 routed experts plus 1 shared expert with top-2 routing and a SwiGLU feed-forward block. This is the part you inspect when you want to understand expert sparsity, router behavior, and where capacity is actually spent.
Hyper-Connections — the repo uses hc_mult=4 and Sinkhorn routing with 2 iterations. The authors explicitly note that this layer can overflow bf16 at small scale, which is a useful reminder that architectural tricks can have precision costs.
Multi-Token Prediction — nanowhale includes one next-token prediction layer through MTP, giving you a concrete implementation of speculative-style auxiliary prediction instead of a toy placeholder. This matters if you are comparing loss shaping and sequence modeling behavior.
Tokenizer and vocab details — the model ships with a 129,280-token DeepSeek-V4 tokenizer, which is large relative to the model size. That choice is instructive because it shows the trade-off between lexical coverage and parameter budget, and it explains why the embedding table consumes 37% of parameters.
Training and evaluation scripts — the repo includes pretraining, SFT, smoke eval, chat, upload, data prep, and parameter-counting utilities. That makes nanowhale more than a checkpoint dump; it is a full training loop you can fork into an internal research stack or pair with Open R1 for adjacent open-model experimentation.
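To make the routing feature above concrete, here is a minimal pure-Python sketch of top-2 routing over 4 routed experts plus 1 always-on shared expert. It is not the repo's implementation: the softmax scoring, the renormalization of the selected weights, and the scalar stand-in experts are all assumptions chosen for readability, and the function names are hypothetical.

```python
import math

NUM_ROUTED = 4   # routed experts (from the article)
TOP_K = 2        # experts selected per token (from the article)

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # ASSUMPTION: renormalize over the top-k
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert always runs; selected routed experts are gate-weighted."""
    out = shared_expert(x)
    for idx, weight in route_top_k(router_logits):
        out += weight * routed_experts[idx](x)
    return out

# Hypothetical scalar "experts" so the arithmetic is easy to follow.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x

logits = [0.1, 2.0, -1.0, 2.0]
print(route_top_k(logits))
print(moe_forward(1.0, logits, experts, shared))
```

Only two of the four routed experts do any work for a given token, which is the sparsity the feature list refers to. When reading the real router, check how it scores experts and whether it adds a load-balancing term, since both are left out here.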

nanowhale vs Alternatives

Tool	Best For	Key Differentiator	Pricing
nanowhale	Architecture-level LLM experimentation	Full DeepSeek-V4 feature set in a tiny 110M model	Open-Source
Open R1	Open reasoning research workflows	Research-oriented open model stack with reasoning focus	Open-Source
Llama 3 8B	General-purpose baselines	Much larger, stronger default quality, less inspectable internals	Open-Source / Open-Weights
Mistral 7B	Efficient general chat baselines	Strong general-purpose performance with simpler deployment path	Open-Weights

Pick nanowhale when you care about how the model is built, not just how it answers prompts. If you want an open research workflow around reasoning systems rather than a model implementation, Open R1 is the more relevant adjacent choice.

Choose Llama 3 8B when you need a practical baseline for app integration, because it is far more capable as a general assistant and less likely to expose edge-case training instability. Choose Mistral 7B when you want a stronger deployment target with a simpler serving story and do not need the architectural micro-details.

If your work starts with data curation and corpus management rather than model code, pair nanowhale with DataHaven and build the dataset pipeline first. That split keeps nanowhale focused on model behavior while a separate system handles file ingestion, filtering, and training-set hygiene.

How nanowhale Works

nanowhale works by wiring a compact DeepSeek-V4-style transformer around a custom configuration, tokenizer, and training pipeline. The model keeps the architecture faithful while shrinking the size to about 110M parameters, which is small enough to inspect layer-by-layer but large enough to show real routing, attention, and optimization behavior.

The core abstraction is a standard causal language model with several non-standard blocks layered in: MLA for attention efficiency, MoE for sparse expert capacity, Hyper-Connections for routed feature mixing, and MTP for auxiliary next-token prediction. The repo stores those choices in configuration_deepseek_v4.py and modeling_deepseek_v4.py, which makes the architecture readable instead of hiding it behind a monolith.
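The Hyper-Connections settings listed under Key Features (hc_mult=4, Sinkhorn with 2 iterations) are easier to picture with a small example. The sketch below is generic Sinkhorn normalization on a 4x4 matrix in pure Python. How the repo applies it inside Hyper-Connections (which matrix it normalizes, and whether it works in log space) is not described in this article, so the function and the sample logits are illustrative assumptions.

```python
import math

HC_MULT = 4          # number of parallel streams (hc_mult in the article)
SINKHORN_ITERS = 2   # iteration count quoted in the article

def sinkhorn(logits, iters=SINKHORN_ITERS):
    """Alternately normalize rows and columns of exp(logits)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        # Row step: scale each row so it sums to 1.
        for r in range(n):
            s = sum(m[r])
            m[r] = [v / s for v in m[r]]
        # Column step: scale each column so it sums to 1.
        for c in range(n):
            s = sum(m[r][c] for r in range(n))
            for r in range(n):
                m[r][c] /= s
    return m

# Hypothetical mixing logits for HC_MULT streams.
mix_logits = [
    [0.0, 0.5, -0.5, 0.1],
    [0.3, 0.0, 0.2, -0.4],
    [-0.2, 0.1, 0.0, 0.6],
    [0.4, -0.3, 0.5, 0.0],
]

mixed = sinkhorn(mix_logits)
for row in mixed:
    print([round(v, 3) for v in row], "row sum:", round(sum(row), 3))
```

After the final column step every column sums to 1 and the rows are close to 1, so the matrix is approximately doubly stochastic. A small fixed iteration count trades exactness for speed.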

The training scripts use config-driven entry points, so the same code path can run a full pretrain loop, a short debug run, or SFT on a chat corpus. This is the right design if you want reproducibility and fast experiments, but it also means the repo expects you to understand your runtime, because the authors warn that from_pretrained can reinitialize some weights and that bf16 can produce NaNs at this scale.

python scripts/train_pretrain.py --config configs/debug.yaml

That command launches a short pretraining run using the debug config, which is useful for verifying that your environment, tokenizer, and model wiring are correct before you spend compute. Expect logs, loss values, and checkpoint activity rather than a usable assistant after one run; the point is validation, not quality.

Pros and Cons of nanowhale

Pros:

Faithful architecture reference — nanowhale exposes MLA, MoE, hyper-connections, and MTP in one repo, which is rare for open model code.
Small enough to study — at ~110M parameters, it is dramatically easier to reason about than multi-billion-parameter systems.
Training pipeline included — pretraining, SFT, evaluation, chat, and upload scripts are all present, so the repo is end-to-end.
Clear performance telemetry — the repo reports 72 ms/step on a single H100 80GB with torch.compile, which gives you a concrete benchmark target.
Useful for precision debugging — the documented bf16 NaN problem makes nanowhale a practical case study for numerical stability issues.
MIT license — the code can be reused, forked, and embedded into internal research workflows without license friction.
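The 72 ms/step figure above can be turned into rough planning numbers. The sketch below is plain arithmetic. The sequences-per-step value is a hypothetical placeholder because this article does not state the batch size, and using the 2,048-token context length as the training sequence length is also an assumption.

```python
STEP_MS = 72      # reported step time on one H100 80GB with torch.compile
SEQ_LEN = 2048    # context window quoted in the article

def steps_per_hour(step_ms):
    return 3_600_000 / step_ms

def tokens_per_second(step_ms, sequences_per_step, seq_len=SEQ_LEN):
    # ASSUMPTION: every sequence in the step is packed to seq_len tokens.
    return sequences_per_step * seq_len * 1000 / step_ms

print(steps_per_hour(STEP_MS))  # 50000.0

# Hypothetical batch of 8 sequences per step; substitute your own config value.
print(round(tokens_per_second(STEP_MS, sequences_per_step=8)))
```

Replace the placeholder batch size with the value from the config you actually run before comparing against your own hardware.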

Cons:

bf16 instability is real — the repo explicitly says fp32 is safer for inference and training because hyper-connections can overflow bf16.
from_pretrained is awkward — custom architecture behavior means you may need manual load_state_dict handling instead of a clean Hugging Face one-liner.
Large vocab for a small model — the 129K embedding table consumes 37% of parameters, which reduces capacity available for core language modeling.
Short context window — 2,048 tokens is fine for experiments, but it is small for long-document workflows.
Not a production assistant — the model is a research artifact with documented limitations, not a polished general chat system.

Getting Started with nanowhale

Start by installing dependencies, then run the debug training path before touching the full config. That sequence catches CUDA, tokenizer, and configuration issues early, which matters more here than on a plain transformer repo.

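# Clone the repository and install its dependencies
git clone https://github.com/huggingface/nanowhale.git
cd nanowhale
pip install -r requirements.txt
# Smoke-test the training path with the short debug config
python scripts/train_pretrain.py --config configs/debug.yaml
# Run the chat script
python scripts/chat.py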

After the debug run, you should have a validated environment and enough signal to decide whether to run the full configs/main_100m.yaml pretraining recipe or the SFT path in scripts/train_sft.py. When you load checkpoints, prefer a manual load_state_dict over from_pretrained, because the repo documents a custom-architecture quirk that can otherwise reinitialize some weights.
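A quick way to catch silently reinitialized weights is to compare parameter names before loading. The helper below is a hypothetical sketch in plain Python, and the parameter names are invented for illustration. In practice you would pass the keys of `model.state_dict()` and the keys of the checkpoint you loaded from disk.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names as sorted lists."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    # In the model but not the checkpoint: these would keep their init values.
    missing = sorted(model_keys - checkpoint_keys)
    # In the checkpoint but not the model: these would be ignored.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected

# Hypothetical parameter names, for illustration only.
model_keys = ["embed.weight", "layers.0.attn.wq.weight", "layers.0.hc.alpha"]
ckpt_keys = ["embed.weight", "layers.0.attn.wq.weight", "layers.0.hc_alpha"]

missing, unexpected = diff_state_dict_keys(model_keys, ckpt_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

PyTorch's `load_state_dict` performs the same kind of check: with the default `strict=True` it raises an error on mismatched keys, and with `strict=False` it returns the missing and unexpected keys for inspection. An empty result on both sides is a reasonable sign that every weight came from the checkpoint.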

Verdict

nanowhale is the strongest option for architecture-level LLM experimentation when you want a faithful DeepSeek-V4-style stack on a tiny budget. Its main strength is that it exposes the full training and inference path in a compact repo, and its main caveat is bf16 instability plus custom loading friction. Use it for research, inspection, and prototyping, not as your production assistant.
