What Is kitten-tts-rs?
kitten-tts-rs is Second State's Rust implementation of KittenTTS, an open-source text-to-speech engine, adapted from KittenML/KittenTTS under the Apache-2.0 license. It provides self-contained CLI and API server binaries for high-quality voice synthesis using ONNX Runtime inference on CPU, with models from 15M to 80M parameters that take 25-80MB on disk. kitten-tts-rs is one of the best Text-to-Speech CLI Tools for AI agent developers on edge devices like Raspberry Pi, offering 24kHz output, 8 built-in voices, and a startup time of about 100ms versus about 2s of Python overhead. As of February 2026, the repo holds 212 GitHub stars and supports Linux x86_64/aarch64 and macOS aarch64.
Quick Overview
| Attribute | Details |
|---|---|
| Type | Text-to-Speech CLI Tools |
| Best For | AI agent developers on edge devices |
| Language/Stack | Rust / ONNX Runtime |
| License | Apache-2.0 |
| GitHub Stars | 212 as of Feb 2026 |
| Pricing | Open-Source |
| Last Release | N/A — latest commit b913a85 on recent date |
Who Should Use kitten-tts-rs?
- AI agent builders integrating local TTS into Rust-based agents needing offline synthesis without GPU or Python.
- Embedded systems devs deploying on Raspberry Pi or phones, where 10MB binaries plus 25MB models fit tight storage.
- Realtime audio app developers using SSE streaming via OpenAI-compatible /v1/audio/speech endpoint for low-latency voice output.
- CLI automation scripters generating speech from shell scripts with adjustable speed and phonemization via espeak-ng.
Not ideal for:
- Teams requiring 48kHz+ HiFi audio or neural vocoders like WaveGlow, as it sticks to 24kHz non-streaming synthesis per utterance.
- GPU-heavy production servers favoring TensorRT acceleration, since the default build is CPU-only and GPU backends are optional Cargo features.
- Developers locked into Python ecosystems unwilling to adopt Rust binaries or manage espeak-ng deps.
Key Features of kitten-tts-rs
- ONNX CPU Inference — Runs 15-80M parameter models at int8 quantization for nano variant (25MB), achieving real-time synthesis on x86_64 or aarch64 without NVIDIA/Apple hardware.
- Built-in Voices — Eight options (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo) selectable via CLI flags, covering diverse accents and genders from original KittenTTS weights.
- Speed Control — The --speed parameter adjusts playback rate from 0.5x to 2x, modifying inference output directly without post-processing.
- Text Preprocessing Pipeline — Handles numbers, currencies, units via espeak-ng phonemization, converting to phonemes before ONNX forward pass.
- OpenAI-Compatible API — The kitten-tts-server binary exposes /v1/audio/speech with SSE streaming support added in recent commits, mimicking OpenAI TTS endpoint for agent compatibility.
- Model Variants — nano (15M/56MB fp32 or 25MB int8), micro (40M/41MB), mini (80M/80MB), downloadable from releases with Hugging Face paths like KittenML/kitten-tts-nano-0.8.
- Cross-Platform Binaries — Pre-built for Linux and macOS targets, ~10MB size, optional GPU via Cargo features (CUDA, CoreML, DirectML, TensorRT).
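The voice list and speed range in the features above lend themselves to a small pre-flight check in a wrapper script. The following is a minimal sketch, not part of kitten-tts-rs: the helper names `is_builtin_voice` and `valid_speed` are hypothetical, and it assumes the CLI accepts the voice names exactly as spelled in the list and speeds between 0.5 and 2.

```shell
# Hypothetical helpers (not shipped with kitten-tts-rs) that validate
# arguments before they are handed to the CLI.

# Succeeds if $1 is one of the eight voice names listed above.
is_builtin_voice() {
  case "$1" in
    Bella|Jasper|Luna|Bruno|Rosie|Hugo|Kiki|Leo) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if $1 is a plain decimal number within the 0.5x-2x range.
valid_speed() {
  case "$1" in
    ''|*[!0-9.]*|*.*.*) return 1 ;;   # empty, non-numeric, or two dots
  esac
  awk -v s="$1" 'BEGIN { exit !(s + 0 >= 0.5 && s + 0 <= 2.0) }'
}
```

A wrapper could then guard the call, for example: is_builtin_voice "$voice" && valid_speed "$speed" && ./kitten-tts --model models/kitten-tts-nano-int8 --voice "$voice" --speed "$speed" "$text".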
kitten-tts-rs vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| kitten-tts-rs | Edge AI agents with Rust/OpenAI API | 100ms startup, 25MB int8 models, SSE streaming | Open-Source |
| Piper TTS | Fast embedded TTS on ARM | Piper voices, espeak-ng backend, C++ core | Open-Source |
| Coqui TTS | Custom model training | XTTSv2 multilingual, PyTorch training | Open-Source |
| Silero Models | Serverless on-device | TorchScript, 100+ languages, sub-50ms latency | Open-Source |
Piper TTS suits Raspberry Pi projects that need broader voice packs and neural vocoder options, but it lacks OpenAI API compatibility and uses more memory, at 100MB+ per voice. Coqui TTS excels for teams training custom models from scratch with the XTTS architecture, though its Python dependencies and 1GB+ environments dwarf the kitten-tts-rs footprint. Silero Models win for mobile apps with TorchScript export and a Russian/English focus, but they require the PyTorch Mobile runtime, whereas kitten-tts-rs needs only ONNX Runtime. For related AI voice tools, check OpenSwarm or Moonshine Voice.
How kitten-tts-rs Works
kitten-tts-rs ports the KittenTTS architecture to Rust, using ONNX Runtime for inference through native bindings instead of Python. The core flow: input text goes to the espeak-ng phonemizer, which converts it to IPA phonemes; an ONNX session then loads the model (e.g., nano-int8.onnx) and predicts acoustic features via transformer blocks, producing a mel-spectrogram that is decoded to 24kHz WAV via a Griffin-Lim or HiFi-GAN vocoder approximation. Server mode adds Actix-web for HTTP/SSE, streaming audio chunks from ONNX forward passes without buffering full utterances.
Models follow the KittenTTS design: the nano, micro, and mini variants scale parameters (15/40/80M) around a shared encoder-decoder, with int8 quantization for edge use. The Rust crate manages ONNX session caching, reducing load time to 100ms via lazy initialization. GPU features route to vendor runtimes: CUDA for NVIDIA, CoreML for Apple Silicon, DirectML for Windows.
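To make the size trade-off between variants concrete, here is a rough sketch of choosing a model by storage budget. The function name `pick_model` is hypothetical; the sizes are the on-disk figures quoted in this article (25MB nano int8, 41MB micro, 80MB mini), and the directory names follow the example commands in this article.

```shell
# Hypothetical helper: print the largest model directory whose quoted
# on-disk size fits within a budget given in whole megabytes.
pick_model() {
  budget_mb=$1
  if [ "$budget_mb" -ge 80 ]; then
    echo "models/kitten-tts-mini"        # 80M parameters, ~80MB
  elif [ "$budget_mb" -ge 41 ]; then
    echo "models/kitten-tts-micro"       # 40M parameters, ~41MB
  elif [ "$budget_mb" -ge 25 ]; then
    echo "models/kitten-tts-nano-int8"   # 15M parameters, ~25MB int8
  else
    echo "no model fits in ${budget_mb}MB" >&2
    return 1
  fi
}
```

Under these assumptions, a call such as ./kitten-tts --model "$(pick_model 50)" --voice Bella "Hello" would select the micro model.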
# CLI quick synthesis
./kitten-tts --model models/kitten-tts-nano-int8 --voice Bella "Hello, Rust TTS on edge."
# Server start with streaming
./kitten-tts-server --model models/kitten-tts-micro --port 8000
# Client curl to SSE endpoint
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model": "kitten-tts-micro", "input": "Streaming speech test.", "voice": "Luna"}' \
--no-buffer
The CLI command loads the specified model, phonemizes the text, runs ONNX inference, and writes the WAV to stdout or a file. The server binds Actix routes, parses the OpenAI-style JSON payload, and yields SSE audio frames for realtime playback. Expect 200-500ms end-to-end on an i7 CPU for the nano model, scaling with text length.
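Hand-written JSON in a curl command breaks as soon as the input text contains a double quote. As a minimal sketch, the helpers below build the request body with basic escaping and send it to the /v1/audio/speech endpoint. The function names and the `KITTEN_TTS_URL` variable are hypothetical, the escaping covers only backslashes and double quotes (not newlines or control characters), and the exact format of the streamed response is not specified here, so the output is simply written to stdout.

```shell
# Escape backslashes first, then double quotes, for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: speech_payload MODEL VOICE TEXT
# Prints a request body with the same three fields as the curl example.
speech_payload() {
  printf '{"model": "%s", "input": "%s", "voice": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$3")" "$(json_escape "$2")"
}

# Usage: request_speech MODEL VOICE TEXT
# POSTs the payload; KITTEN_TTS_URL is an assumed override for the base URL.
request_speech() {
  curl -sS --no-buffer -X POST \
    "${KITTEN_TTS_URL:-http://localhost:8000}/v1/audio/speech" \
    -H "Content-Type: application/json" \
    -d "$(speech_payload "$1" "$2" "$3")"
}
```

Keeping payload construction separate from the network call makes the escaping easy to check without a running server.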
Pros and Cons of kitten-tts-rs
Pros:
- 10MB standalone binaries eliminate Python venv bloat, deploying instantly on Docker or bare metal.
- ONNX int8 nano model synthesizes 10s speech in <1s on Raspberry Pi 5, verified via benchmarks in repo tests.
- SSE streaming in API supports EchoKit-like apps, pushing 24kHz chunks at 50ms intervals without dropout.
- Cross-platform builds cover aarch64 Linux/macOS, with Cargo features for CUDA/CoreML acceleration up to 5x speedup.
- Preprocessing via espeak-ng normalizes dates/currencies to phonemes, avoiding garbage audio on edge inputs.
- Apache-2.0 license with original weights enables commercial embedding in agents.
Cons:
- Requires espeak-ng install for phonemization, adding 20MB dep on minimal systems.
- Fixed 24kHz mono output lacks stereo or 48kHz upgrades, trailing Piper's neural vocoders.
- No built-in voice cloning or fine-tuning; locked to 8 KittenTTS voices without retraining pipeline.
- Server lacks auth/HTTPS out of the box, so its endpoints are exposed unprotected on whatever port --port binds.
- Recent commits (16 total) indicate early stage, with tests focused on SSE rather than perf regression.
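Throughput claims such as 10s of speech in under 1s are easier to compare across devices as a real-time factor (synthesis time divided by audio duration), where values below 1 mean faster than real time. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
# Hypothetical helper: real-time factor = synthesis seconds / audio seconds.
# Usage: real_time_factor SYNTH_SECONDS AUDIO_SECONDS
real_time_factor() {
  awk -v s="$1" -v a="$2" 'BEGIN {
    if (a + 0 <= 0) exit 1          # refuse a zero or negative duration
    printf "%.2f\n", s / a
  }'
}
```

Calling real_time_factor 1 10 prints 0.10, the upper bound implied by the Raspberry Pi 5 figure in the pros list.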
Getting Started with kitten-tts-rs
Install espeak-ng first, then grab binaries and models from releases.
# macOS espeak-ng
brew install espeak-ng
# Download Linux x86_64 binary
curl -LO https://github.com/second-state/kitten_tts_rs/releases/latest/download/kitten-tts-x86_64-linux.tar.gz
tar xzf kitten-tts-x86_64-linux.tar.gz
# Models
curl -LO https://github.com/second-state/kitten_tts_rs/releases/latest/download/kitten-tts-models.tar.gz
tar xzf kitten-tts-models.tar.gz
# Test CLI
./kitten-tts --model models/kitten-tts-nano-int8 --voice Leo --speed 1.2 "Quick Rust TTS test."
# Launch server
./kitten-tts-server --model models/kitten-tts-mini &
The extracted binaries run directly without an install step, and the models/ directory is auto-detected. The CLI outputs hello.wav immediately, or the audio can be piped to aplay for playback: ./kitten-tts ... | aplay -f S16_LE -r 24000 -c 1. The server logs "Listening on 0.0.0.0:8000" and is then ready for curl POSTs. Configuration is via flags only, with no TOML/YAML file; for custom builds, edit Cargo.toml to enable the "onnxruntime-cuda" feature.
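The aplay flags above describe 16-bit mono samples at 24kHz, which works out to 48,000 bytes per second of audio. Assuming the stream is raw PCM in that format (a WAV file would add a header, typically 44 bytes), a rough duration check could look like this; `pcm_duration_ms` is a hypothetical helper.

```shell
# Hypothetical helper: duration in milliseconds of raw S16_LE mono 24 kHz PCM.
# 24000 samples/s * 2 bytes/sample * 1 channel = 48000 bytes/s.
pcm_duration_ms() {
  bytes=$1
  echo $(( bytes * 1000 / 48000 ))
}
```

For instance, pcm_duration_ms "$(wc -c < out.raw)" reports the length of a captured stream, where out.raw is a placeholder file name.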
Verdict
kitten-tts-rs stands as the strongest Text-to-Speech CLI Tool for AI agent developers who need offline Rust TTS on edge hardware with a total footprint under 100MB. Its 100ms startup and SSE API enable realtime agents without cloud latency. Pick it over Piper for OpenAI compatibility, but add Piper if you need 100+ voice options.



