kitten-tts-rs: Best Text-to-Speech CLI Tools for AI Agent Developers in 2026

Rust binaries deliver ONNX-based TTS with 15-80M parameter models in 25-80MB, roughly 100ms startup on CPU, and an OpenAI-compatible SSE streaming API.

Pricing: Open-Source
Tech Stack: Rust / ONNX Runtime
Target: AI agent developers on edge devices
Category: Text-to-Speech CLI Tools

What Is kitten-tts-rs?

kitten-tts-rs is a Rust implementation of KittenTTS, an open-source text-to-speech engine. It is maintained by Second State and adapted from KittenML/KittenTTS under the Apache-2.0 license. It provides self-contained CLI and API server binaries for voice synthesis using ONNX Runtime inference on CPU, with models from 15M to 80M parameters that take 25-80MB on disk. kitten-tts-rs is one of the best Text-to-Speech CLI Tools for AI agent developers on edge devices like Raspberry Pi, offering 24kHz output, 8 built-in voices, and ~100ms startup versus roughly 2s of Python startup overhead. As of February 2026, the repo holds 212 GitHub stars and supports Linux x86_64/aarch64 and macOS aarch64.

Quick Overview

Attribute | Details
Type | Text-to-Speech CLI Tools
Best For | AI agent developers on edge devices
Language/Stack | Rust / ONNX Runtime
License | Apache-2.0
GitHub Stars | 212 as of Feb 2026
Pricing | Open-Source
Last Release | N/A; latest commit b913a85

Who Should Use kitten-tts-rs?

  • AI agent builders integrating local TTS into Rust-based agents needing offline synthesis without GPU or Python.
  • Embedded systems devs deploying on Raspberry Pi or phones, where 10MB binaries plus 25MB models fit tight storage.
  • Realtime audio app developers using SSE streaming via OpenAI-compatible /v1/audio/speech endpoint for low-latency voice output.
  • CLI automation scripters generating speech from shell scripts with adjustable speed and phonemization via espeak-ng.

Not ideal for:

  • Teams requiring 48kHz+ HiFi audio or neural vocoders like WaveGlow, as it sticks to 24kHz non-streaming synthesis per utterance.
  • GPU-heavy production servers that favor TensorRT acceleration, since the default build is CPU-only and GPU support is an optional Cargo feature.
  • Developers locked into Python ecosystems unwilling to adopt Rust binaries or manage espeak-ng deps.

Key Features of kitten-tts-rs

  • ONNX CPU Inference: Runs 15-80M parameter models, with int8 quantization for the nano variant (25MB), achieving real-time synthesis on x86_64 or aarch64 without NVIDIA/Apple hardware.
  • Built-in Voices: Eight options (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo) selectable via CLI flags, covering diverse accents and genders from the original KittenTTS weights.
  • Speed Control: The --speed parameter adjusts playback rate from 0.5x to 2x, modifying inference output directly without post-processing.
  • Text Preprocessing Pipeline: Handles numbers, currencies, and units via espeak-ng phonemization, converting text to phonemes before the ONNX forward pass.
  • OpenAI-Compatible API: kitten-tts-server exposes /v1/audio/speech, with SSE streaming support added in recent commits, mimicking the OpenAI TTS endpoint for agent compatibility.
  • Model Variants: nano (15M/56MB fp32 or 25MB int8), micro (40M/41MB), mini (80M/80MB), downloadable from releases with Hugging Face paths like KittenML/kitten-tts-nano-0.8.
  • Cross-Platform Binaries: Pre-built for Linux/macOS targets, ~10MB in size, with optional GPU support via Cargo features (CUDA, CoreML, DirectML, TensorRT).
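
The on-disk sizes listed under Model Variants suggest a simple way to choose a model for a storage-constrained device. The sketch below is an illustration only, not part of kitten-tts-rs: pick_model is a hypothetical helper, the thresholds are the sizes quoted above (25MB nano int8, 41MB micro, 80MB mini), and the directory names are assumed to match those used in the CLI examples.

```shell
# Hypothetical helper: print the largest model directory whose on-disk size
# fits within a storage budget given in whole megabytes.
pick_model() {
  budget_mb=$1
  if [ "$budget_mb" -ge 80 ]; then
    echo "kitten-tts-mini"        # 80M parameters, ~80MB
  elif [ "$budget_mb" -ge 41 ]; then
    echo "kitten-tts-micro"       # 40M parameters, ~41MB
  elif [ "$budget_mb" -ge 25 ]; then
    echo "kitten-tts-nano-int8"   # 15M parameters, ~25MB (int8)
  else
    echo "no model fits in ${budget_mb}MB" >&2
    return 1
  fi
}
```

For example, pick_model 50 prints kitten-tts-micro, which could then be passed as --model "models/$(pick_model 50)".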

kitten-tts-rs vs Alternatives

Tool | Best For | Key Differentiator | Pricing
kitten-tts-rs | Edge AI agents with Rust/OpenAI API | 100ms startup, 25MB int8 models, SSE streaming | Open-Source
Piper TTS | Fast embedded TTS on ARM | Piper voices, espeak-ng backend, C++ core | Open-Source
Coqui TTS | Custom model training | XTTSv2 multilingual, PyTorch training | Open-Source
Silero Models | Serverless on-device | TorchScript, 100+ languages, sub-50ms latency | Open-Source

Piper TTS suits Raspberry Pi projects that need broader voice packs and neural vocoder options, but it lacks OpenAI API compatibility and uses more memory, at 100MB+ per voice. Coqui TTS excels for teams training custom models from scratch with the XTTS architecture, though its Python dependencies and 1GB+ environments dwarf the kitten-tts-rs footprint. Silero Models win for mobile apps with TorchScript export and a Russian/English focus, but they require the PyTorch Mobile runtime, whereas kitten-tts-rs needs only ONNX Runtime.

How kitten-tts-rs Works

kitten-tts-rs ports the KittenTTS architecture to Rust, using ONNX Runtime for inference and native bindings in place of Python. The core flow: input text goes to the espeak-ng phonemizer, which converts it to IPA phonemes; an ONNX session then loads the model (e.g., nano-int8.onnx) and predicts acoustic features via transformer blocks, producing a mel-spectrogram that is decoded to 24kHz WAV via a Griffin-Lim or HiFi-GAN vocoder approximation. Server mode adds Actix-web for HTTP/SSE, streaming audio chunks from ONNX forward passes without buffering full utterances.

Models follow the KittenTTS design: the nano, micro, and mini variants scale parameters (15/40/80M) around a shared encoder-decoder, with int8 quantization for edge use. The Rust crate manages ONNX session caching, reducing load time to about 100ms via lazy initialization. GPU features route to vendor runtimes: CUDA for NVIDIA, CoreML for Apple Silicon, DirectML for Windows.

# CLI quick synthesis
./kitten-tts --model models/kitten-tts-nano-int8 --voice Bella "Hello, Rust TTS on edge."

# Server start with streaming
./kitten-tts-server --model models/kitten-tts-micro --port 8000

# Client curl to SSE endpoint
curl -X POST "http://localhost:8000/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{"model": "kitten-tts-micro", "input": "Streaming speech test.", "voice": "Luna"}' \
  --no-buffer

The CLI command loads the specified model, phonemizes the text, runs ONNX inference, and writes WAV to stdout or a file. The server binds Actix routes, parses the OpenAI-style JSON payload, and yields SSE audio frames for realtime playback. Expect 200-500ms end-to-end on an i7 CPU for the nano model, scaling with text length.
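
When the request text comes from a script rather than a hard-coded string, the JSON body needs escaping before it reaches curl. The following is a minimal sketch under stated assumptions: speech_payload and post_speech are hypothetical helper names, the server is assumed to listen on localhost:8000 as in the example above, and only backslashes and double quotes are escaped.

```shell
# Build the JSON body for POST /v1/audio/speech.
# Arguments: model name, voice name, input text.
speech_payload() {
  # Escape backslashes first, then double quotes, so the text stays valid JSON.
  # Newlines and other control characters are NOT handled in this sketch.
  escaped=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "input": "%s", "voice": "%s"}\n' "$1" "$escaped" "$2"
}

# Hypothetical wrapper: send the payload to a server assumed to be running
# on localhost:8000 and stream the response to stdout without buffering.
post_speech() {
  curl -sS --no-buffer -X POST "http://localhost:8000/v1/audio/speech" \
    -H "Content-Type: application/json" \
    -d "$(speech_payload "$@")"
}
```

Calling post_speech kitten-tts-micro Luna 'Streaming speech test.' sends the same request as the curl command above. For arbitrary input, a real JSON encoder such as jq is the safer choice.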

Pros and Cons of kitten-tts-rs

Pros:

  • 10MB standalone binaries eliminate Python venv bloat, deploying instantly on Docker or bare metal.
  • ONNX int8 nano model synthesizes 10s speech in <1s on Raspberry Pi 5, verified via benchmarks in repo tests.
  • SSE streaming in API supports EchoKit-like apps, pushing 24kHz chunks at 50ms intervals without dropout.
  • Cross-platform builds cover aarch64 Linux/macOS, with Cargo features for CUDA/CoreML acceleration up to 5x speedup.
  • Preprocessing via espeak-ng normalizes dates/currencies to phonemes, avoiding garbage audio on edge inputs.
  • Apache-2.0 license with original weights enables commercial embedding in agents.

Cons:

  • Requires espeak-ng install for phonemization, adding 20MB dep on minimal systems.
  • Fixed 24kHz mono output lacks stereo or 48kHz upgrades, trailing Piper's neural vocoders.
  • No built-in voice cloning or fine-tuning; locked to 8 KittenTTS voices without retraining pipeline.
  • Server lacks auth and HTTPS out of the box, so its endpoints are exposed unprotected on whatever port --port binds.
  • Recent commits (16 total) indicate early stage, with tests focused on SSE rather than perf regression.

Getting Started with kitten-tts-rs

Install espeak-ng first, then grab binaries and models from releases.

# Install espeak-ng (macOS)
brew install espeak-ng

# Install espeak-ng (Debian/Ubuntu)
sudo apt-get install -y espeak-ng

# Download Linux x86_64 binary
curl -LO https://github.com/second-state/kitten_tts_rs/releases/latest/download/kitten-tts-x86_64-linux.tar.gz
tar xzf kitten-tts-x86_64-linux.tar.gz

# Models
curl -LO https://github.com/second-state/kitten_tts_rs/releases/latest/download/kitten-tts-models.tar.gz
tar xzf kitten-tts-models.tar.gz

# Test CLI
./kitten-tts --model models/kitten-tts-nano-int8 --voice Leo --speed 1.2 "Quick Rust TTS test."

# Launch server
./kitten-tts-server --model models/kitten-tts-mini &

The extracted binaries run without installation, and the models/ directory is auto-detected. The CLI writes hello.wav immediately; for direct playback, pipe to aplay: ./kitten-tts ... | aplay -f S16_LE -r 24000 -c 1. The server logs "Listening on 0.0.0.0:8000" and is then ready for curl POSTs. Configuration is via flags only, with no TOML/YAML file; for custom builds, edit Cargo.toml to enable the "onnxruntime-cuda" feature.
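
The aplay flags quoted above describe 16-bit little-endian mono samples at 24kHz, which fixes the raw data rate at 24000 x 2 x 1 = 48,000 bytes per second. The sketch below only does that arithmetic; pcm_bytes and pcm_seconds are hypothetical helper names, and the figures exclude any WAV header.

```shell
# Raw PCM size for a given duration at 24kHz, 16-bit (2 bytes), mono.
pcm_bytes() {
  seconds=$1
  echo $(( seconds * 24000 * 2 * 1 ))
}

# Inverse: whole seconds of audio contained in a given number of raw PCM bytes.
pcm_seconds() {
  bytes=$1
  echo $(( bytes / (24000 * 2 * 1) ))
}
```

Ten seconds of speech is therefore about 480,000 bytes of sample data, which is useful when sizing buffers or estimating storage for generated audio on edge devices.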

Verdict

kitten-tts-rs stands as the strongest Text-to-Speech CLI Tool for AI agent developers deploying offline Rust TTS on edge hardware with under 100MB total footprint. Its 100ms startup and SSE API enable realtime agents without cloud latency. Pick it over Piper for OpenAI compatibility, but add Piper voices if you need 100+ options.
