What Is SSD?
SSD (Speculative Speculative Decoding) is an inference engine for large language models, built by tanishqkumar as an open-source reference implementation. It supports the Qwen3 and Llama3 model families, with tensor parallelism, PagedAttention, CUDA graphs, torch compilation, and prefix caching. It targets AI inference engineers optimizing throughput on H100 GPUs: by running drafting and verification in parallel on separate hardware, it aims to deliver exact inference faster than standard speculative decoding.
Quick Overview
| Attribute | Details |
|---|---|
| Type | LLM Inference Engines |
| Best For | AI inference engineers |
| Language/Stack | Python/CUDA/PyTorch |
| License | MIT |
| GitHub Stars | 546 as of Oct 2024 |
| Pricing | Open-Source |
| Last Release | N/A — latest commit Oct 2024 |
Who Should Use SSD?
- AI inference engineers on multi-GPU clusters tuning LLM serving for production workloads needing 2x+ speedups via parallel speculation.
- ML researchers benchmarking decoding algorithms on datasets like HumanEval and Alpaca, who need baselines for autoregressive, standard SD, and SSD modes.
- HPC teams with H100/A100 hardware stacks handling tensor-parallel Llama-3 70B or Qwen-3 32B models under CUDA 12.8+.
- Indie AI hackers prototyping fast local inference without vLLM overhead, focusing on exact speculative methods.
Not ideal for:
- CPU-only environments, as SSD mandates CUDA 12.8+ and GPU architectures like sm_90 (H100).
- Single-GPU low-memory setups under 80GB, due to tensor parallelism and model loading demands.
- Teams needing broad model compatibility beyond Qwen3/Llama3; the GPT and Mistral families are not yet supported.
Key Features of SSD
- Parallel Speculative Decoding: While the target model verifies one batch of draft tokens, a small draft model on separate GPUs pre-drafts continuations for the likely verification outcomes. When the actual outcome matches an anticipated branch, the next tokens are accepted immediately, so drafting adds no sequential overhead on a hit.
- Optimized Baselines: Includes autoregressive decoding and standard speculative decoding modes for fair benchmarking, all run under torch.compile for kernel fusion.
- Tensor Parallelism: Splits Llama-3 70B or Qwen-3 32B across multiple GPUs, targeting H100 clusters with sm_90 compute capability.
- PagedAttention Integration: Uses vLLM-style PagedAttention for efficient KV cache management, reducing memory fragmentation during long-sequence generation.
- CUDA Graphs and Prefix Caching: Captures repetitive kernel launches in graphs for a 20-30% latency reduction, and caches prefixes to skip recomputation on repeated prompts.
- Benchmark Suite: Built-in evaluation on HumanEval, Alpaca, and other datasets via the bench/ scripts, with an --all flag for 4-dataset averages and --numseqs to set the number of sequences per dataset.
- UV Dependency Management: Uses uv for fast environment sync on Python 3.11+, including extras for the scripts that download models from the Hugging Face Hub.
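The PagedAttention and prefix caching features listed above can be illustrated with a toy block table. This is a minimal sketch and not SSD's implementation: `ToyPagedKVCache`, `BLOCK_SIZE`, and the allocation policy are hypothetical names and choices made for illustration, and a real engine stores key/value tensors in the blocks rather than just ids.

```python
BLOCK_SIZE = 4  # tokens per KV block; illustrative value, not SSD's setting


class ToyPagedKVCache:
    """Hypothetical allocator: full blocks are shared by prefix, tails are private."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.prefix_index = {}               # prefix tokens -> physical block id

    def allocate(self, tokens):
        """Return (block_table, reused); reused counts blocks served from the cache."""
        table, reused = [], 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # A full block is identified by the entire prefix that ends in it.
            key = tuple(tokens[:end])
            if key in self.prefix_index:
                reused += 1
            else:
                self.prefix_index[key] = self.free.pop(0)
            table.append(self.prefix_index[key])
        if len(tokens) > full:
            # A partially filled tail block is never shared.
            table.append(self.free.pop(0))
        return table, reused


cache = ToyPagedKVCache(num_blocks=8)
first, hits_first = cache.allocate(list(range(10)))
second, hits_second = cache.allocate(list(range(8)) + [99, 100, 101])
print(first, hits_first)    # [0, 1, 2] 0
print(second, hits_second)  # [0, 1, 3] 2
```

The second prompt shares its first eight tokens with the first one, so it reuses two physical blocks and only allocates a new tail block. Because blocks are fixed-size and addressed through a table, a sequence never needs one contiguous memory region.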
SSD vs Alternatives
| Tool | Best For | Key Differentiator | Pricing |
|---|---|---|---|
| SSD | Parallel speculation on multi-GPU | Anticipatory branching eliminates draft overhead | Open-Source |
| vLLM | High-throughput serving | PagedAttention + continuous batching, broader models | Open-Source |
| TensorRT-LLM | NVIDIA-optimized latency | Engine compilation for single-GPU peaks | Open-Source |
| SGLang | Runtime optimizations | Zero-overhead loop + RadixAttention | Open-Source |
vLLM excels at dynamic batching for API servers, but it runs speculation sequentially, capping at roughly 1.5-2x autoregressive speed; pick it when you need Mistral or GPT-family support rather than SSD's Qwen/Llama focus. TensorRT-LLM compiles models into static engines tuned for low per-token latency on NVIDIA GPUs and offers its own speculative decoding modes, which suits latency-critical chat serving. SGLang pairs a low-overhead runtime with RadixAttention prefix reuse, making it a better fit for variable-length inference than SSD's fixed parallelism.
How SSD Works
SSD extends speculative decoding by having the draft model precompute tokens for all likely verification outcomes at once on auxiliary hardware, while the target model is still verifying the current draft. If the actual outcome matches an anticipated branch, that branch commits instantly; on a mismatch, SSD falls back to single-step generation. The branching structure echoes Borges's forking paths: the number of possible outcomes grows exponentially with speculation depth, so SSD prunes low-probability branches to keep the tree tractable.
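The flow described above can be sketched with toy models. This is an illustration under stated assumptions, not SSD's code: `draft_k`, `verify`, `anticipate`, and the two toy "models" are hypothetical, verification is greedy, the target is called once per position instead of in one batched forward pass, and the guessed outcomes are hard-coded rather than chosen from draft probabilities.

```python
def draft_k(draft_next, context, k):
    """Autoregressively draft k tokens with the small model."""
    out = []
    for _ in range(k):
        out.append(draft_next(context + out))
    return out


def verify(target_next, context, draft):
    """Greedy exact verification: keep the longest agreeing prefix, then
    append the target's own token (a correction, or a bonus if all matched)."""
    emitted = []
    for tok in draft:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)
            return emitted
        emitted.append(tok)
    emitted.append(target_next(context + emitted))
    return emitted


def anticipate(draft_next, context, outcomes, k):
    """Pre-draft k tokens for each guessed verification outcome."""
    return {tuple(o): draft_k(draft_next, context + list(o), k) for o in outcomes}


# Toy "models" over digit tokens: the target counts upward mod 10,
# the draft agrees except that it wrongly follows a 4 with 0.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


context = [3]
draft = draft_k(draft_next, context, k=3)       # [4, 0, 1]
# Two guessed outcomes: full acceptance plus a bonus token, or rejection
# after the first token with the correction 5. A real system would run
# this step concurrently with verification, on separate hardware.
guesses = [draft + [2], [4, 5]]
ready = anticipate(draft_next, context, guesses, k=3)
outcome = verify(target_next, context, draft)   # [4, 5]
next_draft = ready.get(tuple(outcome))          # cache hit: [6, 7, 8]
print(outcome, next_draft)
```

When `ready.get` returns `None`, no guess matched the real outcome and the engine has to draft again from scratch, which corresponds to the fallback path. Output stays exact either way, because only tokens the target itself would have produced are emitted.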
The core data flow runs on PyTorch tensors sharded with tensor parallelism: draft logits feed a speculation tree, and each branch is verified against the target logits in one fused forward pass. PagedAttention manages the KV caches across devices, and CUDA graphs capture the generation loop once loading and compilation are done, for steady-state throughput. torch.compile fuses small adjacent ops such as the softmax, yielding 15-25% kernel speedups on sm_90.
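To make the tensor-parallel split concrete, here is a small column-parallel matrix multiply in plain Python. It is a sketch of the general technique only: the helper names are hypothetical, the shards run in a loop rather than on separate GPUs, and the final concatenation stands in for an all-gather collective.

```python
def matmul(x, w):
    """Multiply x [n][d] by w [d][m] using nested lists."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]


def split_columns(w, parts):
    """Split the weight matrix into `parts` equal column shards."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partial = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((out[i] for out in partial), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(matmul(x, w))                     # [[1, 2, 4, 7], [3, 4, 10, 15]]
print(column_parallel_matmul(x, w, 2))  # same result from two shards
```

Each device only holds its own columns of the weight matrix, which is what lets a 70B-parameter model fit across several GPUs. The price is a communication step after the sharded layers.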
```bash
# Set up the environment (run from the cloned repo root)
uv sync
export SSD_HF_CACHE=/data/huggingface/hub
export SSD_CUDA_ARCH=9.0 # H100
# Download Llama weights, then benchmark SSD mode
python scripts/download_from_hf.py llama
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 128
```
This installs dependencies, downloads the Llama weights to the HF cache, then benchmarks SSD mode on 128 sequences per dataset. Expect a 2-5 minute warmup for graph capture and torch.compile; after that, runs reach peak tokens/sec, which depends on how predictable the dataset is.
Pros and Cons of SSD
Pros:
- Achieves up to 3x autoregressive speeds on predictable data like code (HumanEval), where most speculations are accepted.
- Parallel hardware utilization maximizes H100 cluster throughput, unlike sequential SD in vLLM.
- Reference-quality baselines enable research: toggle --method ar/sd/ssd for apples-to-apples metrics.
- Minimal deps beyond PyTorch/CUDA, with uv sync under 30s for reproducible envs.
- Exact inference guarantees no quality loss, critical for eval benchmarks.
- Prefix caching + PagedAttention handles 8k+ contexts without OOM on 80GB GPUs.
Cons:
- Limited to Qwen3/Llama3; no GPT/Mistral sharding yet, requiring model conversion.
- High setup barrier: CUDA 12.8+, Python 3.11, H100-optimal (sm_90), no CPU fallback.
- No production serving: bench-focused, lacks HTTP endpoints or dynamic batching.
- Warmup/compile times 3-5min for 70B models, unsuitable for cold-start APIs.
- Dataset-specific gains: under 1.5x on random prompts, versus up to 3x on predictable code/math data.
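The gap between predictable and random prompts follows from the standard expected-acceptance formula for speculative decoding: with k drafted tokens and a per-token acceptance probability α (assumed independent), each verification emits (1 − α^(k+1)) / (1 − α) tokens on average. The sketch below plugs that into a deliberately simple cost model. `idealized_speedup`, `draft_cost`, and `overlapped` are hypothetical, the overlapped case assumes every anticipated branch hits, and the printed numbers are illustrative rather than SSD measurements.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification with k drafted tokens and
    per-token acceptance probability alpha (assumed independent)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def idealized_speedup(alpha, k, draft_cost, overlapped):
    """Speedup over autoregressive decoding, in units of one target forward pass.

    draft_cost is one draft step relative to one target step (hypothetical knob).
    overlapped=True assumes drafting is fully hidden behind verification.
    """
    step_cost = 1.0 if overlapped else 1.0 + k * draft_cost
    return expected_tokens_per_step(alpha, k) / step_cost


for alpha in (0.3, 0.6, 0.9):
    sequential = idealized_speedup(alpha, k=4, draft_cost=0.1, overlapped=False)
    overlapped = idealized_speedup(alpha, k=4, draft_cost=0.1, overlapped=True)
    print(f"alpha={alpha:.1f}  sequential={sequential:.2f}x  overlapped={overlapped:.2f}x")
```

Low acceptance rates leave little to gain regardless of how drafting is scheduled, while high acceptance rates make the hidden drafting cost matter more. That matches the pattern of larger gains on code than on random prompts.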
Getting Started with SSD
Start on a CUDA 12.8+ machine with H100/A100 GPUs. Install uv if missing, clone the repo, and sync the virtualenv.
```bash
# Clone the repo and install uv
git clone https://github.com/tanishqkumar/ssd && cd ssd
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Create the virtualenv and check that the package imports
uv sync
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"

# Point SSD at the model cache, dataset directory, and GPU architecture
export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0 # H100; use 8.0 for A100

# Download models and datasets, then run the benchmark
python scripts/download_from_hf.py llama
HF_DATASETS_CACHE=/path/to python scripts/get_data_from_hf.py --num-samples 10000
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 64
```
These commands create a .venv, download Llama models to $SSD_HF_CACHE, process datasets into $SSD_DATASET_DIR, then run the SSD benchmark. The first run compiles kernels and builds CUDA graphs, then prints tokens/sec per method and dataset. Scale up with --all to run the full suite across HumanEval, Alpaca, and the other datasets, and monitor GPU utilization with nvidia-smi.
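For an apples-to-apples comparison, the same benchmark can be repeated once per decoding method. The helper below is a hypothetical convenience and not part of the repo: it only assembles command lines from the flags quoted in this article (`--model`, `--method` with the values `ar`, `sd`, `ssd`, and `--numseqs`), so check them against the repo before relying on them.

```python
import shlex


def build_bench_commands(model, numseqs, methods=("ar", "sd", "ssd")):
    """Return one bench.py argv list per decoding method (hypothetical helper)."""
    return [
        ["python", "-O", "bench.py", "--model", model,
         "--method", method, "--numseqs", str(numseqs)]
        for method in methods
    ]


for argv in build_bench_commands("llama-3-8b", 64):
    # Printed rather than executed; to launch a run for real, pass argv to
    # subprocess.run(argv, cwd="bench", check=True).
    print(shlex.join(argv))
```

Keeping the model and sequence count fixed across the three runs means any difference in tokens/sec comes from the decoding method alone.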
Verdict
SSD is the strongest option for AI inference engineers benchmarking parallel speculative decoding on Llama/Qwen when targeting H100 clusters under research constraints. Its anticipatory branching delivers its largest speedups on predictable data, backed by clean PyTorch baselines. The tradeoff is narrow model support and no serving layer: pair it with OpenSwarm for agentic workloads, but skip it for general-purpose APIs.



