Is SSD free to use?

SSD is open-source under the MIT license, which permits free commercial and non-commercial use as long as the copyright and license notice are retained. Download it from GitHub and run it locally on your CUDA hardware. There are no paid tiers or enterprise features.

How does SSD compare to vLLM?

SSD focuses on parallel speculative decoding for 2-4x gains on specific models like Llama3, while vLLM emphasizes continuous batching across broader architectures. SSD suits research benchmarks; vLLM fits production serving. Both use PagedAttention but differ in speculation parallelism.

What GPU architectures does SSD support?

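SSD targets sm_90 (H100), sm_80 (A100), and sm_89 (L40/RTX 4090), selected through the SSD_CUDA_ARCH environment variable (for example, 9.0 for H100 or 8.0 for A100). It requires CUDA 12.8+ for its torch.compile and CUDA graph paths. CPUs and pre-Ampere GPUs are not supported yet.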
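As a small illustration of how the architecture setting could be derived, the sketch below maps a CUDA compute capability to the dotted string form used for SSD_CUDA_ARCH in this article. The helper name `ssd_cuda_arch` is hypothetical, and the value "8.9" for sm_89 is an assumption extrapolated from the "9.0" and "8.0" examples shown here.

```python
# Compute capabilities the article lists as supported, with example GPUs.
SUPPORTED_CAPABILITIES = {
    (9, 0): "H100",
    (8, 0): "A100",
    (8, 9): "L40 / RTX 4090",  # dotted form "8.9" is an assumption
}


def ssd_cuda_arch(major: int, minor: int) -> str:
    """Return the dotted SSD_CUDA_ARCH string for a compute capability.

    Hypothetical helper: raises ValueError for capabilities the article
    does not list as supported.
    """
    if (major, minor) not in SUPPORTED_CAPABILITIES:
        raise ValueError(f"sm_{major}{minor} is not listed as supported")
    return f"{major}.{minor}"


if __name__ == "__main__":
    # With PyTorch installed, torch.cuda.get_device_capability() returns
    # the (major, minor) tuple for the current GPU.
    print(f"export SSD_CUDA_ARCH={ssd_cuda_arch(9, 0)}")
```

Failing early on an unlisted capability is a design choice for this sketch; it surfaces an unsupported GPU before any compilation starts.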

How to use SSD for benchmarking?

Run `cd bench; python -O bench.py --model llama-3-8b --method ssd --all --numseqs 128` after setup. It evaluates across HumanEval, Alpaca, and two more datasets, then prints average tokens/sec. Use --numseqs to control the sample size per dataset.
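To compare decoding methods under identical settings, the command above can be generated per method. The sketch below only builds argument lists from flags that appear in this article (`--model`, `--method`, `--numseqs`, `--all`, and the `ar`/`sd`/`ssd` method names); the helper names `bench_argv` and `sweep` are hypothetical and not part of SSD.

```python
import sys

METHODS = ("ar", "sd", "ssd")  # the --method values mentioned in this article


def bench_argv(model: str, method: str, numseqs: int,
               all_datasets: bool = False) -> list[str]:
    """Build the argv for bench.py using only flags shown in the article."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if numseqs <= 0:
        raise ValueError("numseqs must be positive")
    argv = [sys.executable, "-O", "bench.py",
            "--model", model, "--method", method,
            "--numseqs", str(numseqs)]
    if all_datasets:
        argv.append("--all")
    return argv


def sweep(model: str, numseqs: int) -> list[list[str]]:
    """One command per decoding method, for a like-for-like comparison."""
    return [bench_argv(model, m, numseqs, all_datasets=True) for m in METHODS]


if __name__ == "__main__":
    for argv in sweep("llama-3-8b", 128):
        print(" ".join(argv))
        # To execute for real: subprocess.run(argv, cwd="bench", check=True)
```

Keeping the model and --numseqs fixed across the three runs means any difference in tokens/sec comes from the decoding method alone.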

Does SSD support tensor parallelism?

Yes, SSD implements tensor parallelism for Llama-3 70B and Qwen-3 32B across multiple GPUs. Set SSD_CUDA_ARCH and ensure cluster MPI setup if needed. Combines with PagedAttention for large-scale inference.
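For readers new to the idea, tensor parallelism in its simplest form splits a weight matrix by columns, lets each device compute its slice, and concatenates the results. The sketch below shows that with plain Python lists; it illustrates the general technique only, not SSD's implementation, and all function names are hypothetical.

```python
def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m), as nested lists."""
    return [[sum(xi[t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for xi in x]


def split_columns(w, parts):
    """Split w's columns into `parts` contiguous shards, one per device."""
    m = len(w[0])
    assert m % parts == 0, "columns must divide evenly across devices"
    step = m // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each 'device' multiplies x by its shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # The concatenation stands in for an all-gather across devices.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
    print(column_parallel_matmul(x, w, parts=2))
```

Because each shard holds only a fraction of the columns, per-device weight memory shrinks with the number of devices, which is what makes 70B-class models fit.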

Can SSD handle prefix caching?

SSD includes prefix caching to reuse KV computations on repeated prompts, integrated with PagedAttention. Enable via config during LLM init. Reduces latency by 10-20% on chat-like workloads.
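The idea behind prefix caching can be sketched with a toy structure that remembers which leading token blocks have already been processed. This is an assumption-laden illustration, not SSD's data structure: the class name, the block size of 4, and the use of whole-prefix tuples as keys are all chosen for clarity (real engines typically use chained hashes).

```python
BLOCK = 4  # tokens per cached block (illustrative value)


class PrefixCache:
    """Toy prefix cache: tracks which full token blocks were already computed."""

    def __init__(self):
        self._seen = set()

    def _block_keys(self, tokens):
        # Each key covers the whole prefix up to the end of a block,
        # so equal keys imply equal prefixes.
        return [tuple(tokens[:end])
                for end in range(BLOCK, len(tokens) + 1, BLOCK)]

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV entries could be reused."""
        reused = 0
        for key in self._block_keys(tokens):
            if key not in self._seen:
                break
            reused = len(key)
        return reused

    def insert(self, tokens):
        """Record every full block of a processed prompt."""
        self._seen.update(self._block_keys(tokens))


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(8))          # shared by both requests
    cache.insert(system_prompt + [100, 101])
    print(cache.cached_prefix_len(system_prompt + [200, 201, 202, 203]))
```

In this example the second request shares two full blocks with the first, so only its final block would need fresh computation.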

Why is SSD faster than standard speculative decoding?

SSD parallelizes drafting and verification on separate hardware, speculating all verification outcomes upfront. Correct guesses return tokens immediately, eliminating sequential draft overhead. Gains peak on code/math datasets.
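A back-of-the-envelope model can make the source of the speedup concrete. The sketch below is a deliberately simplified assumption, not a measurement or SSD's own cost model: it treats one round of standard speculative decoding as drafting followed by verification, and treats an SSD hit as the two stages overlapping fully. The miss cost (a full sequential round) is also an assumption, and all numbers are illustrative time units.

```python
def sd_round_time(t_draft: float, t_verify: float, k: int) -> float:
    """Standard speculative decoding: draft k tokens, then verify, in sequence."""
    return k * t_draft + t_verify


def ssd_round_time(t_draft: float, t_verify: float, k: int,
                   hit_rate: float) -> float:
    """Toy SSD model: the next draft is prepared while verification runs.

    On a hit the draft is already available, so the round costs whichever
    of the two overlapping stages is slower. On a miss the round is assumed
    to cost the same as a standard sequential round.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    hit = max(t_verify, k * t_draft)
    miss = sd_round_time(t_draft, t_verify, k)
    return hit_rate * hit + (1.0 - hit_rate) * miss


if __name__ == "__main__":
    # Illustrative units only: 1 per drafted token, 10 per verification pass.
    for h in (0.0, 0.5, 1.0):
        print(h, ssd_round_time(1.0, 10.0, 4, h))
```

Under this model the benefit scales with the hit rate, which is consistent with gains being largest on predictable data such as code and math.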

SSD: Best LLM Inference Engines for AI Inference Engineers in 2026

SSD runs drafting and verification of speculative decoding in parallel on separate hardware, eliminating drafting overhead when anticipation matches outcomes.

What Is SSD?

SSD is a speculative speculative decoding inference engine for large language models, built by tanishqkumar as an open-source reference implementation. It supports Qwen3 and Llama3 model families with features like tensor parallelism, PagedAttention, CUDA graphs, torch compilation, and prefix caching. SSD is one of the best LLM inference engines for AI inference engineers optimizing throughput on H100 GPUs, achieving exact inference faster than standard speculative decoding by parallelizing drafting and verification across hardware.

Quick Overview

Attribute	Details
Type	LLM Inference Engines
Best For	AI inference engineers
Language/Stack	Python/CUDA/PyTorch
License	MIT
GitHub Stars	546 as of Oct 2024
Pricing	Open-Source
Last Release	N/A — latest commit Oct 2024

Who Should Use SSD?

AI inference engineers on multi-GPU clusters evaluating parallel speculation for 2x+ speedups on LLM workloads before committing to a production serving stack.
ML researchers benchmarking decoding algorithms on datasets like HumanEval and Alpaca, requiring baselines for autoregressive, standard SD, and SSD modes.
HPC teams with H100/A100 hardware stacks handling tensor-parallel Llama-3 70B or Qwen-3 32B models under CUDA 12.8+.
Indie AI hackers prototyping fast local inference without vLLM overhead, focusing on exact speculative methods.

Not ideal for:

CPU-only environments, as SSD mandates CUDA 12.8+ and GPU architectures like sm_90 (H100).
Single-GPU low-memory setups under 80GB, due to tensor parallelism and model loading demands.
Teams needing broad model compatibility beyond Qwen3/Llama3, lacking support for GPT or Mistral families yet.

Key Features of SSD

Parallel Speculative Decoding — Small draft model anticipates verification outcomes across branches on separate GPUs, enabling immediate token acceptance if correct, with zero sequential overhead.
Optimized Baselines — Includes autoregressive decoding and standard speculative decoding modes for fair benchmarking, all under torch.compile for kernel fusion.
Tensor Parallelism — Splits Llama-3 70B or Qwen-3 32B across multiple GPUs, including H100 clusters (sm_90 compute capability).
PagedAttention Integration — Uses vLLM-style PagedAttention for efficient KV cache management, reducing memory fragmentation during long-sequence generation.
CUDA Graphs and Prefix Caching — Captures repetitive kernel launches in graphs for 20-30% latency reduction; caches prefixes to skip recomputation on repeated prompts.
Benchmark Suite — Built-in evaluation on HumanEval, Alpaca, and others via the bench/ scripts, with an --all flag for 4-dataset averages and --numseqs to scale the sample size.
UV Dependency Management — Leverages uv for fast Python 3.11+ env sync, including extras for download scripts handling HF hub models.

SSD vs Alternatives

Tool	Best For	Key Differentiator	Pricing
SSD	Parallel speculation on multi-GPU	Anticipatory branching eliminates draft overhead	Open-Source
vLLM	High-throughput serving	PagedAttention + continuous batching, broader models	Open-Source
TensorRT-LLM	NVIDIA-optimized latency	Engine compilation for single-GPU peaks	Open-Source
SGLang	Runtime optimizations	Zero-overhead loop + RadixAttention	Open-Source

vLLM excels at dynamic batching for API servers but runs speculation sequentially, which this comparison puts at roughly 1.5-2x autoregressive speed; pick it when you need Mistral or GPT-family support rather than SSD's Qwen/Llama focus. TensorRT-LLM compiles models into static engines for low per-token latency on NVIDIA GPUs, which suits latency-critical chat; it ships its own speculative decoding modes rather than SSD's parallel scheme. SGLang optimizes the serving runtime with RadixAttention for up to 3x batching gains, making it a better fit for variable-length inference than SSD's fixed parallelism.

How SSD Works

SSD extends speculative decoding by having the draft model precompute tokens for all likely verification outcomes simultaneously on auxiliary hardware. The target model verifies in parallel; when the actual outcome matches an anticipated branch, its tokens commit instantly, while mismatches fall back to single-step generation. The possible outcomes form a tree of forking paths (the framing borrows from Borges) that grows exponentially with depth, so SSD prunes low-probability branches to keep speculation tractable.
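The branch-and-lookup idea described above can be sketched as a small cache keyed by verification outcome. This is a minimal illustration under stated assumptions: an outcome is modeled as a pair (number of accepted draft tokens, bonus token from the target), `draft_fn` is a stand-in for the draft model, and the function names are hypothetical rather than SSD's API.

```python
from typing import Callable, Dict, List, Tuple

# (number of accepted draft tokens, bonus token emitted by the target)
Outcome = Tuple[int, int]


def prepare_branches(
    context: List[int],
    draft: List[int],
    anticipated: List[Outcome],
    draft_fn: Callable[[List[int]], List[int]],
) -> Dict[Outcome, List[int]]:
    """While the target verifies `draft`, pre-draft one continuation per
    anticipated verification outcome."""
    cache: Dict[Outcome, List[int]] = {}
    for accepted, bonus in anticipated:
        next_context = context + draft[:accepted] + [bonus]
        cache[(accepted, bonus)] = draft_fn(next_context)
    return cache


def next_draft(cache: Dict[Outcome, List[int]],
               actual: Outcome) -> Tuple[List[int], bool]:
    """Return (pre-drafted tokens, hit flag) for the actual outcome.

    On a miss no speculation is available, so the caller falls back to
    ordinary single-step generation, as the text describes.
    """
    if actual in cache:
        return cache[actual], True
    return [], False


if __name__ == "__main__":
    # Stand-in draft model: proposes the next two integers.
    toy_draft = lambda ctx: [ctx[-1] + 1, ctx[-1] + 2]
    cache = prepare_branches([1, 2], [3, 4], [(2, 5), (1, 9)], toy_draft)
    print(next_draft(cache, (2, 5)))   # anticipated outcome: cache hit
    print(next_draft(cache, (0, 8)))   # unanticipated outcome: miss
```

The size of the `anticipated` list is where pruning matters: each extra branch costs draft compute, so only outcomes with meaningful probability are worth preparing.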

Core data flow uses PyTorch tensors split via tensor parallelism: draft logits feed into a speculation tree, verified against target logits in one fused forward pass per branch. PagedAttention manages KV caches across devices, with CUDA graphs capturing load-compile-gen loops for steady-state throughput. Torch compilation fuses ops like softmax-crossentropy, yielding 15-25% kernel speedups on sm_90.
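The KV cache bookkeeping behind PagedAttention can be illustrated with a toy block table that hands out fixed-size physical blocks on demand. This is a sketch of the general paging idea, not SSD's or vLLM's code; the class and method names are hypothetical, and no tensors are involved.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one more token; returns (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # current block full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    print([cache.append_token("seq-a") for _ in range(3)])
    cache.release("seq-a")
    print(len(cache.free))
```

Because blocks are allocated only when a sequence actually grows into them, memory is not reserved up front for the maximum sequence length, which is what limits fragmentation.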

# Clone and set up
git clone https://github.com/tanishqkumar/ssd && cd ssd
uv sync
export SSD_HF_CACHE=/data/huggingface/hub
export SSD_CUDA_ARCH=9.0  # H100
python scripts/download_from_hf.py llama
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 128

This installs dependencies, downloads the Llama models to the HF cache, then benchmarks SSD mode on 128 sequences per dataset. Expect a 2-5 minute warmup for graph capture and torch.compile; after that, runs reach steady-state tokens/sec, which is higher on more predictable datasets.
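Since the warmup can dominate a short run, throughput is best measured after discarding the first few iterations. The helper below is a generic sketch of that pattern and is not part of SSD's bench scripts; `generate` is a hypothetical zero-argument callable that returns the number of tokens it produced.

```python
import time
from typing import Callable


def steady_state_tokens_per_sec(
    generate: Callable[[], int],
    warmup_runs: int = 2,
    timed_runs: int = 5,
) -> float:
    """Time `generate` after discarding warmup runs.

    `generate` is any zero-argument callable returning a token count.
    """
    for _ in range(warmup_runs):      # absorbs compilation and graph capture
        generate()
    tokens = 0
    start = time.perf_counter()
    for _ in range(timed_runs):
        tokens += generate()
    elapsed = time.perf_counter() - start
    return tokens / max(elapsed, 1e-9)  # guard against a zero-length interval


if __name__ == "__main__":
    def fake_generate() -> int:
        time.sleep(0.01)              # stand-in for a real generation call
        return 128

    print(f"{steady_state_tokens_per_sec(fake_generate):.0f} tokens/sec")
```

When timing real GPU work, note that CUDA kernels launch asynchronously, so the callable should synchronize the device (for example with `torch.cuda.synchronize()`) before returning.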

Pros and Cons of SSD

Pros:

Achieves up to 3x autoregressive speeds on predictable data like code (HumanEval), where speculation hits consistently.
Parallel hardware utilization maximizes H100 cluster throughput, unlike sequential SD in vLLM.
Reference-quality baselines enable research: toggle --method ar/sd/ssd for apples-to-apples metrics.
Minimal deps beyond PyTorch/CUDA, with uv sync under 30s for reproducible envs.
Exact inference guarantees no quality loss, critical for eval benchmarks.
Prefix caching + PagedAttention handles 8k+ contexts without OOM on 80GB GPUs.

Cons:

Limited to Qwen3/Llama3; no GPT/Mistral sharding yet, requiring model conversion.
High setup barrier: CUDA 12.8+, Python 3.11+, H100-optimal (sm_90), no CPU fallback.
No production serving: bench-focused, lacks HTTP endpoints or dynamic batching.
Warmup/compile times 3-5min for 70B models, unsuitable for cold-start APIs.
Dataset-specific gains: <1.5x on random prompts vs. 4x on code/math.
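The exactness guarantee listed among the pros comes from the verification rule used in speculative decoding. The sketch below shows the standard speculative sampling rule from the literature: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q, where p is the target distribution and q is the draft distribution. That SSD uses exactly this rule is an assumption; the article only states that inference is exact, and the function names are hypothetical.

```python
import random
from typing import Dict


def residual(p: Dict[int, float], q: Dict[int, float]) -> Dict[int, float]:
    """Normalized max(p - q, 0): the distribution to resample from on rejection."""
    raw = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
    total = sum(raw.values())
    if total == 0.0:  # p and q identical: nothing would ever be rejected
        return dict(p)
    return {t: v / total for t, v in raw.items()}


def verify_token(token: int, p: Dict[int, float], q: Dict[int, float],
                 rng: random.Random) -> int:
    """Accept a drafted token with probability min(1, p/q), else resample.

    `token` must have been sampled from q, so q[token] > 0.
    """
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token
    r = residual(p, q)
    return rng.choices(list(r), weights=list(r.values()), k=1)[0]


if __name__ == "__main__":
    target = {0: 0.5, 1: 0.5}
    draft = {0: 0.9, 1: 0.1}
    rng = random.Random(0)
    samples = [verify_token(rng.choices([0, 1], weights=[0.9, 0.1])[0],
                            target, draft, rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # fraction of token 1 in the output
```

With this rule the emitted tokens follow the target distribution regardless of how good the draft is; a weaker draft only lowers the acceptance rate, not the output quality.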

Getting Started with SSD

Start on a CUDA 12.8+ machine with H100/A100 GPUs. Install uv if missing, clone the repo, and sync the virtualenv.

# Clone the repo and install uv
git clone https://github.com/tanishqkumar/ssd && cd ssd
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
# Create .venv, install dependencies, and check the import
uv sync
source .venv/bin/activate
python -c "from ssd import LLM; print('ok')"
# Point SSD at the model cache, dataset directory, and GPU architecture
export SSD_HF_CACHE=/path/to/huggingface/hub
export SSD_DATASET_DIR=/path/to/processed_datasets
export SSD_CUDA_ARCH=9.0  # H100; use 8.0 for A100
# Download models and datasets
python scripts/download_from_hf.py llama
HF_DATASETS_CACHE=/path/to python scripts/get_data_from_hf.py --num-samples 10000
# Run the SSD benchmark
cd bench
python -O bench.py --model llama-3-8b --method ssd --numseqs 64

These commands create a .venv, download the Llama models to $SSD_HF_CACHE, process datasets into $SSD_DATASET_DIR, then run the SSD benchmark. The initial run compiles kernels and builds CUDA graphs before printing tokens/sec per method and dataset. Add --all to run the full suite (HumanEval, Alpaca, and the others), and monitor GPU utilization with nvidia-smi.
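Because the benchmark depends on several environment variables, a quick pre-flight check can save a failed run. The sketch below looks for the three variables exported in the commands above; treating all three as required is an assumption made for illustration, and `missing_vars` is a hypothetical helper, not part of SSD.

```python
import os
from typing import List, Mapping

# Variables exported in the setup commands; treated as required here.
REQUIRED_VARS = ("SSD_HF_CACHE", "SSD_DATASET_DIR", "SSD_CUDA_ARCH")


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("set before running bench.py: " + ", ".join(missing))
    else:
        print("environment looks complete")
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to test.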

Verdict

SSD is the strongest option for AI inference engineers benchmarking parallel speculative decoding on Llama/Qwen when targeting H100 clusters under research constraints. Its anticipatory branching delivers its largest speedups on predictable data, backed by clean PyTorch baselines. The tradeoff is narrow model support and no serving layer: pair it with OpenSwarm for agentic workloads, but skip it for general-purpose APIs.

