DeepSeek open-sourced DSpark, a set of inference optimizations that cut generation latency by 60-85%, according to the paper hosted at github.com/deepseek-ai/DeepSpec. The release was picked up on Hacker News, where the thread reached 398 points and 118 comments.
Optimization: DSpark | Latency reduction: 60-85% | License: Open source | Source: DeepSeek paper
What It Is and How It Works
DSpark combines kernel-level scheduling changes with dynamic batching adjustments during autoregressive decoding. The approach targets memory-bound operations in transformer attention and feed-forward layers without altering model weights.
The optimizations apply at runtime through modified CUDA kernels and a lightweight scheduler that reorders token generation steps. No retraining or fine-tuning is required.
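A back-of-the-envelope estimate shows why decoding tends to be memory-bound. The sketch below is not taken from the DSpark paper. It assumes every weight is read from GPU memory once per decoding step, ignores KV-cache traffic and compute time, and uses an assumed bandwidth of 2.0e12 bytes/s (roughly the class of an 80 GB A100). The function name `weight_read_ms_per_token` is made up for illustration.

```python
def weight_read_ms_per_token(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Rough time spent streaming weights, amortized per generated token.

    Assumption: each decoding step reads all weights once, and that read
    is shared by every sequence in the batch.
    """
    weight_bytes = n_params * bytes_per_param
    step_seconds = weight_bytes / mem_bandwidth_bytes_per_s
    return step_seconds * 1e3 / batch_size


# Assumed figure, not from the paper: about 2 TB/s of memory bandwidth.
ASSUMED_BANDWIDTH = 2.0e12

# A 7B model in 16-bit precision (2 bytes per parameter).
single = weight_read_ms_per_token(7e9, 2, ASSUMED_BANDWIDTH)
batched = weight_read_ms_per_token(7e9, 2, ASSUMED_BANDWIDTH, batch_size=8)
print(f"batch 1: {single:.2f} ms/token, batch 8: {batched:.3f} ms/token")
```

Under these assumptions the weight read alone sets a floor on per-step time that no amount of extra compute removes, which is why batching and memory-access scheduling are the usual levers for decoding latency.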
Measured Speedups Across Models
The paper reports consistent gains on multiple model sizes. Larger models show higher relative improvements because they spend more time in memory-bound phases.
| Model Size | Baseline Latency | DSpark Latency | Latency Reduction |
|---|---|---|---|
| 7B | 42 ms/token | 16 ms/token | 62% |
| 33B | 78 ms/token | 24 ms/token | 69% |
| 70B | 131 ms/token | 39 ms/token | 70-85% |
Early testers on the HN thread confirmed similar numbers on A100 and H100 hardware when running the provided patches.
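The percentages in the latency table are reductions in ms/token, which is not the same thing as a speedup multiplier. The short sketch below recomputes both from the table's baseline and DSpark latencies; the helper names are only illustrative.

```python
# (baseline ms/token, DSpark ms/token) as listed in the latency table.
rows = {"7B": (42, 16), "33B": (78, 24), "70B": (131, 39)}


def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage of per-token latency removed."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100


def speedup_factor(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster a single token is produced."""
    return baseline_ms / optimized_ms


for model, (base, opt) in rows.items():
    reduction = latency_reduction_pct(base, opt)
    factor = speedup_factor(base, opt)
    print(f"{model}: {reduction:.1f}% lower latency, {factor:.2f}x faster per token")
```

Run against the listed latencies, the 7B and 33B rows come out at about 62% and 69%, matching the table. The 70B latencies work out to about 70%, so the upper end of that row's 70-85% range is not derivable from the two latency figures shown.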
How to Try DSpark
Clone the repository and apply the supplied kernel patches to an existing vLLM or Hugging Face Text Generation Inference deployment. The paper includes exact commit hashes and configuration flags for immediate testing.
A minimal integration requires only two additional environment variables and recompilation of the custom CUDA extensions. Pre-built wheels are not yet available.
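Before and after applying the patches, it helps to measure ms/token the same way on your own hardware. The harness below is a minimal sketch, not part of the DSpark release: `generate_fn` is a hypothetical stand-in for whatever call your serving stack exposes, and it is assumed to return the number of tokens it generated. `fake_generate` exists only so the example runs without a GPU.

```python
import statistics
import time
from typing import Callable


def ms_per_token(
    generate_fn: Callable[[str, int], int],
    prompt: str,
    max_new_tokens: int,
    warmup: int = 1,
    runs: int = 5,
) -> float:
    """Median wall-clock milliseconds per generated token.

    The timing covers the whole call, so prompt processing is included.
    """
    for _ in range(warmup):
        generate_fn(prompt, max_new_tokens)  # discard warm-up runs

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate_fn(prompt, max_new_tokens)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        samples.append(elapsed_ms / max(n_tokens, 1))  # avoid dividing by zero
    return statistics.median(samples)


def fake_generate(prompt: str, max_new_tokens: int) -> int:
    """Hypothetical stand-in: sleeps about 1 ms per requested token."""
    time.sleep(0.001 * max_new_tokens)
    return max_new_tokens


result = ms_per_token(fake_generate, "hello", max_new_tokens=5, warmup=0, runs=3)
print(f"{result:.2f} ms/token")
```

Swap `fake_generate` for a wrapper around your real client, keep the prompt and token budget fixed, and compare the median from the unpatched build against the patched one.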
Pros and Cons
- Achieves 60-85% latency reduction on standard GPU hardware without extra cost.
- Works on existing model checkpoints with no retraining.
- Open-source release allows direct inspection of the kernel changes.
- Requires recompilation of CUDA extensions for each CUDA version.
- Limited documentation on multi-node scaling beyond single-server setups.
- Current implementation targets NVIDIA GPUs only.
Alternatives and Comparisons
vLLM and TensorRT-LLM already provide strong baseline performance. DSpark layers on top of these systems rather than replacing them.
| Feature | vLLM (baseline) | TensorRT-LLM | DSpark + vLLM |
|---|---|---|---|
| Speedup vs naive | 2-3× | 3-4× | 4.5-6× |
| Code changes | None | Model export | Kernel patch |
| License | Apache 2.0 | NVIDIA | Open source |
Who Should Use This
Teams running high-volume inference on 7B-70B models benefit most. Organizations already using vLLM can adopt the patches with minimal engineering effort.
Teams without CUDA compilation experience or those deploying on non-NVIDIA hardware should wait for broader packaging.
Bottom Line
DSpark delivers the largest publicly reported single-change inference speedup for open models in 2024 while remaining compatible with existing serving stacks.
The release lowers the barrier for production deployments that previously required expensive hardware upgrades. Continued community patches will likely extend support to additional runtimes within weeks.
