LocalLLaMA Highlights a Sparse V Dequant Trick for TurboQuant in llama.cpp
Original: Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)
What the LocalLLaMA post reported
On March 27, 2026, a LocalLLaMA post pointed readers to turboquant_plus, an open-source implementation of Google's TurboQuant ideas for llama.cpp, plus a short writeup on a new kernel optimization called sparse V dequantization. The practical claim is simple: during flash attention decode, most attention weights at long context are so small that dequantizing their V values is wasted work. Instead of trying to make every dequant faster, the patch skips dequantization for positions whose attention weight falls below 1e-6.
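To make the mechanism concrete, here is a minimal sketch of the gating idea in C++. All names (`QuantVBlock`, `accumulate_v`, `attend`, the toy int8 block format) are illustrative assumptions, not the actual llama.cpp Metal kernel; only the 1e-6 threshold and the skip-before-dequant structure come from the writeup.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Threshold from the patch: positions with attention weight below this
// are skipped before any V dequantization happens.
constexpr float kSkipThreshold = 1e-6f;

// Toy int8 "quantized" V block with a per-block scale (illustrative only).
struct QuantVBlock {
    float scale;
    std::vector<int8_t> q;
};

// Dequantize-and-accumulate for one position: out += w * (scale * q).
void accumulate_v(std::vector<float>& out, float w, const QuantVBlock& v) {
    for (size_t d = 0; d < out.size(); ++d)
        out[d] += w * v.scale * static_cast<float>(v.q[d]);
}

// Decode-time attention-weighted sum over the cached positions.
// The gating conditional skips both the dequant and the FMA work
// for near-zero softmax weights.
std::vector<float> attend(const std::vector<float>& weights,
                          const std::vector<QuantVBlock>& cache,
                          size_t head_dim, size_t* skipped = nullptr) {
    std::vector<float> out(head_dim, 0.0f);
    for (size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] < kSkipThreshold) {  // the "three-line conditional"
            if (skipped) ++*skipped;
            continue;
        }
        accumulate_v(out, weights[i], cache[i]);
    }
    return out;
}
```

At 32K context, most softmax weights fall below the threshold, so most loop iterations take the early `continue` and never touch the quantized block.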
Why this is interesting
The repository argues that quantized KV cache schemes shift the bottleneck from memory capacity to dequant overhead. In the author's measurements on Apple Silicon, that overhead was large enough to drag long-context decode well below the no-dequant ceiling. The proposed fix is almost comically small: a three-line conditional in the V path of the kernel. But the reported effect is not small. In the accompanying markdown paper, the author says TurboQuant's turbo3 cache on Qwen3.5-35B-A3B moved from 47.0 tok/s to 57.7 tok/s at 32K context on an M5 Max, a 22.8% improvement. On standard q8_0 KV cache, the same idea still produced a 5% decode gain, which suggests the trick is not limited to one compression format.
Quality checks and caveats
Notably, the writeup does not present the speedup as a free lunch: it pairs the benchmark with perplexity and needle-in-a-haystack (NIAH) checks. Reported WikiText-2 perplexity stayed effectively unchanged, and single-needle retrieval improved from 7/9 to 9/9 under the sparse V setting. The explanation offered is that extremely low-weight positions may contribute more quantization noise than useful signal, so dropping them can clean up the accumulation rather than harm it.
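A back-of-the-envelope bound helps explain why quality survives. This calculation is my illustration, not from the writeup: softmax weights sum to 1, so even in the pathological case where every one of 32K cached positions falls below the 1e-6 threshold, the total attention mass dropped from the weighted sum stays under 3.3%.

```cpp
// Upper bound on the softmax mass lost to skipping: every skipped
// position carries weight below `threshold`, and weights sum to 1,
// so the dropped fraction is at most n_ctx * threshold.
inline double max_dropped_mass(int n_ctx, double threshold) {
    return n_ctx * threshold;
}
// max_dropped_mass(32768, 1e-6) ≈ 0.0328 — and in practice the skipped
// weights sit far below the threshold, so the real loss is much smaller
// than this worst case.
```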
- Model and hardware: Qwen3.5-35B-A3B on Apple M5 Max using llama.cpp Metal kernels.
- Main gain: +22.8% decode at 32K context for turbo3.
- Generality claim: the same gating idea also improved q8_0 KV decode.
This remains an early community result, not a merged upstream change. The repo notes that broader testing is still ongoing, including CUDA experiments. Still, the post is valuable because it illustrates a familiar pattern in LLM systems work: once low-level instruction tricks hit a hardware floor, the better optimization is often to eliminate work entirely. In this case, the community experiment turns attention sparsity itself into the lever.
Community source: LocalLLaMA discussion. Original materials: repo and sparse-v-dequant writeup.