LocalLLaMA Highlights a Sparse V Dequant Trick for TurboQuant in llama.cpp

Original post: Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)

LLM · Mar 27, 2026 · By Insights AI (Reddit) · 2 min read

What the LocalLLaMA post reported

On March 27, 2026, a LocalLLaMA post pointed readers to turboquant_plus, an open-source implementation of Google's TurboQuant ideas for llama.cpp, plus a short writeup on a new kernel optimization called sparse V dequantization. The practical claim is simple: during flash attention decode, most attention weights at long context are so small that dequantizing their V values is wasted work. Instead of trying to make every dequant faster, the patch skips dequantization for positions whose attention weight falls below 1e-6.
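The gating idea can be sketched in a few lines of C. This is a toy illustration, not the actual llama.cpp Metal kernel: the helper names (`dequant_v_row`, `attend_sparse_v`), the simplified int8-plus-scale layout, and the fixed head dimension are all invented for clarity. Only the 1e-6 threshold comes from the writeup.

```c
#include <assert.h>
#include <stddef.h>

#define SPARSE_V_EPS 1e-6f  /* gating threshold reported in the writeup */

/* Toy q8-style dequant: one int8 row with a per-row scale.
 * (Illustrative only; the real llama.cpp block layouts differ.) */
static void dequant_v_row(const signed char *vq, float scale,
                          float *v, size_t d) {
    for (size_t j = 0; j < d; j++)
        v[j] = scale * (float)vq[j];
}

/* Accumulate out += p[i] * dequant(V[i]) over n cached positions,
 * skipping the dequant entirely when the softmax weight p[i] is
 * negligible. Returns the number of skipped positions. */
static size_t attend_sparse_v(const float *p, const signed char *vq,
                              const float *scales, size_t n, size_t d,
                              float *out) {
    float v[8];  /* toy head dim, d <= 8 */
    size_t skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] < SPARSE_V_EPS) {  /* the entire trick: skip dead weight */
            skipped++;
            continue;
        }
        dequant_v_row(vq + i * d, scales[i], v, d);
        for (size_t j = 0; j < d; j++)
            out[j] += p[i] * v[j];
    }
    return skipped;
}
```

At long context, the claim is that the branch takes the `continue` path for the vast majority of positions, so most dequant work, the dominant cost in the author's profile, never happens.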

Why this is interesting

The repository argues that quantized KV cache schemes shift the bottleneck from memory capacity to dequant overhead. In the author's measurements on Apple Silicon, that overhead was large enough to drag long-context decode well below the no-dequant ceiling. The proposed fix is almost comically small: a three-line conditional in the V path of the kernel. But the reported effect is not small. In the accompanying markdown paper, the author says TurboQuant's turbo3 cache on Qwen3.5-35B-A3B moved from 47.0 tok/s to 57.7 tok/s at 32K context on an M5 Max, a 22.8% improvement. On standard q8_0 KV cache, the same idea still produced a 5% decode gain, which suggests the trick is not limited to one compression format.
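The headline figure checks out arithmetically. A one-line helper (hypothetical name, written here just to verify the reported numbers) confirms that 47.0 → 57.7 tok/s is a 22.8% gain:

```c
#include <assert.h>
#include <math.h>

/* Percentage throughput gain implied by before/after tok/s numbers. */
static double pct_gain(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```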

Quality checks and caveats

Notably, the writeup does not present the speedup as a free lunch: it pairs the benchmark with perplexity and NIAH checks. Reported WikiText-2 perplexity stayed effectively unchanged, and single-needle retrieval improved from 7/9 to 9/9 under the sparse V setting. The explanation offered is that extremely low-weight positions may contribute more quantization noise than useful signal, so dropping them can clean up the accumulation rather than harm it.

  • Model and hardware: Qwen3.5-35B-A3B on Apple M5 Max using llama.cpp Metal kernels.
  • Main gain: +22.8% decode at 32K context for turbo3.
  • Generality claim: the same gating idea also improved q8_0 KV decode.

This remains an early community result, not a merged upstream standard. The repo notes broader testing is still ongoing, including CUDA experiments. Still, the post is valuable because it shows a familiar pattern in LLM systems work: once low-level instruction tricks hit a hardware floor, the better optimization is often to eliminate work entirely. In this case, the community experiment turns attention sparsity itself into the lever.

Community source: LocalLLaMA discussion. Original materials: repo and sparse-v-dequant writeup.
