Reddit tracks attn-rot landing in llama.cpp as a low-cost quantization upgrade
Original: attn-rot (TurboQuant-like KV cache trick) lands in llama.cpp
A high-signal r/LocalLLaMA thread tracks llama.cpp PR #21038, merged on April 1, 2026. The change, authored by ggerganov, adds an activation-rotation approach to attention that the PR describes as a simple interpretation of ideas around TurboQuant. In practice, the code rotates the incoming Q, K, and V tensors with a normalized Hadamard matrix, performs attention in the rotated space, and rotates the output back.
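The rotate-attend-rotate-back pattern above can be sketched in a few lines. This is an illustrative toy, not the llama.cpp implementation: because a normalized Hadamard matrix H is orthonormal (H Hᵀ = I), rotating Q and K leaves the attention scores unchanged, and rotating V only re-expresses the output in a basis that a final Hᵀ undoes. All matrices and values below are made up for the demonstration.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix of size n (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(q, k, v):
    scores = matmul(q, [list(c) for c in zip(*k)])   # Q @ K^T
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)

H = hadamard(4)
Ht = [list(c) for c in zip(*H)]                      # H^T, the inverse rotation

# Toy single-query attention over two key/value rows (head_dim = 4).
q = [[0.3, -1.2, 0.8, 2.0]]
k = [[0.1, 0.4, -0.9, 1.5], [1.1, -0.2, 0.3, -0.7]]
v = [[2.0, -1.0, 0.5, 0.0], [0.2, 0.9, -0.4, 1.3]]

plain = attention(q, k, v)
# Rotate Q, K, V into Hadamard space, attend there, rotate the output back.
rotated = matmul(attention(matmul(q, H), matmul(k, H), matmul(v, H)), Ht)
assert all(abs(a - b) < 1e-9
           for ra, rb in zip(plain, rotated) for a, b in zip(ra, rb))
```

In exact arithmetic the two paths are identical; the point of the PR is what happens once the K/V tensors are quantized in between, where the rotated representation behaves better.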
The attraction is that this is intentionally conservative. The PR does not introduce a new quantization type. It aims to stay backend-agnostic and compatible with existing quantizations while improving how quantized caches behave. The author argues that rotation reduces outliers, which helps low-bit representations preserve attention quality. The LocalLLaMA post summarizes that as getting roughly “80% of the benefit” of TurboQuant with few downsides, and the benchmark table in the PR backs up the intuition: several q4 and q5 cache configurations move materially closer to F16 perplexity on Qwen and Gemma models.
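The outlier argument can be made concrete with a toy round-trip. This is a minimal sketch with invented values, not numbers from the PR's benchmarks: a single large activation forces a coarse quantization scale on the whole vector, flushing the small entries to zero, while a normalized Hadamard rotation spreads that outlier across all coordinates, shrinking the dynamic range before quantization.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix of size n (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quant_roundtrip(x, bits=4):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1                 # +/-7 levels at 4 bits
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

x = [0.5, -0.4, 0.3, -0.6, 0.2, 0.45, -0.35, 16.0]   # one large outlier
H = hadamard(8)
Ht = [list(c) for c in zip(*H)]

plain = quant_roundtrip(x)                            # quantize directly
rotated_back = matvec(Ht, quant_roundtrip(matvec(H, x)))  # rotate, quantize, undo

err = lambda a: sum((u - v) ** 2 for u, v in zip(a, x))
print(f"plain 4-bit error:   {err(plain):.3f}")       # small entries all lost
print(f"rotated 4-bit error: {err(rotated_back):.3f}")
```

Here the direct path rounds every small entry to zero because the outlier dictates the scale, while the rotated path retains markedly lower squared error. This toy uses simple round-to-nearest; the actual cache quantization formats in llama.cpp are more elaborate, but the rotation benefits them for the same reason.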
- The PR was opened on March 26, 2026 and merged on April 1, 2026.
- It changes four files with 337 additions and 26 deletions.
- The author explicitly notes that MLA is not supported and that techniques such as PolarQuant or QJL are not part of this patch.
That mix of modest scope and measurable gain is exactly why Reddit cares. The local-model community is full of experimental forks that prove a paper idea but are painful to maintain. By contrast, once a technique lands upstream in llama.cpp, it becomes part of the practical toolchain for everyday quantized inference. The comments treat that as the bigger story: not the novelty of one more optimization, but the fact that a useful compression idea may now be available without leaving the mainstream stack.
It is still early. The PR body itself says more evaluation is needed, especially beyond the published perplexity tables. But for people trying to run capable models on limited VRAM, even a simple rotation trick that preserves quality in q4 or q5 caches is a meaningful step. The Reddit thread reads as a sign that inference engineering is increasingly about shipping the lowest-friction improvements first, not waiting for the full academic package to land all at once.
References: the llama.cpp PR and the r/LocalLLaMA thread.
Related Articles
A high-scoring r/LocalLLaMA post explains TurboQuant not as a polar-coordinates trick but as random rotation before quantization. The linked arXiv paper claims near-optimal distortion rates, a residual QJL stage for inner products, and quality-neutral KV cache quantization at 3.5 bits per channel.
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
Google Research introduced TurboQuant on March 24, 2026 as a compression approach for KV cache and vector search bottlenecks. Hacker News pushed the post to 491 points and 129 comments, reflecting how central memory efficiency has become for long-context inference.