Reddit Tracks llama.cpp's attn-rot Push to Raise KV Cache Quality
Original: attn-rot (ggerganov's "TurboQuant lite") is on the cusp of getting merged into llama.cpp
A notable LocalLLaMA thread this week centered on attn-rot, ggerganov's open pull request to add activation rotation to llama.cpp for better quantization. The Reddit post repackages the benchmark tables, but the most useful details are in PR #21038 itself.
The proposal is deliberately simple: llama.cpp rotates Q, K, and V with a normalized Hadamard transform before caching, performs attention in the rotated space, then rotates the output back. Because the transform is orthonormal, it preserves dot products, so the attention math is unchanged while activation outliers are spread out, which in turn improves quantization behavior. The PR describes the change as backend-agnostic, introducing no new data types, and compatible with existing quantization formats. As of April 1, 2026, the PR is still open, MLA is not yet supported, and the LLAMA_ATTN_ROT_DISABLE environment variable can turn the rotations off.
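The two properties that make this work, dot-product preservation and outlier flattening, are easy to see in miniature. The sketch below (an illustration, not the PR's actual C++ code) applies a normalized fast Walsh-Hadamard transform to toy Q and K head vectors; the head dimension must be a power of two, as it is for the Hadamard construction:

```python
import math

def fwht_normalized(v):
    """Fast Walsh-Hadamard transform scaled by 1/sqrt(n), so the
    transform is orthonormal: lengths and dot products are preserved.
    len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy Q/K head vectors (head_dim = 8); q has one large outlier (8.0)
q = [0.1, -0.2, 8.0, 0.05, -0.3, 0.2, 0.1, -0.1]
k = [1.0, 0.3, -0.5, 2.0, 0.9, -1.1, 0.4, 0.2]

q_rot = fwht_normalized(q)
k_rot = fwht_normalized(k)

# Attention logits are identical in the rotated space...
assert abs(dot(q, k) - dot(q_rot, k_rot)) < 1e-9
# ...while the outlier's energy is spread across all components,
# shrinking the max magnitude that a low-bit quantizer must cover.
assert max(abs(x) for x in q_rot) < max(abs(x) for x in q)
```

Because quantization error scales with the range a format must represent, spreading an outlier's energy across the whole vector is exactly what lets q4_1 and q5_1 KV caches hold up better.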
The interesting part is how much quality it appears to recover at low precision. In the PR benchmarks, Qwen3 0.6B with a q5_1 KV cache drops from a perplexity of 61.6992 on master to 14.1452 in the PR, and q4_1 falls from 212.479 to 22.2816. On larger models the gains are smaller but still point in the same direction. The Reddit thread also surfaced KLD and tokens-per-second measurements suggesting that quality improves while throughput stays roughly in the same band, especially when models already use hybrid attention layouts. In a follow-up comment, ggerganov said the relative overhead is smaller for hybrid models like Qwen3.5 and at larger contexts, and that the branch looked good to merge from his side.
If this lands, it matters because llama.cpp sits at the center of local inference. A change that upgrades KV cache quantization without inventing a new format or sacrificing portability would immediately affect edge and desktop deployments. That is why the Reddit excitement is justified: this is the kind of low-level inference improvement that quietly widens what users can run locally.
Related Articles
Google Research introduced TurboQuant on March 24, 2026 as a compression approach for KV cache and vector search bottlenecks. Hacker News pushed the post to 491 points and 129 comments, reflecting how central memory efficiency has become for long-context inference.
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.