Reddit Tracks llama.cpp's attn-rot Push to Raise KV Cache Quality
Original: attn-rot (ggerganov's "TurboQuant lite") is on the cusp of getting merged into llama.cpp
A notable LocalLLaMA thread this week centered on attn-rot, ggerganov's open pull request to add activation rotation to llama.cpp for better quantization. The Reddit post packages benchmark tables, but the most useful details are in PR #21038 itself.
The proposal is deliberately simple. llama.cpp rotates Q, K, and V with a normalized Hadamard transform, so that K and V are stored in the cache in rotated form, performs attention in the rotated space, then rotates the output back. Because the normalized transform is orthogonal, it preserves dot products, so the attention scores are unchanged; at the same time it spreads large outlier values across many dimensions, which makes the cached tensors easier to quantize. The PR says it is backend-agnostic, introduces no new data types, and can work with existing quantization formats. As of April 1, 2026, the PR is still open, MLA is not supported, and an environment variable LLAMA_ATTN_ROT_DISABLE exists for disabling rotations.
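The rotation idea can be illustrated with a small standalone sketch. This is not code from the PR: the function names (`hadamard_rotate`, `dot`), the toy head dimension of 8, and the sample vectors are all assumptions made for illustration. It only demonstrates the two properties the paragraph relies on, namely that a normalized Hadamard transform keeps dot products intact and flattens a single large outlier, and that rotating the attention output a second time recovers the unrotated result.

```python
import math

def hadamard_rotate(x):
    """Normalized fast Walsh-Hadamard transform (hypothetical helper).

    len(x) must be a power of two. The normalized transform is symmetric
    and orthogonal, so applying it twice returns the original vector.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Toy query/key vectors (made-up values); q has one large outlier.
q = [0.1, -0.2, 8.0, 0.05, 0.3, -0.1, 0.2, 0.15]
k = [0.2, 0.1, -0.3, 0.4, -0.05, 0.25, 0.1, -0.2]

q_rot = hadamard_rotate(q)
k_rot = hadamard_rotate(k)

# 1) The attention score q.k is the same in the rotated space.
score_plain = dot(q, k)
score_rot = dot(q_rot, k_rot)

# 2) The outlier is spread out: the largest magnitude shrinks.
max_plain = max(abs(v) for v in q)
max_rot = max(abs(v) for v in q_rot)

# 3) Attention output: a weighted sum of rotated values, rotated back,
#    matches the weighted sum of the original values.
values = [
    [1.0, 0.0, -2.0, 0.5, 0.0, 3.0, -1.0, 0.25],
    [0.5, 1.5, 0.0, -0.5, 2.0, 0.0, 1.0, -0.75],
]
weights = [0.3, 0.7]  # stand-in for softmax probabilities

out_plain = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(8)]
values_rot = [hadamard_rotate(v) for v in values]
out_rot = [sum(w * v[d] for w, v in zip(weights, values_rot)) for d in range(8)]
out_back = hadamard_rotate(out_rot)

print("scores match:", math.isclose(score_plain, score_rot, rel_tol=1e-9))
print("max |q| before/after rotation:", max_plain, max_rot)
print("output recovered:",
      all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(out_plain, out_back)))
```

The sketch stops short of quantizing anything; it only shows why a quantizer that sizes its scale from the largest value in a block has less dynamic range to cover after rotation.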
The interesting part is how much quality it appears to recover at low precision. In the PR benchmarks, Qwen3 0.6B with q5_1 KV cache drops from a perplexity of 61.6992 on master to 14.1452 in the PR (about 4.4x lower), and q4_1 falls from 212.479 to 22.2816 (about 9.5x lower). On larger models the gains are smaller but still directionally positive. The Reddit thread also surfaced KLD and tokens-per-second measurements suggesting that quality improves while throughput stays roughly in the same band, especially when models already use hybrid attention layouts. In a follow-up comment, ggerganov said the relative overhead is smaller for hybrid models like Qwen3.5 and at larger contexts, and that the branch looked good to merge from his side.
If this lands, it matters because llama.cpp sits at the center of local inference. A change that upgrades KV cache quantization without inventing a new format or sacrificing portability would immediately affect edge and desktop deployments. That is why the Reddit excitement is justified: this is the kind of low-level inference improvement that quietly widens what users can run locally.