Reddit Tracks llama.cpp's attn-rot Push to Raise KV Cache Quality
Original: attn-rot (ggerganov's "TurboQuant lite") is on the cusp of getting merged into llama.cpp
A notable LocalLLaMA thread this week centered on attn-rot, ggerganov's open pull request to add activation rotation to llama.cpp for better quantization. The Reddit post repackages the benchmark tables, but the most useful details are in PR #21038 itself.
The proposal is deliberately simple: llama.cpp rotates Q, K, and V with a normalized Hadamard transform before caching, performs attention in the rotated space, then rotates the output back. Because the transform is orthonormal, it preserves dot products, so the attention math is unchanged while activation outliers are spread out, which in turn improves quantization behavior. The PR describes the change as backend-agnostic, introducing no new data types, and compatible with existing quantization formats. As of April 1, 2026, the PR is still open, MLA is not yet supported, and the LLAMA_ATTN_ROT_DISABLE environment variable can turn the rotations off.
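The two properties that make this work, dot-product preservation and outlier flattening, are easy to see in miniature. The sketch below (an illustration, not the PR's actual C++ code) applies a normalized fast Walsh-Hadamard transform to toy Q and K head vectors; the head dimension must be a power of two, as it is for the Hadamard construction:

```python
import math

def fwht_normalized(v):
    """Fast Walsh-Hadamard transform scaled by 1/sqrt(n), so the
    transform is orthonormal: lengths and dot products are preserved.
    len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy Q/K head vectors (head_dim = 8); q has one large outlier (8.0)
q = [0.1, -0.2, 8.0, 0.05, -0.3, 0.2, 0.1, -0.1]
k = [1.0, 0.3, -0.5, 2.0, 0.9, -1.1, 0.4, 0.2]

q_rot = fwht_normalized(q)
k_rot = fwht_normalized(k)

# Attention logits are identical in the rotated space...
assert abs(dot(q, k) - dot(q_rot, k_rot)) < 1e-9
# ...while the outlier's energy is spread across all components,
# shrinking the max magnitude that a low-bit quantizer must cover.
assert max(abs(x) for x in q_rot) < max(abs(x) for x in q)
```

Because quantization error scales with the range a format must represent, spreading an outlier's energy across the whole vector is exactly what lets q4_1 and q5_1 KV caches hold up better.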
The interesting part is how much quality it appears to recover at low precision. In the PR benchmarks, Qwen3 0.6B with a q5_1 KV cache drops from a perplexity of 61.6992 on master to 14.1452 in the PR, and q4_1 falls from 212.479 to 22.2816. On larger models the gains are smaller but still point in the same direction. The Reddit thread also surfaced KLD and tokens-per-second measurements suggesting that quality improves while throughput stays roughly in the same band, especially when models already use hybrid attention layouts. In a follow-up comment, ggerganov said the relative overhead is smaller for hybrid models like Qwen3.5 and at larger contexts, and that the branch looked good to merge from his side.
If this lands, it matters because llama.cpp sits at the center of local inference. A change that upgrades KV cache quantization without inventing a new format or sacrificing portability would immediately affect edge and desktop deployments. That is why the Reddit excitement is justified: this is the kind of low-level inference improvement that quietly widens what users can run locally.
Related Articles
Google Research introduced TurboQuant on March 24, 2026 as a compression approach for KV cache and vector search bottlenecks. Hacker News pushed the post to 491 points and 129 comments, reflecting how central memory efficiency has become for long-context inference.
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.