Reddit Tracks llama.cpp's attn-rot Push to Raise KV Cache Quality

Original: attn-rot (ggerganov's "TurboQuant lite") is on the cusp of being merged into llama.cpp

LLM · Apr 1, 2026 · By Insights AI (Reddit) · 2 min read

A notable LocalLLaMA thread this week centered on attn-rot, ggerganov's open pull request to add activation rotation to llama.cpp for better quantization. The Reddit post packages benchmark tables, but the most useful details are in PR #21038 itself.

The proposal is deliberately simple. llama.cpp rotates Q, K, and V with a normalized Hadamard transform before caching, performs attention in the rotated space, then rotates the output back. Because the transform preserves dot products, the attention math still works while outliers are reduced, which in turn improves quantization behavior. The PR says it is backend-agnostic, introduces no new data types, and can work with existing quantization formats. As of April 1, 2026, the PR is still open, MLA is not supported, and an environment variable LLAMA_ATTN_ROT_DISABLE exists for disabling rotations.
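The dot-product-preserving property is easy to check directly. Below is a toy sketch (not the PR's code) of a normalized fast Walsh-Hadamard transform: because the normalized Hadamard matrix is orthonormal and its own inverse, rotating Q and K leaves every q·k unchanged while spreading any single outlier channel across all dimensions.

```python
import numpy as np

def hadamard_rotate(x):
    """Normalized fast Walsh-Hadamard transform over the last axis.

    The normalized Hadamard matrix is orthonormal and its own inverse,
    so dot products are preserved and the same call undoes the rotation.
    The last-axis length (e.g. an attention head dim) must be a power of two.
    """
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # one butterfly stage: sums and differences of adjacent blocks
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

rng = np.random.default_rng(0)
q = rng.normal(size=64)
k = rng.normal(size=64)
k[7] = 120.0                                # a large outlier channel

qr, kr = hadamard_rotate(q), hadamard_rotate(k)
print(np.allclose(qr @ kr, q @ k))          # dot product preserved
print(np.abs(kr).max() < np.abs(k).max())   # outlier magnitude reduced
```

Because the transform is an involution, rotating the attention output back needs no separate inverse kernel, which is part of why the approach stays backend-agnostic.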

The interesting part is how much quality it appears to recover at low precision. In the PR benchmarks, Qwen3 0.6B with q5_1 KV cache drops from a perplexity of 61.6992 on master to 14.1452 in the PR, and q4_1 falls from 212.479 to 22.2816. On larger models the gains are smaller but still directionally positive. The Reddit thread also surfaced KLD and tokens-per-second measurements suggesting that quality improves while throughput stays roughly in the same band, especially when models already use hybrid attention layouts. In a follow-up comment, ggerganov said the relative overhead is smaller for hybrid models like Qwen3.5 and at larger contexts, and that the branch looked good to merge from his side.
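Why outlier suppression helps so much at 4- and 5-bit precision can be seen with a toy symmetric quantizer; this is an illustration, not the PR's q4_1/q5_1 kernels. A single large channel forces a coarse scale over the whole vector, while rotating first spreads that energy so the quantization grid is finer for every element.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction, n a power
    of two. Orthonormal: H @ H.T == I, so rotation preserves error norms."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, bits=4):
    """Toy symmetric round-to-nearest quantizer with one per-vector scale.
    One outlier inflates max-abs, and thus the step size, for everything."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
k = rng.normal(size=64)
k[3] = 120.0                        # outlier channel

H = hadamard(64)
err_plain = np.linalg.norm(fake_quant(k) - k)
# quantize in the rotated space, then rotate back
err_rot = np.linalg.norm(H.T @ fake_quant(H @ k) - k)
print(f"plain: {err_plain:.2f}  rotated: {err_rot:.2f}")
```

In this toy setup the rotated round-trip error is far smaller, which mirrors the direction of the perplexity drops in the PR's tables, though the real kernels use per-block scales and more elaborate formats.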

If this lands, it matters because llama.cpp sits at the center of local inference. A change that upgrades KV cache quantization without inventing a new format or sacrificing portability would immediately affect edge and desktop deployments. That is why the Reddit excitement is justified: this is the kind of low-level inference improvement that quietly widens what users can run locally.


© 2026 Insights. All rights reserved.