LocalLLaMA Jumps on a KV-Cache Benchmark That Breaks the "q8_0 Is Basically Free" Myth
Original: Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Why this benchmark mattered to Reddit
This post landed because it challenged a piece of local-LLM common sense that many users had stopped questioning: q8_0 KV cache is practically lossless. The Substack benchmark turns that into data instead of folklore, and the Reddit thread immediately recognized the value. By crawl time the post had 324 points and 58 comments. That is a strong signal for a methodology-heavy LocalLLaMA submission. The community response was not “interesting chart.” It was closer to “this changes how people should size memory tradeoffs on real deployments.”
What the benchmark actually measured
The methodology is clean. The author loaded the same BF16 GGUF three times on the same machine, changing only the KV-cache precision among f16, q8_0, and q4_0. The tests used roughly 250,000 tokens across six categories, with KL divergence computed token by token against the f16-cache reference. The post also notes that llama.cpp’s recently added TurboQuant-inspired attention rotation was active. That matters because it means the results aim to describe current practical behavior, not an outdated baseline. The headline result is simple but uncomfortable for anyone who had treated q8_0 as a safe default: it is not uniformly safe.
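The post links the write-up rather than a harness, but the core measurement is straightforward to restate. Below is a minimal NumPy sketch, assuming you have captured the full next-token logit vector at each position from the f16-cache run and from a quantized-cache run of the same model on the same inputs; the function name and setup are illustrative, not taken from the author's code.

```python
import numpy as np

def token_kl(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """KL(P_ref || Q_test) for one token position, in nats.

    ref_logits come from the f16-KV-cache run, test_logits from the
    q8_0 or q4_0 run of the same model on the same prompt prefix.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    ref = ref_logits - ref_logits.max()
    test = test_logits - test_logits.max()
    log_p = ref - np.log(np.exp(ref).sum())    # log of reference distribution P
    log_q = test - np.log(np.exp(test).sum())  # log of test distribution Q
    p = np.exp(log_p)
    return float(np.sum(p * (log_p - log_q)))

# Averaging token_kl over the ~250,000 token positions yields the kind of
# per-model summary numbers the post reports.
```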
Where the damage shows up
The sharpest numbers are on the Gemma side. The post reports Gemma 31B at KL 0.108 with q8_0 cache, while Gemma 4 26B A4B jumps to 0.377. At q4_0, the latter reaches KL 1.088 with only 68.0% top-1 agreement against the f16 reference. Qwen behaves very differently. The benchmark says both tested Qwen models stay below KL 0.04 at q8_0, and even q4_0 cache remains in a more usable 0.087–0.117 range overall, though long-document behavior degrades more sharply. That difference is what gave the thread legs. The post is not arguing that one cache format is globally good or bad. It is showing that model architecture and cache sensitivity interact in ways that make a universal rule misleading.
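To get a feel for those magnitudes, here is a toy two-token example (purely illustrative, not data from the benchmark): a modest shift in the top token's probability lands in the ~0.1 region where the q8_0 results sit, while a shift large enough to flip the greedy pick pushes KL toward 1.

```python
import math

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy next-token distributions, purely illustrative:
print(kl([0.9, 0.1], [0.7, 0.3]))   # ~0.12: top token still agrees
print(kl([0.9, 0.1], [0.3, 0.7]))   # ~0.79: greedy pick would change
```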
What the community added
The Reddit comments pushed in useful directions. One high-signal reply suggested Gemma’s degradation may be tied to quantizing the sliding-window attention (SWA) cache along with everything else, and asked what happens if that part stays at higher precision. Another commenter clarified the implementation history around attention rotation in llama.cpp, pushing back on simplistic “inspired by X” storytelling and pointing to earlier discussion in the project. In other words, the community did what LocalLLaMA does best when it is at full quality: it treated a benchmark post as the start of implementation analysis, not the end. That makes this more than a graph dump. It becomes an operational note for anyone making VRAM-versus-quality tradeoffs on local inference stacks.
Sources: Localbench benchmark post · Reddit discussion
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.