LocalLLaMA Spots a Quantization Trap: Gemma 4 Breaks Sooner Than Qwen 3.6
Original: Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
The familiar rule of thumb took a hit
r/LocalLLaMA reacted to this benchmark because it cuts against a piece of folk wisdom many local users rely on: q8_0 KV cache is supposed to be close enough to lossless that most people stop worrying about it. The linked LocalBench write-up suggests that rule is far less universal than many users assumed, and that model-specific sensitivity can be severe.
The test setup is unusually useful because it isolates one variable. The author loads the same BF16 GGUF three times on the same machine and changes only the KV cache precision: f16, q8_0, and q4_0. The evaluation covers about 250,000 tokens across six categories and measures token-by-token KL divergence between the top-40 log-probability distributions of the baseline and quantized-cache runs. In the article’s framing, q8_0 halves KV cache memory and q4_0 quarters it. The results diverge sharply by model. Gemma 31B reaches KL 0.108 at q8_0, while Gemma 4 26B A4B jumps to 0.377 at q8_0 and 1.088 at q4_0. By contrast, both Qwen 3.6 models stay below 0.04 at q8_0, and even their q4_0 numbers remain in a more survivable 0.087-0.117 range.
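For readers who want to reproduce the idea rather than the exact harness, the core measurement is easy to sketch. The snippet below is a minimal illustration, not the LocalBench code: it assumes you already have full-vocabulary log-probabilities for the same positions from an f16-cache run and a quantized-cache run, restricts both to the baseline's top 40 tokens, renormalizes, and averages the per-token KL. The top-k and renormalization conventions of the original write-up are assumptions here, and all function names are illustrative.

```python
import numpy as np

def topk_kl(baseline_logprobs: np.ndarray, quantized_logprobs: np.ndarray, k: int = 40) -> float:
    """KL(baseline || quantized) restricted to the baseline's top-k tokens at one position."""
    top = np.argsort(baseline_logprobs)[-k:]        # token ids of the baseline's top-k
    p = np.exp(baseline_logprobs[top])
    q = np.exp(quantized_logprobs[top])
    p, q = p / p.sum(), q / q.sum()                 # renormalize over the shared top-k set
    return float(np.sum(p * np.log(p / q)))

def mean_topk_kl(baseline_positions, quantized_positions, k: int = 40) -> float:
    """Average per-token KL over matched positions from the two runs."""
    return float(np.mean([topk_kl(p, q, k)
                          for p, q in zip(baseline_positions, quantized_positions)]))
```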
What the subreddit added
The comment section did more than cheer the table. The top reaction speculated that Gemma’s degradation may be tied to the decision to keep quantizing the SWA cache, and asked how much that choice affects real downstream matching and task behavior. Other commenters wanted the same measurements repeated at much longer context lengths, asking whether the 30k-ish setup understates damage at 100k or 200k. A few readers also raised methodology questions about which tokens were included in the KL calculation, noting that Gemma’s token distributions may become especially chaotic outside assistant turns. That is exactly why the post felt high-signal to the community: it turns cache quantization from vague lore into something measurable and falsifiable.
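That token-selection question is simple to make concrete. A hedged sketch, assuming you already have per-token KL values and know which positions fall inside assistant turns (both arrays below are illustrative, not from the benchmark): restrict the average to that mask and compare it with the unmasked average, which is exactly the sensitivity commenters were asking about.

```python
import numpy as np

# Hypothetical per-token KL values and an assistant-turn mask for the same positions.
per_token_kl = np.array([0.02, 0.91, 0.03, 0.04, 0.75])      # illustrative numbers only
assistant_mask = np.array([True, False, True, True, False])  # True = assistant-turn token

print(f"KL over all tokens:           {per_token_kl.mean():.3f}")
print(f"KL over assistant turns only: {per_token_kl[assistant_mask].mean():.3f}")
```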
Why it matters
Local inference users often treat cache precision and context length as generic knobs that behave the same across families. This benchmark says otherwise. The same q8_0 choice can be nearly harmless on one model and much more damaging on another, and category-level damage can concentrate differently across coding, tool use, science, or long documents. In practical terms, local optimization now looks less like “always lower precision until it hurts” and more like model-specific workload tuning. That is why the LocalLLaMA thread traveled: not because it crowned a winner, but because it showed that the hidden cost of memory savings is highly uneven.
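If you want to run that kind of model-specific check on your own workload, one way is to sweep KV cache precision with llama.cpp's server. The sketch below assumes a build that exposes the --cache-type-k and --cache-type-v flags (flag names and defaults can shift between versions, and quantized V caches may also require flash attention to be enabled); the model path and port are placeholders.

```python
import subprocess  # needed only if you uncomment the launch line below

MODEL = "model-bf16.gguf"  # placeholder path to the GGUF under test

for cache_type in ("f16", "q8_0", "q4_0"):
    cmd = [
        "llama-server",
        "-m", MODEL,
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,  # quantized V cache may need flash attention enabled
        "--port", "8080",
    ]
    print("launching:", " ".join(cmd))
    # subprocess.run(cmd)  # start the server, run the same prompt set, compare outputs
```

Pair each setting with the same prompt set and whatever comparison you trust, such as exact-match rate or KL against the f16 run, so the precision choice reflects your own workload rather than a headline number.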
Source: LocalBench article · r/LocalLLaMA thread
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
r/LocalLLaMA is highlighting the merge of llama.cpp PR #21038, which applies a simple Hadamard-based rotation to Q, K, and V in attention as a lightweight path toward TurboQuant-like gains. The appeal is that it improves low-bit cache behavior without introducing a brand-new quantization format.
LocalLLaMA reacted because the post did not just tweak a benchmark table. It went after a widely repeated local-inference assumption and showed that the answer changes sharply by model family, especially for Gemma. By crawl time on April 25, 2026, the thread had 324 points and 58 comments.