LocalLLaMA Spots a Quantization Trap: Gemma 4 Breaks Sooner Than Qwen 3.6

Original: Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results

LLM · Apr 26, 2026 · By Insights AI (Reddit) · 2 min read

The familiar rule of thumb took a hit

r/LocalLLaMA reacted to this benchmark because it cuts against a piece of folk wisdom many local users rely on: q8_0 KV cache is supposed to be close enough to lossless that most people stop worrying about it. The linked LocalBench write-up suggests that rule is far less universal than many users assumed, and that model-specific sensitivity can be severe.

The test setup is unusually useful because it isolates one variable. The author loads the same BF16 GGUF three times on the same machine and changes only the KV cache precision across f16, q8_0, and q4_0. The evaluation covers about 250,000 tokens across six categories and measures token-by-token KL divergence between the top-40 log-probability distributions of the baseline and quantized-cache runs. In the article’s framing, q8_0 halves KV cache memory and q4_0 quarters it.

The results diverge sharply by model. Gemma 31B reaches KL 0.108 at q8_0, while Gemma 4 26B A4B jumps to 0.377 at q8_0 and 1.088 at q4_0. By contrast, both Qwen 3.6 models stay below 0.04 at q8_0, and even their q4_0 numbers remain in a more survivable 0.087–0.117 range.
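The write-up does not publish its evaluation code, but the core measurement is easy to sketch. Below is a minimal Python illustration of top-k KL divergence between two runs, assuming you already have full-vocabulary log-probability vectors per position from the f16-cache baseline and a quantized-cache run. The choice of the baseline's top 40 tokens as the support, the renormalization step, and the function names are assumptions for illustration, not the author's implementation.

```python
import numpy as np

def topk_kl(baseline_logprobs: np.ndarray,
            quantized_logprobs: np.ndarray,
            k: int = 40) -> float:
    """KL(P || Q) over the baseline's top-k tokens at one position.

    Both inputs are full-vocabulary log-probability vectors for the
    same position; the support is the baseline run's top-k token ids.
    """
    top = np.argsort(baseline_logprobs)[-k:]   # baseline's top-k token ids
    p = np.exp(baseline_logprobs[top])
    q = np.exp(quantized_logprobs[top])
    q = np.clip(q, 1e-12, None)                # guard against zero mass
    p /= p.sum()                               # renormalize over the shared support
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Corpus-level number, averaging over every evaluated position:
# mean_kl = np.mean([topk_kl(b, q)
#                    for b, q in zip(baseline_positions, quantized_positions)])
```

Using the baseline as P follows the usual convention: the score measures how far the quantized-cache run's next-token distribution drifts from the full-precision reference.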

What the subreddit added

The comment section did more than cheer the table. The top reaction speculated that Gemma’s degradation may be tied to the decision to keep quantizing the sliding-window attention (SWA) cache, and asked how much that choice affects real downstream matching and task behavior. Other commenters wanted the same measurements repeated at much longer context lengths, asking whether the 30k-ish setup understates damage at 100k or 200k. A few readers also raised methodology questions about which tokens were included in the KL calculation, noting that Gemma’s token distributions may become especially chaotic outside assistant turns. That is exactly why the post felt high-signal to the community: it turns cache quantization from vague lore into something measurable and falsifiable.
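To make the token-inclusion question concrete, here is a hypothetical filter over per-position KL values. The token_roles annotation and its labels are invented for illustration; whether the actual benchmark included non-assistant tokens is exactly what commenters asked the author to clarify.

```python
import numpy as np

def masked_mean_kl(kl_per_token: np.ndarray, token_roles: np.ndarray) -> float:
    """Average KL restricted to assistant-turn positions.

    kl_per_token: per-position KL values from a comparison like the one above.
    token_roles:  per-position labels such as "system", "user", "assistant"
                  (an assumed annotation, not part of the original benchmark).
    """
    mask = token_roles == "assistant"
    return float(kl_per_token[mask].mean())
```

Comparing the unmasked mean against this masked mean would show directly how much of Gemma's reported divergence comes from tokens outside assistant turns.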

Why it matters

Local inference users often treat cache precision and context length as generic knobs that behave the same across families. This benchmark says otherwise. The same q8_0 choice can be nearly harmless on one model and much more damaging on another, and category-level damage can concentrate differently across coding, tool use, science, or long documents. In practical terms, local optimization now looks less like “always lower precision until it hurts” and more like model-specific workload tuning. That is why the LocalLLaMA thread traveled: not because it crowned a winner, but because it showed that the hidden cost of memory savings is highly uneven.
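In practice, that tuning can be as mundane as a per-model cache policy. The sketch below launches llama.cpp's llama-server with its --cache-type-k / --cache-type-v flags (f16, q8_0, and q4_0 are valid values); the model filenames and the policy table are hypothetical, and the benchmark itself makes no configuration recommendations.

```python
import subprocess

# Hypothetical per-model cache policy informed by the benchmark's numbers.
CACHE_POLICY = {
    "qwen3.6-32b.gguf": "q8_0",    # low measured KL at q8_0
    "gemma4-26b-a4b.gguf": "f16",  # keep full precision where q8_0 degraded
}

def launch(model_path: str) -> subprocess.Popen:
    cache_type = CACHE_POLICY.get(model_path, "f16")  # default to lossless
    # Note: llama.cpp generally requires flash attention to be enabled
    # before the V cache can be quantized.
    return subprocess.Popen([
        "llama-server", "-m", model_path,
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
    ])
```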

Source: LocalBench article · r/LocalLLaMA thread
