LocalLLaMA Jumps on a KV-Cache Benchmark That Breaks the "q8_0 Is Basically Free" Myth

Original: Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results

LLM · Apr 25, 2026 · By Insights AI (Reddit)

Why this benchmark mattered to Reddit

This post landed because it challenged a piece of local-LLM common sense that many users had stopped questioning: q8_0 KV cache is practically lossless. The Substack benchmark turns that into data instead of folklore, and the Reddit thread immediately recognized the value. By crawl time the post had 324 points and 58 comments. That is a strong signal for a methodology-heavy LocalLLaMA submission. The community response was not “interesting chart.” It was closer to “this changes how people should size memory tradeoffs on real deployments.”

What the benchmark actually measured

The methodology is clean. The author loaded the same BF16 GGUF three times on the same machine, changing only KV-cache precision between f16, q8_0, and q4_0. The tests used roughly 250,000 tokens across six categories, with KL divergence computed token by token against the f16-cache reference. The post also notes that llama.cpp’s recently added TurboQuant-inspired attention rotation was active. That matters because it means the results are trying to describe current practical behavior, not an outdated baseline. The headline result is simple but uncomfortable for anyone who had treated q8_0 as a safe default: it is not uniformly safe.
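For readers who want to replicate the setup, llama.cpp exposes KV-cache precision as per-run flags. A minimal sketch, assuming a recent llama.cpp build (the model path and prompt are placeholders, and quantized V-cache generally requires flash attention to be enabled; check `--help` on your build for exact flag syntax):

```shell
# Same BF16 GGUF, three runs, varying only KV-cache precision.
# -fa enables flash attention, which llama.cpp needs for a quantized V cache.
llama-cli -m model-bf16.gguf -fa --cache-type-k f16  --cache-type-v f16  -p "..."
llama-cli -m model-bf16.gguf -fa --cache-type-k q8_0 --cache-type-v q8_0 -p "..."
llama-cli -m model-bf16.gguf -fa --cache-type-k q4_0 --cache-type-v q4_0 -p "..."
```

Comparing logits across such runs, token by token against the f16-cache run, is what produces the KL numbers below.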

Where the damage shows up

The sharpest numbers are on the Gemma side. The post reports Gemma 31B at KL 0.108 with q8_0 cache, while Gemma 4 26B A4B jumps to 0.377. At q4_0, the latter reaches KL 1.088 with 68.0% top-1. Qwen behaves very differently. The benchmark says both tested Qwen models stay below KL 0.04 at q8_0, and even q4_0 cache remains in a more usable 0.087–0.117 range overall, though long-document behavior degrades more sharply. That difference is what gave the thread legs. The post is not arguing that one cache format is globally good or bad. It is showing that model architecture and cache sensitivity interact in ways that can make a universal rule misleading.
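The two metrics the post reports, per-token KL divergence against the f16-cache reference and top-1 agreement, are straightforward to compute from saved logits. A minimal sketch (the function name and array shapes are our own, not the benchmark author's code):

```python
import numpy as np

def kl_and_top1(ref_logits: np.ndarray, test_logits: np.ndarray):
    """Mean per-token KL(ref || test) and top-1 agreement.

    Both inputs are (n_tokens, vocab_size) arrays of raw logits from
    the reference run (f16 cache) and the quantized-cache run.
    """
    # Numerically stable log-softmax for both runs.
    ref_logp = ref_logits - np.logaddexp.reduce(ref_logits, axis=-1, keepdims=True)
    test_logp = test_logits - np.logaddexp.reduce(test_logits, axis=-1, keepdims=True)
    # KL(ref || test) per token, then averaged over the corpus.
    kl = np.sum(np.exp(ref_logp) * (ref_logp - test_logp), axis=-1)
    # Fraction of tokens where both runs pick the same argmax token.
    top1 = np.mean(ref_logits.argmax(-1) == test_logits.argmax(-1))
    return float(kl.mean()), float(top1)
```

Under this metric, identical logits give KL 0 and 100% top-1, so numbers like KL 1.088 with 68.0% top-1 describe a cache format that frequently changes which token the model would emit.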

What the community added

The Reddit comments pushed in useful directions. One high-signal reply suggested Gemma’s degradation may be tied to continued quantization of the SWA cache and asked what happens if that stays at higher precision. Another commenter clarified implementation history around attention rotation in llama.cpp, pushing back on simplistic “inspired by X” storytelling and pointing to earlier discussion in the project. In other words, the community did what LocalLLaMA does best when it is at full quality: it treated a benchmark post as the start of implementation analysis, not the end. That makes this more than a graph dump. It becomes an operational note for anyone making VRAM-versus-quality tradeoffs on local inference stacks.
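The memory side of that tradeoff is easy to estimate. A rough sketch, using a hypothetical model shape (the per-element sizes follow GGUF's block layouts: q8_0 stores 34 bytes per 32 values, q4_0 stores 18):

```python
# Approximate bytes per stored value for each KV-cache type.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, cache_type="f16"):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return int(per_token * ctx_len)

# Example: a hypothetical 48-layer model, 8 KV heads of dim 128, 32k context.
for ct in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(48, 8, 128, 32768, ct) / 2**30
    print(f"{ct}: {gib:.2f} GiB")
# → f16: 6.00 GiB, q8_0: 3.19 GiB, q4_0: 1.69 GiB
```

The roughly 2x and 3.5x savings are why these cache types are tempting; the benchmark's point is that whether the quality cost is acceptable depends on the model, not just the format.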

Sources: Localbench benchmark post · Reddit discussion


