LocalLLaMA Jumps on a KV-Cache Benchmark That Breaks the "q8_0 Is Basically Free" Myth
Original: Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Why this benchmark mattered to Reddit
This post landed because it challenged a piece of local-LLM common sense that many users had stopped questioning: q8_0 KV cache is practically lossless. The Substack benchmark turns that into data instead of folklore, and the Reddit thread immediately recognized the value. By crawl time the post had 324 points and 58 comments. That is a strong signal for a methodology-heavy LocalLLaMA submission. The community response was not “interesting chart.” It was closer to “this changes how people should size memory tradeoffs on real deployments.”
What the benchmark actually measured
The methodology is clean. The author loaded the same BF16 GGUF three times on the same machine, changing only the KV-cache precision among f16, q8_0, and q4_0. The tests used roughly 250,000 tokens across six categories, with KL divergence computed token by token against the f16-cache reference. The post also notes that llama.cpp’s recently added TurboQuant-inspired attention rotation was active. That matters because it means the results aim to describe current practical behavior, not an outdated baseline. The headline result is simple but uncomfortable for anyone who had treated q8_0 as a safe default: it is not uniformly safe.
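The post links the write-up rather than a harness, but the core measurement is straightforward to restate. Below is a minimal NumPy sketch, assuming you have captured the full next-token logit vector at each position from the f16-cache run and from a quantized-cache run of the same model on the same inputs; the function name and setup are illustrative, not taken from the author's code.

```python
import numpy as np

def token_kl(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """KL(P_ref || Q_test) for one token position, in nats.

    ref_logits come from the f16-KV-cache run, test_logits from the
    q8_0 or q4_0 run of the same model on the same prompt prefix.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    ref = ref_logits - ref_logits.max()
    test = test_logits - test_logits.max()
    log_p = ref - np.log(np.exp(ref).sum())    # log of reference distribution P
    log_q = test - np.log(np.exp(test).sum())  # log of test distribution Q
    p = np.exp(log_p)
    return float(np.sum(p * (log_p - log_q)))

# Averaging token_kl over the ~250,000 token positions yields the kind of
# per-model summary numbers the post reports.
```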
Where the damage shows up
The sharpest numbers are on the Gemma side. The post reports Gemma 31B at KL 0.108 with q8_0 cache, while Gemma 4 26B A4B jumps to 0.377. At q4_0, the latter reaches KL 1.088 with only 68.0% top-1 agreement against the f16 reference. Qwen behaves very differently. The benchmark says both tested Qwen models stay below KL 0.04 at q8_0, and even q4_0 cache remains in a more usable 0.087–0.117 range overall, though long-document behavior degrades more sharply. That difference is what gave the thread legs. The post is not arguing that one cache format is globally good or bad. It is showing that model architecture and cache sensitivity interact in ways that make a universal rule misleading.
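To get a feel for those magnitudes, here is a toy two-token example (purely illustrative, not data from the benchmark): a modest shift in the top token's probability lands in the ~0.1 region where the q8_0 results sit, while a shift large enough to flip the greedy pick pushes KL toward 1.

```python
import math

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy next-token distributions, purely illustrative:
print(kl([0.9, 0.1], [0.7, 0.3]))   # ~0.12: top token still agrees
print(kl([0.9, 0.1], [0.3, 0.7]))   # ~0.79: greedy pick would change
```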
What the community added
The Reddit comments pushed in useful directions. One high-signal reply suggested Gemma’s degradation may be tied to quantizing the sliding-window attention (SWA) cache along with everything else, and asked what happens if that part stays at higher precision. Another commenter clarified the implementation history around attention rotation in llama.cpp, pushing back on simplistic “inspired by X” storytelling and pointing to earlier discussion in the project. In other words, the community did what LocalLLaMA does best when it is at full quality: it treated a benchmark post as the start of implementation analysis, not the end. That makes this more than a graph dump. It becomes an operational note for anyone making VRAM-versus-quality tradeoffs on local inference stacks.
Sources: Localbench benchmark post · Reddit discussion
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.