r/LocalLLaMA Follow-Up Benchmarks Favor Q4_K_M + fit-nobatch on RTX 5080 16GB
Original: Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB
What the Reddit post contributes
A detailed follow-up benchmark thread in r/LocalLLaMA gathered strong engagement (494 upvotes, 139 comments at crawl time). The author reran seven community-requested tests on Qwen3.5-35B-A3B using an RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, and a source build of llama.cpp with CUDA 12.8. Unlike one-shot benchmark posts, this thread documents revisions, caveats, and config-level tradeoffs that practitioners can replicate.
Key measured findings
The post reports that KV cache quantization at q8_0 showed near-zero perplexity impact in the shared matrix while improving throughput, supporting the recommendation to keep -ctk q8_0 -ctv q8_0. It also adds KL-divergence checks, where the author reported Q4_K_M ahead of UD-Q4_K_XL on mean KLD and top-1 token agreement. For 16GB VRAM constraints, the strongest practical result was a simplified launch setup: --fit on with batch flags removed, producing 74.7 tok/s in the posted runs and outperforming prior manual offload settings.
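Put together, the recommended settings amount to a fairly short launch line. The sketch below is illustrative, not the author's exact command: the binary name, model path, and context size are assumptions, while the `-ctk q8_0 -ctv q8_0` cache flags and the `--fit on` / no-batch-flags combination come from the post.

```shell
# Sketch of the simplified launch the post favors (hypothetical paths/values).
# -ctk/-ctv q8_0: quantized KV cache, near-zero PPL impact per the shared matrix.
# --fit on with no manual batch flags: the bundle reported at 74.7 tok/s.
./llama-server \
  -m ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 16384 \
  -ctk q8_0 -ctv q8_0 \
  --fit on
```

The point of the post's result is that this bundle, left to fit itself, beat the author's earlier hand-tuned offload and batch settings on the same 16GB card.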
Other experiments in the same post were explicitly less favorable: self-speculative ngram decoding did not produce speed gains in conversational tests, and a 27B dense variant ran far slower on this hardware profile despite its model-size appeal. A tested MXFP4_MOE path was also reported as slower in the author’s environment.
How to read the results responsibly
This is community benchmarking, not a controlled multi-lab evaluation. The author notes several limits directly in the post: context-length sensitivity, build-specific behavior, partial evaluations due to memory constraints, and backend differences (for example CUDA vs Vulkan). That transparency is useful because it frames these numbers as deployment guidance for similar consumer-GPU setups, not universal rankings.
Practical takeaway for local inference teams
For teams tuning local MoE inference on limited VRAM, the thread reinforces a pragmatic method: benchmark full config bundles, not isolated knobs; validate quality with more than one metric (PPL plus KLD-style checks); and treat automatic fit/offload behavior as something to profile under your own workloads rather than assume by default.
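The dual-metric idea can be made concrete with a small sketch. This is not the author's harness (real tools such as llama.cpp's perplexity utility work from full logits); it only illustrates how mean KLD, top-1 token agreement, and PPL are computed from per-position probabilities, with all function names and inputs hypothetical.

```python
import math

def mean_kld_and_top1(p_ref, p_test):
    """Mean KL divergence D(ref || test) and top-1 agreement across positions.

    p_ref, p_test: lists of per-position probability distributions,
    e.g. a BF16 baseline vs. a quantized model (illustrative inputs).
    """
    klds, agree = [], 0
    for ref, test in zip(p_ref, p_test):
        # KL divergence of this position's distribution pair.
        klds.append(sum(r * math.log(r / t) for r, t in zip(ref, test) if r > 0))
        # Do both models pick the same most-likely token?
        agree += max(range(len(ref)), key=ref.__getitem__) == \
                 max(range(len(test)), key=test.__getitem__)
    return sum(klds) / len(klds), agree / len(klds)

def perplexity(token_probs):
    """PPL = exp(mean negative log-probability of the observed tokens)."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```

A model can look fine on PPL while drifting on KLD (or vice versa), which is why the thread's practice of reporting both is worth copying.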
Related Articles
A LocalLLaMA thread highlighted ongoing work to add NVFP4 quantization support to llama.cpp GGUF, pointing to potential memory savings and higher throughput for compatible GPU setups.
A high-scoring LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
A popular LocalLLaMA post highlights draft PR #19726, where a contributor proposes porting IQ*_K quantization work from ik_llama.cpp into mainline llama.cpp with initial CPU backend support and early KLD checks.