r/LocalLLaMA Follow-Up Benchmarks Favor Q4_K_M + fit-nobatch on RTX 5080 16GB
Original: Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB
What the Reddit post contributes
A detailed follow-up benchmark thread in r/LocalLLaMA gathered strong engagement (494 upvotes, 139 comments at crawl time). The author reran seven community-requested tests on Qwen3.5-35B-A3B using an RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, and a source build of llama.cpp with CUDA 12.8. Unlike one-shot benchmark posts, this thread documents revisions, caveats, and config-level tradeoffs that practitioners can replicate.
Key measured findings
The post reports that KV cache quantization at q8_0 showed near-zero perplexity impact in the shared matrix while improving throughput, supporting the recommendation to keep -ctk q8_0 -ctv q8_0. It also adds KL-divergence checks, where the author reported Q4_K_M ahead of UD-Q4_K_XL on mean KLD and top-1 token agreement. For 16GB VRAM constraints, the strongest practical result was a simplified launch setup: --fit on with batch flags removed, producing 74.7 tok/s in the posted runs and outperforming prior manual offload settings.
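Put together, the recommended settings amount to a fairly short launch line. The sketch below is illustrative, not the author's exact command: the binary name, model path, and context size are assumptions, while the `-ctk q8_0 -ctv q8_0` cache flags and the `--fit on` / no-batch-flags combination come from the post.

```shell
# Sketch of the simplified launch the post favors (hypothetical paths/values).
# -ctk/-ctv q8_0: quantized KV cache, near-zero PPL impact per the shared matrix.
# --fit on with no manual batch flags: the bundle reported at 74.7 tok/s.
./llama-server \
  -m ./models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 16384 \
  -ctk q8_0 -ctv q8_0 \
  --fit on
```

The point of the post's result is that this bundle, left to fit itself, beat the author's earlier hand-tuned offload and batch settings on the same 16GB card.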
Other experiments in the same post were explicitly less favorable: self-speculative ngram decoding did not produce speed gains in conversational tests, and a 27B dense variant ran far slower on this hardware profile despite its model-size appeal. A tested MXFP4_MOE path was also reported as slower in the author’s environment.
How to read the results responsibly
This is community benchmarking, not a controlled multi-lab evaluation. The author notes several limits directly in the post: context-length sensitivity, build-specific behavior, partial evaluations due to memory constraints, and backend differences (for example CUDA vs Vulkan). That transparency is useful because it frames these numbers as deployment guidance for similar consumer-GPU setups, not universal rankings.
Practical takeaway for local inference teams
For teams tuning local MoE inference on limited VRAM, the thread reinforces a pragmatic method: benchmark full config bundles, not isolated knobs; validate quality with more than one metric (PPL plus KLD-style checks); and treat automatic fit/offload behavior as something to profile under your own workloads rather than assume by default.
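The dual-metric idea can be made concrete with a small sketch. This is not the author's harness (real tools such as llama.cpp's perplexity utility work from full logits); it only illustrates how mean KLD, top-1 token agreement, and PPL are computed from per-position probabilities, with all function names and inputs hypothetical.

```python
import math

def mean_kld_and_top1(p_ref, p_test):
    """Mean KL divergence D(ref || test) and top-1 agreement across positions.

    p_ref, p_test: lists of per-position probability distributions,
    e.g. a BF16 baseline vs. a quantized model (illustrative inputs).
    """
    klds, agree = [], 0
    for ref, test in zip(p_ref, p_test):
        # KL divergence of this position's distribution pair.
        klds.append(sum(r * math.log(r / t) for r, t in zip(ref, test) if r > 0))
        # Do both models pick the same most-likely token?
        agree += max(range(len(ref)), key=ref.__getitem__) == \
                 max(range(len(test)), key=test.__getitem__)
    return sum(klds) / len(klds), agree / len(klds)

def perplexity(token_probs):
    """PPL = exp(mean negative log-probability of the observed tokens)."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```

A model can look fine on PPL while drifting on KLD (or vice versa), which is why the thread's practice of reporting both is worth copying.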
Related Articles
A LocalLLaMA thread highlighted ongoing work to add NVFP4 quantization support to llama.cpp GGUF, pointing to potential memory savings and higher throughput for compatible GPU setups.
A high-scoring LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
A popular LocalLLaMA post highlights draft PR #19726, where a contributor proposes porting IQ*_K quantization work from ik_llama.cpp into mainline llama.cpp with initial CPU backend support and early KLD checks.