r/LocalLLaMA Follow-Up Benchmarks Favor Q4_K_M + fit-nobatch on RTX 5080 16GB

Original: Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB

LLM | Feb 28, 2026 | By Insights AI (Reddit) | 2 min read | Source

What the Reddit post contributes

A detailed follow-up benchmark thread in r/LocalLLaMA gathered strong engagement (494 upvotes, 139 comments at crawl time). The author reran seven community-requested tests on Qwen3.5-35B-A3B using an RTX 5080 16GB, Ryzen 9 9950X, 128GB RAM, and a source build of llama.cpp with CUDA 12.8. Unlike one-shot benchmark posts, this thread documents revisions, caveats, and config-level tradeoffs that practitioners can replicate.

Key measured findings

The post reports that KV cache quantization at q8_0 showed near-zero perplexity impact in the shared matrix while improving throughput, supporting the recommendation to keep -ctk q8_0 -ctv q8_0. It also adds KL-divergence checks, where the author reported Q4_K_M ahead of UD-Q4_K_XL on mean KLD and top-1 token agreement. For 16GB VRAM constraints, the strongest practical result was a simplified launch setup: --fit on with batch flags removed, producing 74.7 tok/s in the posted runs and outperforming prior manual offload settings.
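The flags discussed above can be combined into a single launch. A minimal sketch, assuming a local llama-server build and a Q4_K_M GGUF file; the model filename and context size here are illustrative placeholders, not values from the post:

```shell
# Illustrative launch combining the flags reported in the thread.
# -ctk/-ctv q8_0 quantize the KV cache at q8_0; --fit on lets
# llama.cpp pick offload automatically, with manual batch flags removed.
# Model path and -c value are placeholders, not from the post.
./llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 16384 \
  -ctk q8_0 -ctv q8_0 \
  --fit on
```

The point of the simplified form is that it drops the hand-tuned offload and batch settings the author previously used, letting the automatic fit logic place layers.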

Other experiments in the same post were explicitly less favorable: self-speculative ngram decoding did not produce speed gains in conversational tests, and a 27B dense variant ran far slower on this hardware profile despite its smaller total parameter count (the A3B MoE model activates only a few billion parameters per token, while a dense model runs all of them). A tested MXFP4_MOE path was also reported as slower in the author's environment.

How to read the results responsibly

This is community benchmarking, not a controlled multi-lab evaluation. The author notes several limits directly in the post: context-length sensitivity, build-specific behavior, partial evaluations due to memory constraints, and backend differences (for example CUDA vs Vulkan). That transparency is useful because it frames these numbers as deployment guidance for similar consumer-GPU setups, not universal rankings.

Practical takeaway for local inference teams

For teams tuning local MoE inference on limited VRAM, the thread reinforces a pragmatic method: benchmark full config bundles, not isolated knobs; validate quality with more than one metric (PPL plus KLD-style checks); and treat automatic fit/offload behavior as something to profile under your own workloads rather than assume by default.
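The KLD-style quality check mentioned above can be reproduced with a few lines of code. A minimal sketch, assuming you have extracted per-position token probability distributions from a reference model and a quantized model (the helper names and dict-based representation are illustrative, not from the post):

```python
import math

def mean_kld(ref_dists, test_dists):
    """Mean KL divergence D(ref || test) averaged over token positions.

    Each element is a dict mapping token -> probability from one
    forward pass; ref is the higher-precision reference model.
    """
    total = 0.0
    for ref, test in zip(ref_dists, test_dists):
        total += sum(p * math.log(p / test[t]) for t, p in ref.items() if p > 0)
    return total / len(ref_dists)

def top1_agreement(ref_dists, test_dists):
    """Fraction of positions where both models rank the same token first."""
    same = sum(
        max(ref, key=ref.get) == max(test, key=test.get)
        for ref, test in zip(ref_dists, test_dists)
    )
    return same / len(ref_dists)

# Toy example: two positions over a two-token vocabulary.
ref = [{"a": 0.7, "b": 0.3}, {"a": 0.4, "b": 0.6}]
quant = [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
print(mean_kld(ref, quant), top1_agreement(ref, quant))
```

Lower mean KLD and higher top-1 agreement both indicate the quantized model tracks the reference more closely, which is how the author compared Q4_K_M against UD-Q4_K_XL.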

Reddit discussion thread | Referenced data repository

