Qwen 3.6 27B’s quant test gave LocalLLaMA a favorite, and a methodology fight
Original: Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation
This LocalLLaMA post hit a nerve because it delivered the kind of data the subreddit keeps asking for: not another vague “runs great on my box” claim, but a side-by-side quant comparison for Qwen 3.6 27B. The author tested BF16, Q4_K_M, and Q8_0 using llama-cpp-python with HumanEval, HellaSwag, and BFCL, then laid out both accuracy and throughput numbers. That alone was enough to get people’s attention.
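The post does not include its harness code, but the shape of the experiment is easy to sketch: run the same task set against each quant, scoring accuracy and wall-clock time per model. Below is a minimal, hypothetical version of that loop; the `generate()` stub stands in for a real llama-cpp-python `Llama(...)` call and is not the author's actual setup.

```python
import time

# Stub in place of a llama-cpp-python call such as
# Llama(model_path="model.gguf")(prompt, ...). One answer is
# deliberately wrong to show a non-trivial accuracy.
ANSWERS = {"2+2=": "4", "3+3=": "7"}

def generate(prompt: str) -> str:
    return ANSWERS.get(prompt, "")

def evaluate(tasks, check):
    """Return (accuracy, elapsed seconds) over (prompt, expected) pairs."""
    correct = 0
    start = time.perf_counter()
    for prompt, expected in tasks:
        if check(generate(prompt), expected):
            correct += 1
    return correct / len(tasks), time.perf_counter() - start

tasks = [("2+2=", "4"), ("3+3=", "6")]
acc, secs = evaluate(tasks, lambda out, exp: exp in out)
print(f"accuracy={acc:.2f}")
```

In a real run, the same `evaluate()` would be repeated once per GGUF file, which is what makes the per-quant accuracy and tokens-per-second numbers directly comparable.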
The headline result was practical rather than dramatic. BF16 led on average accuracy at 69.78%, but it needed 54 GB of peak RAM and ran at 15.5 tokens per second. Q4_K_M came in at 66.54% average accuracy, almost matched BFCL, ran at 22.5 tokens per second, and cut peak RAM to 28 GB with a much smaller model file. Q8_0 looked less compelling in this particular run: slightly better HumanEval than Q4_K_M, but slower overall, heavier on memory, and weaker on HellaSwag. For many readers, that made Q4_K_M look like the real-world sweet spot.
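Restating the post's own figures makes the Q4_K_M case concrete. The numbers below are taken directly from the reported results; the derived "what you give up vs. what you gain" metrics are my framing, not the author's.

```python
# Reported figures from the post: (avg accuracy %, peak RAM GB, tokens/s).
results = {
    "BF16":   (69.78, 54, 15.5),
    "Q4_K_M": (66.54, 28, 22.5),
}

bf16, q4 = results["BF16"], results["Q4_K_M"]

acc_drop = bf16[0] - q4[0]        # accuracy points given up
ram_saved = 1 - q4[1] / bf16[1]   # fraction of peak RAM saved
speedup = q4[2] / bf16[2]         # throughput gain

print(f"accuracy drop: {acc_drop:.2f} points")
print(f"RAM saved: {ram_saved:.0%}")
print(f"speedup: {speedup:.2f}x")
```

Roughly three accuracy points buys close to half the peak RAM back and a ~1.45x generation speedup, which is the tradeoff readers latched onto.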
What made the thread interesting was that the applause turned into skepticism almost immediately. The top comment said the community needs more comparisons like this. The next wave asked hard questions about methodology: where were the error bars, what KV-cache quantization was used, and how did Q8_0 end up behind Q4_K_M on some tests? One commenter flatly argued that the HumanEval numbers were far below what Qwen 3.6 27B should normally achieve, which raised the possibility that the setup, not just the quant choice, shaped the outcome.
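The error-bar complaint has a quantitative basis: HumanEval has only 164 problems, so a single pass@1 score carries real sampling noise. A rough normal-approximation 95% interval (my own back-of-envelope, on a hypothetical 60% score; the post reported no intervals) shows how wide that noise band is.

```python
import math

def pass1_interval(p: float, n: int = 164, z: float = 1.96):
    """Normal-approximation confidence interval for a pass@1 proportion."""
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return p - z * se, p + z * se

lo, hi = pass1_interval(0.60)  # hypothetical 60% pass@1
print(f"95% CI: {lo:.3f} .. {hi:.3f}")  # roughly +/- 7.5 points
```

A band of about 15 points is wider than the BF16-to-Q4_K_M gap in the post, which is exactly why commenters wanted repeated runs or intervals before drawing quant-ranking conclusions.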
That is why the post worked. It gave LocalLLaMA both things it wants at once: a concrete deployment tradeoff and something technical to argue about. The immediate takeaway was that Q4_K_M may be the best balance for people who care about RAM and speed more than squeezing out every last point. The deeper takeaway was that reproducible local benchmarking still needs cleaner methodology if it wants to settle arguments instead of starting new ones.
Related Articles
A high-scoring LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a more explicit distribution-based yardstick. The post ranks community Qwen3.5-9B GGUF quants by mean KLD versus a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs.
LocalLLaMA upvoted this because it turns a messy GGUF choice into a measurable tradeoff. The post compares community Qwen3.5-9B quants against a BF16 baseline using mean KLD, then the comments push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.