r/LocalLLaMA Re-ranks Qwen3.5-9B Quants With KLD Instead of Guesswork
Original: Updated Qwen3.5-9B Quantization Comparison
Quantization comparison posts are common on r/LocalLLaMA, but many of them still end up as reputation contests or machine-specific anecdotes. This one landed because it tried to give the community something more portable: a distribution-based way to judge how far a quant drifts from the original model. The author ranks Qwen3.5-9B community GGUF files by mean KLD against a BF16 baseline and frames that as a cleaner measure of faithfulness than “it felt good on my box.” That is exactly the kind of utility post the subreddit tends to reward.
The argument in the post is straightforward. Perplexity can be noisy because it depends on a particular evaluation set, so scores can move around for reasons that have little to do with how much information quantization actually destroyed. KLD, by contrast, compares the quantized model’s next-token probability distribution directly against the baseline’s at every position, so it measures drift from the original model rather than performance on one text sample. In the ranking table, Q8_0 variants dominate the top of the fidelity chart: eaddario’s Q8_0 comes in at 0.001198 KLD, unsloth’s UD-Q8_K_XL at 0.001243, and bartowski’s Q8_0 at 0.001405. Once file size is folded back in through the post’s efficiency score, a different set of winners emerges, with several IQ4_XS, IQ4_NL, and Q5_K_S options looking more attractive for real-world memory budgets.
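To make the metric concrete, here is a minimal sketch of what a mean-KLD comparison computes, assuming you already have per-token logits from the BF16 baseline and a quantized model over the same evaluation text. This is the idea behind the ranking, not the author’s exact ik_llama.cpp pipeline, and the array names are illustrative:

```python
# Mean KL divergence between a baseline model and a quantized model,
# averaged over token positions. Both inputs are raw logits with shape
# (num_tokens, vocab_size) over the same evaluation text.
import numpy as np

def mean_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    # Numerically stable log-softmax.
    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(base_logits)   # reference (BF16) distribution
    log_q = log_softmax(quant_logits)  # quantized distribution
    p = np.exp(log_p)
    # KL(P || Q) = sum_i p_i * (log p_i - log q_i), then average over tokens.
    kld_per_token = (p * (log_p - log_q)).sum(axis=-1)
    return float(kld_per_token.mean())
```

A mean KLD near zero, like the 0.0012 figures at the top of the chart, means the quantized model assigns almost the same probabilities as the BF16 original at nearly every position.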
The practical value is in the details the author included. This was not just a chart drop. The post also lists the evaluation dataset, the run configuration of 103 chunks at a context length of 512 (-c 512), the exact ik_llama.cpp build, and NVIDIA driver version 595.97. That is why the comments immediately moved toward “do Gemma 4 next,” “what about MoE,” and “please add i1 quants.” People were treating the work as a reusable benchmark scaffold, not a one-off screenshot. One commenter even pointed out that mradermacher’s i1 quants seem to punch above their weight, which is the kind of concrete follow-up you only get when readers trust the setup.
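For readers who want to treat the setup as a scaffold, a rough driver might look like the sketch below. It assumes the two-pass KL-divergence workflow that mainline llama.cpp’s llama-perplexity tool exposes (--kl-divergence-base saves the baseline’s logits, --kl-divergence replays them against a quant); the binary path, file names, and output parsing are placeholders, not the author’s exact ik_llama.cpp invocation, and the fork’s output format may differ:

```python
# Hypothetical driver: save BF16 logits once, then score each quant file.
import re
import subprocess
from pathlib import Path

PERPLEXITY_BIN = "./llama-perplexity"  # assumed path to the built tool
EVAL_TEXT = "eval.txt"                 # the post's linked evaluation dataset
BASE_LOGITS = "bf16_logits.bin"

def save_baseline(base_model: str) -> None:
    # Pass 1: run the BF16 baseline once and save its per-token logits.
    subprocess.run([PERPLEXITY_BIN, "-m", base_model, "-f", EVAL_TEXT,
                    "-c", "512", "--kl-divergence-base", BASE_LOGITS],
                   check=True)

def kld_for_quant(quant_path: str) -> float:
    # Pass 2: replay the saved baseline logits against a quantized file.
    out = subprocess.run([PERPLEXITY_BIN, "-m", quant_path, "-c", "512",
                          "--kl-divergence-base", BASE_LOGITS,
                          "--kl-divergence"],
                         check=True, capture_output=True, text=True)
    # Illustrative parsing: the exact label varies between builds and forks.
    match = re.search(r"Mean\s+KLD:\s*([0-9.]+)", out.stdout)
    if match is None:
        raise RuntimeError(f"could not find mean KLD for {quant_path}")
    return float(match.group(1))

if __name__ == "__main__":
    save_baseline("qwen3.5-9b-bf16.gguf")  # hypothetical file name
    for gguf in sorted(Path("quants").glob("*.gguf")):
        print(f"{gguf.name}: {kld_for_quant(str(gguf)):.6f}")
```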
The useful readout is simple. If your top priority is minimal drift from BF16, Q8_0-class files still look strongest. If you care more about the size-to-fidelity balance, the post suggests that several IQ4 and Q5 variants deserve more attention than the community usually gives them. The original discussion is on r/LocalLLaMA, and the evaluation dataset is linked from the post as a gist. The energy around this thread comes from a familiar frustration: local-LLM users are tired of choosing quants by folklore and want something closer to a measurement culture.
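The post’s exact efficiency formula is not reproduced here, but the size-to-fidelity idea is easy to sketch: score each quant by how much fidelity it buys per gigabyte, so that a small file with modest drift can outrank a large file with tiny drift. Everything in the snippet below, including the labels, KLD values, and file sizes, is a hypothetical stand-in for illustration, not data from the post:

```python
import math

# Illustrative only: labels, KLD values, and sizes are invented stand-ins.
quants = [
    ("Q8_0-class",   0.0012, 9.8),  # (label, mean KLD, size in GB)
    ("Q5_K_S-class", 0.0061, 6.3),
    ("IQ4_XS-class", 0.0150, 5.0),
]

def efficiency(mean_kld: float, size_gb: float) -> float:
    # One plausible size-aware score (not the post's actual formula):
    # orders of magnitude of KLD suppressed, per gigabyte of file size.
    return -math.log10(mean_kld) / size_gb

for name, kld, size in sorted(quants, key=lambda q: -efficiency(q[1], q[2])):
    print(f"{name:13s} KLD={kld:.4f} size={size:.1f}GB "
          f"score={efficiency(kld, size):.3f}")
```

Under a score like this, the IQ4-class entry comes out ahead despite its higher KLD, which is the shape of the trade-off the post’s efficiency ranking surfaces.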
Related Articles
A Reddit post in r/LocalLLaMA introduces a GGUF release of Qwen3.5-122B-A10B Uncensored (Aggressive) alongside new K_P quants. The author claims 0/465 refusals and zero capability loss, but those figures come from the author’s own tests rather than independent verification.
A popular r/LocalLLaMA post highlighted a community merge of uncensored and reasoning-distilled Qwen 3.5 9B checkpoints, underscoring the appetite for behavior-tuned small local models.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.