LocalLLaMA Wants Qwen3.5-9B Quant Choices Backed by KLD, Not Vibes

The LocalLLaMA comparison of Qwen3.5-9B quantizations landed because it solved a very practical problem: there are too many GGUF files, and their names alone are not enough guidance. Instead of telling users to pick a popular upload, the post compares community quants against a BF16 baseline using mean KLD (Kullback-Leibler divergence). In the author's framing, lower KLD means the quantized model's next-token probability distribution stays closer to the one produced by the original BF16 weights.
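The metric itself is small enough to sketch in plain Python. This version assumes you already have per-position logits from the BF16 model and from a quant, both computed over the same token sequence; collecting those logits is out of scope here. At each position it computes KL(P_base || P_quant) and then averages over all positions. Treating BF16 as the reference distribution is an assumption about the direction the post used. The function names are hypothetical and do not come from the post or from any particular tool.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kld(base_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_base || P_quant) at a single token position (assumed direction)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    # sum_i p_i * (log p_i - log q_i)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(base_rows: Sequence[Sequence[float]],
             quant_rows: Sequence[Sequence[float]]) -> float:
    """Average per-position KLD over every evaluated token position."""
    if len(base_rows) != len(quant_rows) or not base_rows:
        raise ValueError("need the same, non-zero number of positions")
    total = sum(token_kld(b, q) for b, q in zip(base_rows, quant_rows))
    return total / len(base_rows)
```

When the two models produce identical logits, `mean_kld` returns 0.0. Any reshuffling of probability mass pushes the value above zero. Adding a constant to every logit in a row does not change the result, because softmax ignores such shifts.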

That metric choice is why the thread had technical weight. Perplexity can be noisy and dataset-sensitive; it can improve by accident on a test slice even when the model has drifted. KLD is not magic, but it directly measures how far the quantized model's token distribution has moved from the baseline's at each evaluated position. For local users choosing among Q8_0, Q4 variants, i-quants, and provider-specific builds, that is a more useful starting point than file size alone.
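A toy single-position example shows how perplexity can improve while the distribution still drifts. The probabilities are made up and do not come from the post. Perplexity only looks at the probability assigned to the token that actually appears in the test text. A quant that happens to raise that one probability therefore scores "better," even if it has badly reshuffled the rest of the distribution. KLD, by contrast, compares the whole distribution.

```python
import math


def perplexity(prob_rows, targets):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(row[t]) for row, t in zip(prob_rows, targets))
    return math.exp(nll / len(targets))


def kld(p, q):
    """KL(p || q) for one position; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up single-position example (not data from the post).
base = [[0.5, 0.3, 0.2]]
quant = [[0.6, 0.05, 0.35]]
targets = [0]  # index of the token that actually appears in the test text

print(perplexity(base, targets))   # about 2.0
print(perplexity(quant, targets))  # about 1.67: lower, so "better" by PPL
print(kld(base[0], quant[0]))      # about 0.33: the distribution still moved
```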

The table placed near-lossless Q8-style options at the top, with several entries scoring a mean KLD below 0.01. Commenters treated that as a shared reference rather than a final answer. Some asked for the same treatment of Gemma 4 and larger Qwen models. Others suggested improving the chart with different marker shapes for different quant publishers. A longer technical comment praised the efficiency calculation while asking for KLD at near-full context lengths, because quantization can hurt long-context behavior even when short-context numbers look fine.
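The long-context request amounts to reporting KLD as a function of position instead of as one average. The sketch below groups per-token KLD values by their position in the context window. It assumes those per-token values were already collected from long evaluation windows. The function name and the 4096-token default bucket width are hypothetical choices, not anything the commenter specified. If the profile stays roughly flat across buckets, the short-context mean is probably representative. If it rises with position, that would be the kind of degradation the commenter was worried about.

```python
from collections import defaultdict


def kld_by_context_bucket(per_token_kld, positions, bucket_size=4096):
    """Average per-token KLD inside fixed-width context-position buckets.

    per_token_kld[i] is the KLD measured at position positions[i] within its
    evaluation window. Returns {bucket_start: mean_kld}, ordered by position.
    """
    if len(per_token_kld) != len(positions):
        raise ValueError("per_token_kld and positions must align")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, pos in zip(per_token_kld, positions):
        start = (pos // bucket_size) * bucket_size
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}
```

For example, `kld_by_context_bucket([0.01, 0.03, 0.05], [10, 5000, 9000])` gives `{0: 0.01, 4096: 0.03, 8192: 0.05}`. That pattern would be a rising profile, though the numbers here are invented.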

That is where the community energy is: LocalLLaMA is moving from casual model recommendations toward repeatable measurement. The post does not crown one universal best quant. Instead, it gives users a way to talk about tradeoffs among file size, BPW (bits per weight), KLD, PPL (perplexity), memory fit, and workload. For local inference, that is often the difference between chasing a filename and making an informed deployment choice.
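A small selection sketch can make some of these tradeoffs concrete. It approximates BPW as file size in bits divided by parameter count. It then picks the lowest-KLD quant whose file, plus a fixed overhead, fits a memory budget. Everything here is an assumption for illustration: the `QuantCandidate` type, the single overhead figure standing in for KV cache and compute buffers, and the idea of treating "9B" as exactly 9e9 parameters. The sketch also ignores PPL and workload fit, which still call for judgment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QuantCandidate:  # hypothetical record type for illustration
    name: str
    file_bytes: int
    mean_kld: float


def bits_per_weight(file_bytes: int, n_params: float) -> float:
    """Rough BPW: total file bits divided by parameter count.

    GGUF files also carry metadata, and some tensors may be stored at higher
    precision, so this is an approximation rather than an exact per-tensor BPW.
    """
    return file_bytes * 8 / n_params


def pick_quant(candidates: Sequence[QuantCandidate],
               memory_budget_bytes: int,
               overhead_bytes: int) -> Optional[QuantCandidate]:
    """Lowest-KLD candidate whose file plus a fixed overhead fits the budget.

    overhead_bytes is a single user-supplied estimate (KV cache, buffers, etc.),
    not something derived from the model. Ties on KLD prefer the smaller file.
    """
    fitting = [c for c in candidates
               if c.file_bytes + overhead_bytes <= memory_budget_bytes]
    if not fitting:
        return None
    return min(fitting, key=lambda c: (c.mean_kld, c.file_bytes))
```

For example, a hypothetical 5,000,000,000-byte file for a model assumed to have 9e9 parameters works out to about 4.4 BPW.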

LLM Reddit Apr 14, 2026 1 min read

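The standard r/LocalLLaMA set for Qwen3.5-9B quants: pick by KLD, not by feel

This comparison caught on in r/LocalLLaMA because it explained GGUF file selection in terms of distribution differences rather than gut feel or reputation. The author ranked community quants by mean KLD against a BF16 baseline. The Q8_0 family sat at the top on fidelity, and several IQ4 and Q5 variants fell in the range that balances size against fidelity.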

#qwen #quantization #gguf
