LocalLLaMA's interest here went beyond a flashy speed number. A post claiming 105-108 tps and a full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion about how much quality survives once local inference gets this fast.
#quantization
LocalLLaMA paid attention because this post breaks a default assumption: q8_0 KV cache is not "practically lossless" for every model. Gemma 4 degrades much earlier than Qwen 3.6, and the thread quickly moved on to sliding-window attention (SWA) cache behavior and the long-context implications.
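To see why the cache format becomes a long-context question at all, here is a rough sizing sketch. The layer and head counts below are illustrative assumptions, not the actual configs of Gemma 4 or Qwen 3.6; the q8_0 figure uses llama.cpp's block layout of 32 int8 values plus one fp16 scale.

```python
# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim * ctx_len * bytes/elem.
# Model dimensions below are illustrative assumptions, not any specific model's config.
def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2.0):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

for ctx in (64_000, 218_000, 256_000):
    f16 = kv_cache_gib(ctx, bytes_per_elem=2.0)      # full-precision f16 cache
    q8 = kv_cache_gib(ctx, bytes_per_elem=1.0625)    # q8_0: 34 bytes per 32-value block
    print(f"{ctx:>7} tokens: f16 ~ {f16:.1f} GiB, q8_0 ~ {q8:.1f} GiB")
```

The point of the arithmetic is simply that at six-figure context lengths the cache format roughly doubles or halves the memory bill, which is why the quality question matters.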
r/LocalLLaMA reacted because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.
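For readers who want to picture the setup, here is a minimal sketch of that kind of configuration through vLLM's Python API. The checkpoint ID is hypothetical and the throughput in the post came from the poster's own hardware and settings, not from this snippet.

```python
from vllm import LLM, SamplingParams

# Hypothetical INT4 checkpoint ID; the long max_model_len mirrors the 218k claim.
llm = LLM(
    model="Qwen/Qwen3.6-27B-INT4",
    max_model_len=218_000,
    gpu_memory_utilization=0.92,   # leave headroom on a single 32 GB RTX 5090
)

out = llm.generate(
    ["Explain KV-cache quantization in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)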
LocalLLaMA reacted because the post did not just tweak a benchmark table. It went after a widely repeated local-inference assumption and showed that the answer changes sharply by model family, especially for Gemma. By crawl time on April 25, 2026, the thread had 324 points and 58 comments.
Why it matters: inference cost is now a product constraint, not only an infrastructure problem. Cohere said its W4A8 path in vLLM is up to 58% faster on time to first token (TTFT) and 45% faster on time per output token (TPOT) versus W4A16 on Hopper.
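For anyone checking numbers like these against their own stack, TTFT and TPOT are usually measured client-side against an OpenAI-compatible endpoint such as a local vLLM server. A simplified sketch, with a placeholder URL and model name, and chunk count standing in for token count:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Summarize W4A8 vs W4A16 in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1   # approximation: each streamed delta may not be exactly one token
end = time.perf_counter()

ttft = first_token_at - start                          # time to first token
tpot = (end - first_token_at) / max(n_chunks - 1, 1)   # time per output token after the first
print(f"TTFT ~ {ttft*1000:.0f} ms, TPOT ~ {tpot*1000:.1f} ms/token")
```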
r/LocalLLaMA pushed this post up because the usual "trust me bro" performance report came with real operating conditions: 8-bit quantization, a 64k context, OpenCode, and Android debugging.
PrismML is testing whether smaller open models can stay useful by changing the weight format, not only the architecture. Ternary Bonsai ships 8B, 4B and 1.7B models at 1.58 bits per weight, with the 8B variant listed at 1.75GB.
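The 1.58-bit figure is just the information content of a ternary weight, and the listed file size is roughly consistent with it. A back-of-the-envelope check, where the overhead split at the end is an assumption rather than a published breakdown:

```python
import math

# Ternary weights (-1, 0, +1) carry log2(3) ~ 1.585 bits of information each.
params = 8e9
bits_per_weight = math.log2(3)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"ternary weights alone ~ {weights_gb:.2f} GB")   # ~ 1.59 GB

# The listed 1.75 GB leaves roughly 0.16 GB, plausibly embeddings, norms, and
# scales kept at higher precision -- an assumption, not a published breakdown.
```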
LocalLLaMA upvoted this because it turns a messy GGUF choice into a measurable tradeoff. The post compares community Qwen3.5-9B quants against a BF16 baseline using mean KLD, then the comments push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
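For readers unfamiliar with the yardstick, mean KLD compares each quant's per-token next-token distribution against the BF16 reference on the same text and averages the divergence. The post relies on llama.cpp tooling for this; the standalone Hugging Face sketch below is illustrative only, and both checkpoint IDs are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_id = "Qwen/Qwen3.5-9B"                           # BF16 baseline (placeholder ID)
quant_id = "some-user/Qwen3.5-9B-requantized"        # hypothetical quantized checkpoint

tok = AutoTokenizer.from_pretrained(ref_id)
ref = AutoModelForCausalLM.from_pretrained(ref_id, torch_dtype=torch.bfloat16)
quant = AutoModelForCausalLM.from_pretrained(quant_id)

ids = tok(open("eval_corpus.txt").read(), return_tensors="pt").input_ids[:, :2048]

with torch.no_grad():
    ref_logp = F.log_softmax(ref(ids).logits, dim=-1)
    q_logp = F.log_softmax(quant(ids).logits, dim=-1)

# KL(ref || quant) per position, then averaged: lower mean KLD = closer to the baseline.
kld = torch.sum(ref_logp.exp() * (ref_logp - q_logp), dim=-1)
print(f"mean KLD over {kld.numel()} positions: {kld.mean().item():.4f}")
```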
LocalLLaMA reacted with genuine wonder because the demo is simple to grasp: a 1.7B Bonsai model, about 290MB, running in a browser through WebGPU. The same thread also did the useful reality check, asking about tokens per second, hallucinations, llama.cpp support, and whether 1-bit models are ready for anything beyond narrow tasks.
Quantization only matters when the accuracy hit stays small enough to use in production. Red Hat AI says its quantized Gemma 4 31B retains 99%+ of baseline accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.
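For a sense of what that workflow looks like, here is a minimal sketch of the one-shot quantization flow LLM Compressor exposes. The exact import paths and arguments vary across llm-compressor versions, and the model ID, dataset, and scheme here are placeholders rather than Red Hat's published recipe.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize Linear layers with GPTQ, keeping the output head in higher precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="google/gemma-4-31b-it",      # placeholder checkpoint ID
    dataset="open_platypus",            # small calibration set
    recipe=recipe,
    output_dir="gemma-4-31b-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```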
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a more explicit distribution-based yardstick. The post ranks community Qwen3.5-9B GGUF quants by mean KLD versus a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs.
A strong r/LocalLLaMA reaction suggests PrismML’s Bonsai launch is landing as more than another compression headline. The discussion combines the company’s end-to-end 1-bit claims with early hands-on reports that the models feel materially more usable than earlier BitNet-style experiments.