Qwen3.6 27B Hits 100 tps on One RTX 5090, and LocalLLaMA Immediately Asks About Quality

Original: Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

LLM · Apr 27, 2026 · By Insights AI (Reddit) · 2 min read

The LocalLLaMA thread 1sw21op landed because it hit the community’s favorite intersection: real hardware, real settings, and numbers that sound barely plausible until someone posts the launch flags. The author said Qwen3.6-27B-INT4 reached 105-108 tokens per second on a single RTX 5090 while keeping the model’s full 256k native context window under vLLM 0.19.

The post attributes the jump to a Lorbus AutoRound INT4 quant, fp8 KV cache, and MTP-based speculative decoding. The shared launch configuration used --max-model-len 262144, --kv-cache-dtype fp8_e4m3, --quantization auto_round, and an MTP speculative config with three speculative tokens. Just as important, the author framed it as an iteration on the previous day’s 80 tps / 218k context setup rather than a one-off screenshot. That made the thread feel like a reproducible tuning recipe instead of pure bench theater.
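Assembled from the flags quoted above, the launch would look roughly like the following. This is a reconstruction, not the author's verbatim command: the model path is a placeholder, and the exact shape of the speculative config is an assumption (recent vLLM releases take a JSON string via `--speculative-config`, but field names vary by version):

```shell
# Hedged reconstruction of the post's vLLM 0.19 launch.
# Model path and speculative-config field names are assumptions,
# not the author's exact command.
vllm serve Qwen/Qwen3.6-27B-AutoRound-INT4 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8_e4m3 \
  --quantization auto_round \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```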

The comments are what gave the post its shape. People did not just cheer the number. They immediately asked where the quality landed relative to familiar Unsloth quants, whether coding-agent performance held up, and what the tradeoffs looked like on smaller 16GB or 24GB VRAM systems. One detailed reply added a 24GB RTX 3090 datapoint at 71-83 tok/s after warmup and linked the result to turboquant-style KV compression, MTP, cudagraph mode choices, and chunked prefill behavior.
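The MTP speculative-decoding gain mentioned in both the post and the comments can be reasoned about with the standard acceptance-rate model: with k drafted tokens and per-token acceptance probability p, each verification pass emits an expected (1 − p^(k+1)) / (1 − p) tokens. A minimal sketch; the 0.8 acceptance rate is an illustrative assumption, not a figure measured in the thread:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass, given k speculative
    tokens and i.i.d. per-token acceptance probability p (geometric sum)."""
    return sum(p**i for i in range(k + 1))

# With the post's 3 speculative tokens and an assumed 80% acceptance rate,
# each forward pass yields nearly 3 tokens instead of 1.
print(round(expected_tokens_per_step(0.8, 3), 3))  # 2.952
```

This is why three speculative tokens can roughly double or triple decode throughput without changing the base model at all, provided the draft acceptance rate stays high.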

  • Claimed throughput: 105-108 tps on one RTX 5090.
  • Claimed context length: full native 256k for Qwen3.6-27B.
  • Shared ingredients: AutoRound INT4 quant, fp8 KV cache, MTP speculative decoding, and vLLM 0.19 launch settings.
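One reason the fp8 KV cache matters at this context length is simple arithmetic: halving the bytes per cached element halves the KV memory needed for a given token count. A back-of-envelope sketch; the layer and head dimensions below are placeholder assumptions, since the post does not state Qwen3.6-27B's actual architecture:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int) -> int:
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim elements per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

SEQ = 262_144  # the post's --max-model-len
# ASSUMED GQA config, not Qwen3.6-27B's published dimensions:
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16 = kv_cache_bytes(SEQ, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8 = kv_cache_bytes(SEQ, LAYERS, KV_HEADS, HEAD_DIM, 1)
print(f"fp16 KV cache at 256k tokens: {fp16 / 2**30:.1f} GiB")  # 48.0 GiB
print(f"fp8  KV cache at 256k tokens: {fp8 / 2**30:.1f} GiB")   # 24.0 GiB
```

Even with assumed dimensions, the ratio is the point: fp8 KV cache frees half the cache budget, which is what makes a 256k window thinkable next to INT4 weights on a single consumer card.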

That community reaction is the real story. In local inference circles, a speed post only matters if other people can map it onto their own rigs and decide whether the quality loss is acceptable. This thread got traction because it suggested that a 27B-class local model might be entering a new usability band: fast enough to feel interactive, long-context enough to be practical, and still open to aggressive replication and skepticism.
