Qwen3.6 27B Hits 100 tps on One RTX 5090, and LocalLLaMA Immediately Asks About Quality

Original: Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

LLM · Apr 27, 2026 · By Insights AI (Reddit) · 2 min read

The LocalLLaMA thread 1sw21op landed because it hit the community’s favorite intersection: real hardware, real settings, and numbers that sound barely plausible until someone posts the launch flags. The author said Qwen3.6-27B-INT4 reached 105-108 tokens per second on a single RTX 5090 while keeping the model’s full 256k native context window under vLLM 0.19.

The post attributes the jump to a Lorbus AutoRound INT4 quant, fp8 KV cache, and MTP-based speculative decoding. The shared launch configuration used --max-model-len 262144, --kv-cache-dtype fp8_e4m3, --quantization auto_round, and an MTP speculative config with three speculative tokens. Just as important, the author framed it as an iteration on the previous day’s 80 tps / 218k context setup rather than a one-off screenshot. That made the thread feel like a reproducible tuning recipe instead of pure bench theater.
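Assembled from the flags quoted above, the launch would look roughly like the following. This is a reconstruction, not the author's verbatim command: the model path is a placeholder, and the exact shape of the speculative config is an assumption (recent vLLM releases take a JSON string via `--speculative-config`, but field names vary by version):

```shell
# Hedged reconstruction of the post's vLLM 0.19 launch.
# Model path and speculative-config field names are assumptions,
# not the author's exact command.
vllm serve Qwen/Qwen3.6-27B-AutoRound-INT4 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8_e4m3 \
  --quantization auto_round \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```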

The comments are what gave the post its shape. People did not just cheer the number. They immediately asked where the quality landed relative to familiar Unsloth quants, whether coding-agent performance held up, and what the tradeoffs looked like on smaller 16GB or 24GB VRAM systems. One detailed reply added a 24GB RTX 3090 datapoint at 71-83 tok/s after warmup and linked the result to turboquant-style KV compression, MTP, cudagraph mode choices, and chunked prefill behavior.
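The MTP speculative-decoding gain mentioned in both the post and the comments can be reasoned about with the standard acceptance-rate model: with k drafted tokens and per-token acceptance probability p, each verification pass emits an expected (1 − p^(k+1)) / (1 − p) tokens. A minimal sketch; the 0.8 acceptance rate is an illustrative assumption, not a figure measured in the thread:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass, given k speculative
    tokens and i.i.d. per-token acceptance probability p (geometric sum)."""
    return sum(p**i for i in range(k + 1))

# With the post's 3 speculative tokens and an assumed 80% acceptance rate,
# each forward pass yields nearly 3 tokens instead of 1.
print(round(expected_tokens_per_step(0.8, 3), 3))  # 2.952
```

This is why three speculative tokens can roughly double or triple decode throughput without changing the base model at all, provided the draft acceptance rate stays high.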

  • Claimed throughput: 105-108 tps on one RTX 5090.
  • Claimed context length: full native 256k for Qwen3.6-27B.
  • Shared ingredients: AutoRound INT4 quant, fp8 KV cache, MTP speculative decoding, and vLLM 0.19 launch settings.
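One reason the fp8 KV cache matters at this context length is simple arithmetic: halving the bytes per cached element halves the KV memory needed for a given token count. A back-of-envelope sketch; the layer and head dimensions below are placeholder assumptions, since the post does not state Qwen3.6-27B's actual architecture:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int) -> int:
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim elements per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

SEQ = 262_144  # the post's --max-model-len
# ASSUMED GQA config, not Qwen3.6-27B's published dimensions:
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16 = kv_cache_bytes(SEQ, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8 = kv_cache_bytes(SEQ, LAYERS, KV_HEADS, HEAD_DIM, 1)
print(f"fp16 KV cache at 256k tokens: {fp16 / 2**30:.1f} GiB")  # 48.0 GiB
print(f"fp8  KV cache at 256k tokens: {fp8 / 2**30:.1f} GiB")   # 24.0 GiB
```

Even with assumed dimensions, the ratio is the point: fp8 KV cache frees half the cache budget, which is what makes a 256k window thinkable next to INT4 weights on a single consumer card.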

That community reaction is the real story. In local inference circles, a speed post only matters if other people can map it onto their own rigs and decide whether the quality loss is acceptable. This thread got traction because it suggested that a 27B-class local model might be entering a new usability band: fast enough to feel interactive, long-context enough to be practical, and still open to aggressive replication and skepticism.
