LocalLLaMA Loves the 80 TPS Qwen3.6 Demo, Then Immediately Starts Auditing the Fine Print
Original: Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
LocalLLaMA liked the headline number, but it did not stop at the headline. A user reported running Qwen3.6-27B at about 80 tokens per second with a 218k context window on a single RTX 5090, served by vLLM 0.19.1rc1 with an NVFP4 quantization plus MTP (multi-token prediction) build shared on Hugging Face. In a subreddit that measures progress in VRAM, throughput, and what fits on one real machine, that combination was enough to light up the thread.
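For readers who want to picture what the setup looks like in practice, here is a minimal sketch of a long-context vLLM launch in the spirit of the thread. The model identifier, context length, and memory settings are assumptions taken from the post, not a verified recipe, and the MTP/speculative-decoding configuration depends on the exact vLLM release and the Hugging Face build, so it is omitted here.

```python
# Minimal sketch of a long-context vLLM setup (offline Python API).
# Model ID and numbers are illustrative assumptions from the thread,
# not a confirmed reproduction of the original poster's command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hf-id-of-the-nvfp4-qwen-build",  # hypothetical placeholder for the shared build
    max_model_len=218_000,        # the claimed 218k context window
    gpu_memory_utilization=0.95,  # leave a little headroom on the 5090's 32 GB
)

outputs = llm.generate(
    ["Summarize the following notes:\n..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```

The interesting part is not the launch itself but whether the KV cache for anything close to 218k tokens actually fits alongside the quantized weights, which is where the thread's questions start.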
The appeal was obvious. This was not another cloud-cluster brag or a vague benchmark image. It was a reproducible local recipe tied to one GPU, one serving stack, and one very practical promise: long context without giving up interactive speed. That is exactly the kind of post LocalLLaMA tends to reward. It compresses the abstract model race into a question people can actually act on. If you own the hardware, what can you run today, and how fast can you push it?
The comments, though, were classic LocalLLaMA: excitement followed immediately by audit mode. Readers asked what prompt lengths were used, pointing out that context-window claims mean little without real prompt occupancy. Others suggested trying DFlash or moving to Q8 if the speculative-decoding acceptance rate held up. Another warning cut the celebration down to size by arguing the chosen quantization had weak KLD characteristics (KL divergence against the full-precision model). Even basic tooling questions mattered: people wanted to know when vLLM meaningfully beats something like LM Studio, because deployment friction is part of the benchmark too.
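The prompt-length objection is also the easiest one to test at home. Below is a rough sketch of how a reader might probe decode throughput at different prompt occupancies against a local vLLM server, assuming it exposes the usual OpenAI-compatible endpoint on port 8000; the served model name and prompt sizes are placeholders, and the measurement deliberately includes time to first token, so it understates pure decode speed on very long prompts.

```python
# Rough throughput check against a local vLLM server (OpenAI-compatible API).
# Endpoint, model name, and prompt sizes are assumptions for illustration.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure_tps(prompt: str, max_tokens: int = 512) -> float:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen3.6-27b",  # hypothetical: must match the name the server reports
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    # Completion tokens over wall time; includes TTFT, so long prompts look slower.
    return resp.usage.completion_tokens / elapsed

# Speculative-decoding gains often shrink as the context fills up,
# which is exactly what the commenters were probing.
for filler_words in (1_000, 50_000):
    prompt = "word " * filler_words + "\nSummarize the above."
    print(f"~{filler_words} filler words -> {measure_tps(prompt):.1f} tok/s")
```

Numbers from a script like this are what turn an 80 tps screenshot into something other people can compare against their own hardware.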
That mix of hype and nitpicking is why the post worked. LocalLLaMA was not really voting for a screenshot. It was voting for a local-inference recipe that looks close enough to practical use that people can start poking holes in it. The result is more useful than raw cheerleading. By the end of the thread, the number that mattered was not just 80 tps. It was how much of that result survives once context length, quantization quality, and real workloads are spelled out. Sources: the Reddit thread and the linked Hugging Face model page.