LocalLLaMA Loves the 80 TPS Qwen3.6 Demo, Then Immediately Starts Auditing the Fine Print
Original: Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
LocalLLaMA liked the headline number, but it did not stop at the headline. A user reported running Qwen3.6-27B at about 80 tokens per second with a 218k context window on a single RTX 5090, served by vLLM 0.19.1rc1 with an NVFP4 quantization plus MTP (multi-token prediction) build shared on Hugging Face. In a subreddit that measures progress in VRAM, throughput, and what fits on one real machine, that combination was enough to light up the thread.
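For readers who want to picture what the setup looks like in practice, here is a minimal sketch of a long-context vLLM launch in the spirit of the thread. The model identifier, context length, and memory settings are assumptions taken from the post, not a verified recipe, and the MTP/speculative-decoding configuration depends on the exact vLLM release and the Hugging Face build, so it is omitted here.

```python
# Minimal sketch of a long-context vLLM setup (offline Python API).
# Model ID and numbers are illustrative assumptions from the thread,
# not a confirmed reproduction of the original poster's command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hf-id-of-the-nvfp4-qwen-build",  # hypothetical placeholder for the shared build
    max_model_len=218_000,        # the claimed 218k context window
    gpu_memory_utilization=0.95,  # leave a little headroom on the 5090's 32 GB
)

outputs = llm.generate(
    ["Summarize the following notes:\n..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```

The interesting part is not the launch itself but whether the KV cache for anything close to 218k tokens actually fits alongside the quantized weights, which is where the thread's questions start.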
The appeal was obvious. This was not another cloud-cluster brag or a vague benchmark image. It was a reproducible local recipe tied to one GPU, one serving stack, and one very practical promise: long context without giving up interactive speed. That is exactly the kind of post LocalLLaMA tends to reward. It compresses the abstract model race into a question people can actually act on. If you own the hardware, what can you run today, and how fast can you push it?
The comments, though, were classic LocalLLaMA: excitement followed immediately by audit mode. Readers asked what prompt lengths were used, pointing out that context-window claims mean little without real prompt occupancy. Others suggested trying DFlash or moving to Q8 if the speculative-decoding acceptance rate held up. Another warning cut the celebration down to size by arguing the chosen quantization had weak KLD characteristics (KL divergence against the full-precision model). Even basic tooling questions mattered: people wanted to know when vLLM meaningfully beats something like LM Studio, because deployment friction is part of the benchmark too.
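The prompt-length objection is also the easiest one to test at home. Below is a rough sketch of how a reader might probe decode throughput at different prompt occupancies against a local vLLM server, assuming it exposes the usual OpenAI-compatible endpoint on port 8000; the served model name and prompt sizes are placeholders, and the measurement deliberately includes time to first token, so it understates pure decode speed on very long prompts.

```python
# Rough throughput check against a local vLLM server (OpenAI-compatible API).
# Endpoint, model name, and prompt sizes are assumptions for illustration.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure_tps(prompt: str, max_tokens: int = 512) -> float:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen3.6-27b",  # hypothetical: must match the name the server reports
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    # Completion tokens over wall time; includes TTFT, so long prompts look slower.
    return resp.usage.completion_tokens / elapsed

# Speculative-decoding gains often shrink as the context fills up,
# which is exactly what the commenters were probing.
for filler_words in (1_000, 50_000):
    prompt = "word " * filler_words + "\nSummarize the above."
    print(f"~{filler_words} filler words -> {measure_tps(prompt):.1f} tok/s")
```

Numbers from a script like this are what turn an 80 tps screenshot into something other people can compare against their own hardware.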
That mix of hype and nitpicking is why the post worked. LocalLLaMA was not really voting for a screenshot. It was voting for a local-inference recipe that looks close enough to practical use that people can start poking holes in it. The result is more useful than raw cheerleading. By the end of the thread, the number that mattered was not just 80 tps. It was how much of that result survives once context length, quantization quality, and real workloads are spelled out. Sources: the Reddit thread and the linked Hugging Face model page.