LocalLLaMA Sees a New Bar for Local Inference: Qwen 3.6 27B at ~80 t/s on One RTX 5090
Original: Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Local model communities care less about glossy launch charts than about what actually fits on a desk. That is why this post broke through. The author claimed a Qwen3.6-27B setup using an NVFP4+MTP Hugging Face variant and vLLM 0.19.1rc1 could hit about 80 tokens per second on a single RTX 5090 while serving a 218k context window. In LocalLLaMA terms, that is not a vague “feels fast” impression. It is the kind of number that changes what people think a one-GPU workstation can reasonably do.
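Claims like this are easy to sanity-check with back-of-the-envelope arithmetic. The helper below is my own illustration, not from the post, and it assumes a steady decode rate; in practice tokens-per-second drops as the context fills:

```python
def decode_seconds(tokens: int, tokens_per_second: float = 80.0) -> float:
    """Wall-clock time to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second

# At the claimed ~80 t/s, a 4,000-token completion takes about 50 seconds,
# and streaming out a full 218k-token window would take roughly 45 minutes.
print(decode_seconds(4_000))          # 50.0 seconds
print(decode_seconds(218_000) / 60)   # ~45.4 minutes
```

The second number is why commenters cared more about realistic prompt lengths than the theoretical maximum: nobody decodes 218k tokens in one sitting, but a coding agent's 30k-40k-token prompts are hit on every turn.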
The linked model card helps explain why the thread mattered. Qwen3.6-27B-Text-NVFP4-MTP is a text-only NVFP4-quantized sibling of Qwen/Qwen3.6-27B with the MTP head restored in bf16 so speculative decoding actually works. The repo is tuned for Blackwell-class hardware, uses the faster modelopt path, and explicitly says it should run on RTX 5090-class cards. The interesting part is not magic; it is systems work. Quantization, speculative decoding, and runtime choices are doing a lot of the lifting here.
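The post does not include the author's launch command, but a setup like the one described would typically run behind vLLM's OpenAI-compatible server. A minimal sketch, where the `<org>` prefix is a placeholder for the actual Hugging Face namespace and flag availability depends on the vLLM version (check `vllm serve --help` before relying on these):

```shell
# Hedged sketch, not the author's exact invocation: serve an
# NVFP4-quantized Qwen variant on a single GPU with vLLM.
# <org> is a placeholder for the Hugging Face organization.
vllm serve <org>/Qwen3.6-27B-Text-NVFP4-MTP \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95
```

`--max-model-len` caps the context window the server will accept, and `--gpu-memory-utilization` controls how much VRAM vLLM claims for weights plus KV cache; both matter when squeezing a 27B model and a six-figure context onto one card.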
The comment section immediately pulled the discussion back to practical questions. One reader asked the obvious operational question: what does vLLM buy over LM Studio in normal use? Another pushed on benchmark realism, noting that a giant theoretical context window matters less than the prompt length used in the actual test, especially for coding agents that quickly burn 30k to 40k tokens. Others questioned whether the speed story comes mostly from aggressive quantization, and what that costs in output quality.
That mix of excitement and skepticism is what makes the post useful. LocalLLaMA did not read it as proof that local inference is solved. It read it as evidence that the ceiling keeps moving. If a 27B model can plausibly become a high-context, high-throughput workstation model on one flagship consumer GPU, the conversation shifts from “can local compete?” to “what counts as a normal local setup now?” The sources are the Reddit thread and the Hugging Face model card.