LocalLLaMA Sees a New Bar for Local Inference: Qwen 3.6 27B at ~80 t/s on One RTX 5090
Original: Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Local model communities care less about glossy launch charts than about what actually fits on a desk. That is why this post broke through. The author claimed a Qwen3.6-27B setup using an NVFP4+MTP Hugging Face variant and vLLM 0.19.1rc1 could hit about 80 tokens per second on a single RTX 5090 while serving a 218k context window. In LocalLLaMA terms, that is not a vague “feels fast” impression. It is the kind of number that changes what people think a one-GPU workstation can reasonably do.
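Claims like this are easy to sanity-check with back-of-the-envelope arithmetic. The helper below is my own illustration, not from the post, and it assumes a steady decode rate; in practice tokens-per-second drops as the context fills:

```python
def decode_seconds(tokens: int, tokens_per_second: float = 80.0) -> float:
    """Wall-clock time to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second

# At the claimed ~80 t/s, a 4,000-token completion takes about 50 seconds,
# and streaming out a full 218k-token window would take roughly 45 minutes.
print(decode_seconds(4_000))          # 50.0 seconds
print(decode_seconds(218_000) / 60)   # ~45.4 minutes
```

The second number is why commenters cared more about realistic prompt lengths than the theoretical maximum: nobody decodes 218k tokens in one sitting, but a coding agent's 30k-40k-token prompts are hit on every turn.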
The linked model card helps explain why the thread mattered. Qwen3.6-27B-Text-NVFP4-MTP is a text-only NVFP4-quantized sibling of Qwen/Qwen3.6-27B with the MTP head restored in bf16 so speculative decoding actually works. The repo is tuned for Blackwell-class hardware, uses the faster modelopt path, and explicitly says it should run on RTX 5090-class cards. The interesting part is not magic; it is systems work. Quantization, speculative decoding, and runtime choices are doing a lot of the lifting here.
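The post does not include the author's launch command, but a setup like the one described would typically run behind vLLM's OpenAI-compatible server. A minimal sketch, where the `<org>` prefix is a placeholder for the actual Hugging Face namespace and flag availability depends on the vLLM version (check `vllm serve --help` before relying on these):

```shell
# Hedged sketch, not the author's exact invocation: serve an
# NVFP4-quantized Qwen variant on a single GPU with vLLM.
# <org> is a placeholder for the Hugging Face organization.
vllm serve <org>/Qwen3.6-27B-Text-NVFP4-MTP \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95
```

`--max-model-len` caps the context window the server will accept, and `--gpu-memory-utilization` controls how much VRAM vLLM claims for weights plus KV cache; both matter when squeezing a 27B model and a six-figure context onto one card.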
The comment section immediately pulled the discussion back to practical questions. One reader asked the obvious operational question: what does vLLM buy over LM Studio in normal use? Another pushed on benchmark realism, noting that a giant theoretical context window matters less than the prompt length used in the actual test, especially for coding agents that quickly burn 30k to 40k tokens. Others questioned whether the speed story comes mostly from aggressive quantization, and what that costs in output quality.
That mix of excitement and skepticism is what makes the post useful. LocalLLaMA did not read it as proof that local inference is solved. It read it as evidence that the ceiling keeps moving. If a 27B model can plausibly become a high-context, high-throughput workstation model on one flagship consumer GPU, the conversation shifts from “can local compete?” to “what counts as a normal local setup now?” The sources are the Reddit thread and the Hugging Face model card.