How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090
Original: Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
Overview
A LocalLLaMA community member shared an impressive benchmark running Qwen3.5 27B dense on consumer hardware: 100+ t/s decode speed with 170k context window and approximately 1,500 t/s prefill, achieved on a dual RTX 3090 system with NVLink.
Hardware Setup
The configuration uses two RTX 3090 GPUs connected via NVLink. The developer notes that NVLink plays a significant role here: tensor parallelism synchronizes activations between the GPUs at every layer, so the high-bandwidth interconnect directly lifts decode speed compared with PCIe-only setups.
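To confirm NVLink is actually active before benchmarking, one option is to query NVML from Python. A minimal sketch using the `pynvml` bindings (`pip install nvidia-ml-py`) is below; `nvidia-smi topo -m` reports the same topology from the shell.

```python
# Sanity-check that NVLink links are active on each GPU before launching vLLM.
# Minimal sketch: counts links NVML reports as enabled; nothing vLLM-specific.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                active += 1
        except pynvml.NVMLError:
            break  # link index not present on this device
    print(f"GPU {i}: {active} active NVLink link(s)")
pynvml.nvmlShutdown()
```

If both 3090s report zero active links, tensor parallelism falls back to PCIe and the numbers above are unlikely to be reproducible.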
Software Optimizations
Key optimizations that achieved these results:
- vLLM with tensor parallelism enabled
- MTP (Multi-Token Prediction) set to 5 predicted tokens (higher than the documented recommendation of 3)
- Mean acceptance length consistently above 3, meaning each decode step commits more than three tokens on average, which validates the higher MTP setting
The developer found values above 5 offered diminishing returns, making 5 the optimal setting for this hardware; a launch sketch with these settings follows.
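Below is a minimal launch sketch using vLLM's offline `LLM` API. The model id and the `"mtp"` speculative method are assumptions: which MTP variants and context lengths a given vLLM build supports varies by version, so check its documentation before relying on these exact values.

```python
# Sketch of a vLLM offline-inference setup mirroring the post's settings.
# Assumptions: the HF repo id and the "mtp" speculative method; both depend
# on your vLLM version and the model's released checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",         # assumed model id
    tensor_parallel_size=2,           # shard the model across both 3090s
    max_model_len=170_000,            # the 170k context window from the post
    gpu_memory_utilization=0.92,      # leave a little headroom for overhead
    speculative_config={
        "method": "mtp",              # multi-token prediction
        "num_speculative_tokens": 5,  # the setting the post converged on
    },
)

outputs = llm.generate(
    ["Explain why NVLink matters for tensor parallelism."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Pushing `num_speculative_tokens` past 5 mostly spends draft compute on tokens that end up rejected, which matches the diminishing returns the developer reported.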
Real-World Performance
Even in worst-case scenarios involving complex reasoning tasks, decode speed rarely drops below 60 t/s. For multi-user workloads, 585 t/s aggregate throughput was observed across 8 simultaneous requests, roughly 73 t/s per stream, which is enough for a small production serving environment on consumer hardware.
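One way to reproduce the multi-user figure is to fire concurrent requests at a running vLLM OpenAI-compatible server (`vllm serve ...`) and divide completion tokens by wall time. A rough sketch with the standard `openai` client follows; the endpoint, model id, and prompt are assumptions, and client-side timing gives only a coarse estimate.

```python
# Coarse client-side throughput probe for 8 concurrent requests.
# Assumes a vLLM OpenAI-compatible server is already listening on :8000.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(_):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-27B",  # assumed model id used at server launch
        messages=[{"role": "user", "content": "Explain KV caching briefly."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    total_tokens = sum(pool.map(one_request, range(8)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.0f} t/s aggregate")
```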
Significance
This demonstration shows that production-grade LLM serving of a 27B dense model is achievable on dual consumer GPUs without cloud infrastructure. The write-up gives developers a concrete reference for building cost-efficient local AI deployments.
Related Articles
r/LocalLLaMA took notice because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.
The interest went beyond a flashy speed number: a post claiming 105-108 t/s and a full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion of how much quality survives once local inference gets this fast.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.