How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090
Original: Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
Overview
A LocalLLaMA community member shared an impressive benchmark running Qwen3.5 27B dense on consumer hardware: 100+ t/s decode speed with 170k context window and approximately 1,500 t/s prefill, achieved on a dual RTX 3090 system with NVLink.
Hardware Setup
The configuration uses two RTX 3090 GPUs connected via NVLink. The developer notes that NVLink plays a significant role here: tensor parallelism synchronizes activations between the GPUs at every layer, so the high-bandwidth interconnect directly lifts decode speed compared with PCIe-only setups.
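To confirm NVLink is actually active before benchmarking, one option is to query NVML from Python. A minimal sketch using the `pynvml` bindings (`pip install nvidia-ml-py`) is below; `nvidia-smi topo -m` reports the same topology from the shell.

```python
# Sanity-check that NVLink links are active on each GPU before launching vLLM.
# Minimal sketch: counts links NVML reports as enabled; nothing vLLM-specific.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                active += 1
        except pynvml.NVMLError:
            break  # link index not present on this device
    print(f"GPU {i}: {active} active NVLink link(s)")
pynvml.nvmlShutdown()
```

If both 3090s report zero active links, tensor parallelism falls back to PCIe and the numbers above are unlikely to be reproducible.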
Software Optimizations
Key optimizations that achieved these results:
- vLLM with tensor parallelism enabled
- MTP (Multi-Token Prediction) set to 5 predicted tokens (higher than the documented recommendation of 3)
- Mean acceptance length consistently above 3, meaning each decode step commits more than three tokens on average, which validates the higher MTP setting
The developer found values above 5 offered diminishing returns, making 5 the optimal setting for this hardware; a launch sketch with these settings follows.
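Below is a minimal launch sketch using vLLM's offline `LLM` API. The model id and the `"mtp"` speculative method are assumptions: which MTP variants and context lengths a given vLLM build supports varies by version, so check its documentation before relying on these exact values.

```python
# Sketch of a vLLM offline-inference setup mirroring the post's settings.
# Assumptions: the HF repo id and the "mtp" speculative method; both depend
# on your vLLM version and the model's released checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",         # assumed model id
    tensor_parallel_size=2,           # shard the model across both 3090s
    max_model_len=170_000,            # the 170k context window from the post
    gpu_memory_utilization=0.92,      # leave a little headroom for overhead
    speculative_config={
        "method": "mtp",              # multi-token prediction
        "num_speculative_tokens": 5,  # the setting the post converged on
    },
)

outputs = llm.generate(
    ["Explain why NVLink matters for tensor parallelism."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Pushing `num_speculative_tokens` past 5 mostly spends draft compute on tokens that end up rejected, which matches the diminishing returns the developer reported.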
Real-World Performance
Even in worst-case scenarios involving complex reasoning tasks, decode speed rarely drops below 60 t/s. For multi-user workloads, 585 t/s aggregate throughput was observed across 8 simultaneous requests, roughly 73 t/s per stream, which is enough for a small production serving environment on consumer hardware.
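One way to reproduce the multi-user figure is to fire concurrent requests at a running vLLM OpenAI-compatible server (`vllm serve ...`) and divide completion tokens by wall time. A rough sketch with the standard `openai` client follows; the endpoint, model id, and prompt are assumptions, and client-side timing gives only a coarse estimate.

```python
# Coarse client-side throughput probe for 8 concurrent requests.
# Assumes a vLLM OpenAI-compatible server is already listening on :8000.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(_):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-27B",  # assumed model id used at server launch
        messages=[{"role": "user", "content": "Explain KV caching briefly."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    total_tokens = sum(pool.map(one_request, range(8)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.0f} t/s aggregate")
```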
Significance
This demonstration shows that production-grade LLM serving of a 27B dense model is achievable on dual consumer GPUs without cloud infrastructure. The write-up gives developers a concrete reference for building cost-efficient local AI deployments.
Related Articles
r/LocalLLaMA took notice because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.
The interest went beyond a flashy speed number: a post claiming 105-108 t/s and a full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion of how much quality survives once local inference gets this fast.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.