How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090
Original: Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
Overview
A LocalLLaMA community member shared an impressive benchmark running Qwen3.5 27B dense on consumer hardware: 100+ t/s decode speed with 170k context window and approximately 1,500 t/s prefill, achieved on a dual RTX 3090 system with NVLink.
Hardware Setup
The configuration uses two RTX 3090 GPUs connected via NVLink. The developer notes that NVLink matters for tensor parallelism: every layer requires an all-reduce between the GPUs, so the high-bandwidth NVLink interconnect avoids the synchronization bottleneck that PCIe-only setups hit.
Software Optimizations
Key optimizations that achieved these results:
- vLLM with tensor parallelism enabled
- MTP (Multi-Token Prediction) set to 5 predicted tokens (higher than the documented recommendation of 3)
- Mean acceptance length consistently above 3, validating the higher MTP setting
The developer found values above 5 offered diminishing returns, making 5 the optimal setting for this hardware.
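The diminishing returns above 5 match what a simple geometric acceptance model (the standard analysis for speculative decoding) predicts. The sketch below uses an illustrative per-token acceptance probability of 0.8, which is an assumption chosen to reproduce a mean acceptance length just above 3 at a draft length of 5, not a value reported in the post:

```python
def expected_accepted(p: float, k: int) -> float:
    """Expected tokens emitted per target-model step under a geometric
    acceptance model: each of the k draft tokens is accepted with
    probability p, and the target model always contributes one token.
    This is sum_{i=0}^{k} p**i = (1 - p**(k+1)) / (1 - p)."""
    return sum(p**i for i in range(k + 1))

# Illustrative p = 0.8 (assumed, not measured):
for k in (3, 5, 7):
    print(k, round(expected_accepted(0.8, k), 2))
# k=3 -> 2.95, k=5 -> 3.69, k=7 -> 4.16
```

Note how each additional pair of draft tokens buys less: going from 3 to 5 adds about 0.74 expected tokens per step, while 5 to 7 adds only about 0.47, which is consistent with 5 being the sweet spot.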
Real-World Performance
Even in worst-case scenarios involving complex reasoning tasks, decode speed rarely drops below 60 t/s. For multi-user workloads, the developer observed 585 t/s aggregate throughput across 8 simultaneous requests, enough to back a small production serving environment on consumer hardware.
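Taking the reported figures at face value, the aggregate numbers imply roughly 73 t/s per concurrent user and a near-6x total-throughput gain over single-stream decoding. A quick back-of-envelope check:

```python
# Figures reported in the post.
aggregate_tps = 585       # total throughput, 8 concurrent requests
concurrency = 8
single_stream_tps = 100   # conservative single-request decode speed

per_request = aggregate_tps / concurrency            # tokens/s per user
batching_gain = aggregate_tps / single_stream_tps    # throughput multiplier

print(f"{per_request:.1f} t/s per request")   # 73.1 t/s per request
print(f"{batching_gain:.2f}x from batching")  # 5.85x from batching
```

Per-user speed stays comfortably above typical reading speed even at full concurrency, which is why the post frames this as production-viable.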
Significance
This demonstration shows that production-grade LLM serving of a 27B dense model is achievable on dual consumer GPUs without cloud infrastructure. The practical guide offers developers a concrete reference for building cost-efficient local AI deployments.
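For readers who want a starting point, a launch command for a setup like the one described might look like the sketch below. This is a configuration sketch, not the developer's exact invocation: the model path is hypothetical, and the `--speculative-config` JSON shape follows recent vLLM releases, so verify the flags against your installed version's documentation.

```shell
# Sketch only: model path is hypothetical; flag spellings follow recent
# vLLM releases and may differ in your installed version.
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 170000 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```

The two load-bearing pieces relative to a default launch are `--tensor-parallel-size 2` (splitting the model across both 3090s over NVLink) and the speculative config requesting 5 MTP draft tokens, matching the tuning described above.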
Related Articles
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
Alibaba released the Qwen3.5 small model series (0.8B, 4B, 9B). The 9B model achieves performance comparable to GPT-oss 20B–120B, making high-quality local inference accessible to users with modest GPU hardware.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.