How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090
Original: Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
Overview
A LocalLLaMA community member shared an impressive benchmark running Qwen3.5 27B dense on consumer hardware: 100+ t/s decode speed with 170k context window and approximately 1,500 t/s prefill, achieved on a dual RTX 3090 system with NVLink.
Hardware Setup
The configuration uses two RTX 3090 GPUs connected via NVLink. The developer notes that NVLink matters for tensor parallelism: every layer requires an all-reduce between the GPUs, so the high-bandwidth NVLink interconnect avoids the synchronization bottleneck that PCIe-only setups hit.
Software Optimizations
Key optimizations that achieved these results:
- vLLM with tensor parallelism enabled
- MTP (Multi-Token Prediction) set to 5 predicted tokens (higher than the documented recommendation of 3)
- Mean acceptance length consistently above 3, validating the higher MTP setting
The developer found values above 5 offered diminishing returns, making 5 the optimal setting for this hardware.
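The diminishing returns above 5 match what a simple geometric acceptance model (the standard analysis for speculative decoding) predicts. The sketch below uses an illustrative per-token acceptance probability of 0.8, which is an assumption chosen to reproduce a mean acceptance length just above 3 at a draft length of 5, not a value reported in the post:

```python
def expected_accepted(p: float, k: int) -> float:
    """Expected tokens emitted per target-model step under a geometric
    acceptance model: each of the k draft tokens is accepted with
    probability p, and the target model always contributes one token.
    This is sum_{i=0}^{k} p**i = (1 - p**(k+1)) / (1 - p)."""
    return sum(p**i for i in range(k + 1))

# Illustrative p = 0.8 (assumed, not measured):
for k in (3, 5, 7):
    print(k, round(expected_accepted(0.8, k), 2))
# k=3 -> 2.95, k=5 -> 3.69, k=7 -> 4.16
```

Note how each additional pair of draft tokens buys less: going from 3 to 5 adds about 0.74 expected tokens per step, while 5 to 7 adds only about 0.47, which is consistent with 5 being the sweet spot.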
Real-World Performance
Even in worst-case scenarios involving complex reasoning tasks, decode speed rarely drops below 60 t/s. For multi-user workloads, the developer observed 585 t/s aggregate throughput across 8 simultaneous requests, enough to back a small production serving environment on consumer hardware.
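Taking the reported figures at face value, the aggregate numbers imply roughly 73 t/s per concurrent user and a near-6x total-throughput gain over single-stream decoding. A quick back-of-envelope check:

```python
# Figures reported in the post.
aggregate_tps = 585       # total throughput, 8 concurrent requests
concurrency = 8
single_stream_tps = 100   # conservative single-request decode speed

per_request = aggregate_tps / concurrency            # tokens/s per user
batching_gain = aggregate_tps / single_stream_tps    # throughput multiplier

print(f"{per_request:.1f} t/s per request")   # 73.1 t/s per request
print(f"{batching_gain:.2f}x from batching")  # 5.85x from batching
```

Per-user speed stays comfortably above typical reading speed even at full concurrency, which is why the post frames this as production-viable.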
Significance
This demonstration shows that production-grade LLM serving of a 27B dense model is achievable on dual consumer GPUs without cloud infrastructure. The practical guide offers developers a concrete reference for building cost-efficient local AI deployments.
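For readers who want a starting point, a launch command for a setup like the one described might look like the sketch below. This is a configuration sketch, not the developer's exact invocation: the model path is hypothetical, and the `--speculative-config` JSON shape follows recent vLLM releases, so verify the flags against your installed version's documentation.

```shell
# Sketch only: model path is hypothetical; flag spellings follow recent
# vLLM releases and may differ in your installed version.
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 170000 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```

The two load-bearing pieces relative to a default launch are `--tensor-parallel-size 2` (splitting the model across both 3090s over NVLink) and the speculative config requesting 5 MTP draft tokens, matching the tuning described above.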
Related Articles
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
Alibaba released the Qwen3.5 small model series (0.8B, 4B, 9B). The 9B model achieves performance comparable to GPT-oss 20B–120B, making high-quality local inference accessible to users with modest GPU hardware.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.