How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090

Original post: "Running Qwen3.5 27b dense with 170k context at 100+ t/s decode and ~1500 t/s prefill on 2x3090 (with 585 t/s throughput for 8 simultaneous requests)"

LLM · Mar 2, 2026 · By Insights AI (Reddit)

Overview

A LocalLLaMA community member shared an impressive benchmark running Qwen3.5 27B dense on consumer hardware: 100+ t/s decode speed with 170k context window and approximately 1,500 t/s prefill, achieved on a dual RTX 3090 system with NVLink.

Hardware Setup

The configuration uses two RTX 3090 GPUs connected via NVLink. The poster notes that NVLink is a significant factor in tensor-parallel performance: inter-GPU communication runs over the high-bandwidth NVLink interconnect rather than over the much slower PCIe bus that bridge-less dual-GPU setups rely on.
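A setup like this can be sketched as follows. The `nvidia-smi topo` check and the `--tensor-parallel-size` / `--max-model-len` flags are standard vLLM usage; the model identifier and the exact speculative/MTP configuration keys are assumptions for illustration, not taken from the post.

```shell
# Confirm the two GPUs are actually bridged via NVLink:
# the topology matrix should show an NV# entry between GPU0 and GPU1.
nvidia-smi topo -m

# Hypothetical launch sketch (model name and speculative-config keys assumed):
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 170000 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```

If the topology matrix shows only `PHB` or `PIX` between the GPUs, tensor-parallel all-reduce traffic falls back to PCIe and decode throughput will be noticeably lower.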

Software Optimizations

Key optimizations that achieved these results:

  • vLLM with tensor parallelism enabled
  • MTP (Multi-Token Prediction) set to 5 predicted tokens (higher than the documented recommendation of 3)
  • Mean acceptance length consistently above 3, validating the higher MTP setting

The developer found values above 5 offered diminishing returns, making 5 the optimal setting for this hardware.

Real-World Performance

Even in worst-case scenarios involving complex reasoning tasks, decode speed rarely drops below 60 t/s. For multi-user workloads, 585 t/s aggregate throughput across 8 simultaneous requests was observed — sufficient for a production serving environment on consumer hardware.
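The batching figures can be sanity-checked with simple arithmetic (numbers from the post; "efficiency" here is measured against perfectly linear scaling, which batched decoding never reaches since concurrent requests share memory bandwidth):

```python
single_stream = 100.0  # t/s, single-request decode speed (from the post)
aggregate = 585.0      # t/s across 8 simultaneous requests (from the post)
n_requests = 8

per_request = aggregate / n_requests            # tokens/s each request still gets
batching_gain = aggregate / single_stream       # aggregate speedup from batching
efficiency = batching_gain / n_requests         # fraction of linear scaling

print(round(per_request, 1), round(batching_gain, 2), round(efficiency, 2))
# → 73.1 5.85 0.73: each of 8 users still sees ~73 t/s
```

Roughly 73% of linear scaling at batch size 8 is the kind of figure that makes multi-user serving on consumer GPUs plausible.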

Significance

This demonstration shows that production-grade LLM serving of a 27B dense model is achievable on dual consumer GPUs without cloud infrastructure. The practical guide offers developers a concrete reference for building cost-efficient local AI deployments.


© 2026 Insights. All rights reserved.