LocalLLaMA Follows a 1.1M Tok/s Qwen 3.5 27B Run as vLLM Tuning Becomes the Real Story

Original: Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub

LLM · Mar 28, 2026 · By Insights AI (Reddit) · 2 min read

A throughput number big enough to force people to look at the setup

A March 26, 2026 r/LocalLLaMA post linked a Google Cloud community write-up describing how Qwen 3.5 27B was pushed to 1,103,941 total tokens per second on 12 nodes with 96 NVIDIA B200 GPUs using vLLM. The Reddit thread reached 205 points and 52 comments at crawl time. The headline number is large, but the more useful part of the write-up is that it documents the failed paths and tuning decisions instead of pretending the result was obvious.

The author says the model choice was deliberate. Qwen 3.5 27B is the dense variant, not an MoE sibling, so every parameter is active on every token. That makes it harder to accelerate than sparse alternatives, but also makes the result more interesting for operators who care about quality under heavy output workloads. The post also notes the model's hybrid GDN plus grouped-query-attention design, 262K native context window, and Apache 2.0 license.

Why the serving strategy changed the outcome

The write-up says the first instinct was tensor parallelism across eight GPUs per node. That only lifted throughput from roughly 9,500 to about 22,300 tokens per second, because synchronization overhead dominated. Switching to data parallelism, so that each GPU hosted a full copy of the roughly 29 GB model with no inter-GPU coordination, immediately pushed the system to about 74,848 tokens per second. From there, context-window tuning mattered more than many teams might expect: dropping the configured maximum length from 131K to a few thousand tokens freed KV-cache capacity and lifted throughput again.
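The parallelism switch described above can be sketched in configuration terms. This is a minimal, hedged sketch using vLLM's offline `LLM` API, not the actual configs from the linked GitHub repo: the model id and context value are illustrative, and argument availability (notably `data_parallel_size`) varies across vLLM releases, so check the documentation for your version. Running it requires GPUs and an installed vLLM.

```python
# Sketch of the two serving strategies discussed above, using vLLM's
# offline LLM API. Illustrative only; not the write-up's exact configs.
from vllm import LLM

# First attempt: tensor parallelism across the 8 GPUs in one node.
# Per the write-up, synchronization overhead dominated (~22.3K tok/s).
tp_engine = LLM(
    model="Qwen/Qwen3.5-27B",   # illustrative model id
    tensor_parallel_size=8,
)

# Winning layout: data parallelism, one full ~29 GB model copy per GPU
# with no inter-GPU coordination, plus a realistic context cap so more
# memory goes to KV cache instead of an unused 131K window.
dp_engine = LLM(
    model="Qwen/Qwen3.5-27B",
    tensor_parallel_size=1,
    data_parallel_size=8,        # argument name varies by vLLM release
    max_model_len=4096,          # down from the 131K configured default
)
```

The design point is that data parallelism trades memory (a full weight copy per GPU) for the elimination of all-reduce synchronization, which pays off when the model fits comfortably in a single GPU's memory.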

The real breakthrough came from enabling FP8 KV cache and using MTP-1 speculative decoding in vLLM 0.18.0. In the article's measurements, removing MTP pulled throughput down by about a third and dropped GPU compute utilization back toward idle, while the optimized single-node setup hit about 96,000 tokens per second before multi-node scaling. The Reddit post adds two more high-level results: about 97.1% scaling efficiency at eight nodes and 96.5% at twelve, plus a reported 35% overhead penalty when using Inference Gateway with KV-aware routing instead of simpler ClusterIP round-robin.
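The reported scaling figures can be sanity-checked with simple arithmetic. The sketch below uses only the rounded numbers quoted above (96,000 tok/s single-node, 1,103,941 tok/s at twelve nodes); it lands near the article's 96.5% figure, with the gap plausibly down to rounding of the single-node baseline.

```python
# Sanity-check multi-node scaling efficiency from the write-up's
# rounded figures: efficiency = actual / (nodes * per_node_throughput).
single_node = 96_000          # optimized single-node tok/s (rounded)
twelve_node = 1_103_941       # reported 12-node total tok/s

efficiency = twelve_node / (12 * single_node)
print(f"12-node scaling efficiency: {efficiency:.1%}")  # ~95.8% with these rounded inputs

# Implied 8-node total from the reported 97.1% efficiency.
eight_node = 8 * single_node * 0.971
print(f"implied 8-node total: {eight_node:,.0f} tok/s")  # 745,728 tok/s
```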

Why LocalLLaMA paid attention

The thread resonated because it turns a flashy infrastructure benchmark into an operational playbook. The main lesson is not that B200s are fast. Everyone already expected that. The lesson is that serving stack choices such as TP versus DP, speculative decoding, KV-cache dtype, and realistic context sizing can matter more than the raw accelerator spec sheet. That is especially relevant for open-model deployments, where teams are often deciding whether to spend money on more hardware or first fix their inference configuration.

The post author disclosed working for Google Cloud, so the numbers should be read as an optimized vendor-affiliated result rather than a neutral baseline. Even so, the engineering details are concrete enough to matter, and the linked GitHub configs make the claim more reproducible than a typical marketing benchmark.

Primary source: Google Cloud community write-up. Community discussion: r/LocalLLaMA.




© 2026 Insights. All rights reserved.