LocalLLaMA Benchmarks Qwen3.5-122B at 198 tok/s on Dual RTX PRO 6000 Blackwell
Original: Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results
What the Reddit post claimed
A post in r/LocalLLaMA, at 114 upvotes and 185 comments at crawl time, published unusually concrete inference numbers for a local two-GPU setup. The author said they spent a week optimizing a server built around 2x RTX PRO 6000 Blackwell cards with 96GB GDDR7 each, an AMD EPYC 4564P, 128GB DDR5 ECC, and a c-payne PM50100 Gen5 PCIe switch. The headline number was Qwen3.5-122B at 198 tok/s for single-user decode, alongside the claim that the result was verified across three runs at roughly 197, 200, and 198 tok/s.
How much of it is actually documented
This was not a screenshot-only brag post. The Reddit thread linked a public GitHub repository containing methodology notes, launch commands, and raw benchmark JSON files. In the repository’s results.md, the best dual-GPU configuration is listed as Qwen3.5-122B NVFP4 on SGLang b12x+NEXTN at 198 tok/s. One published verification JSON reports 200.33 aggregate tokens per second for the single-concurrency run on April 8, 2026. The same table also places Qwen3.5-27B FP8 at 170 tok/s, MiniMax M2.5 at 148 tok/s, and a Qwen3.5-397B GGUF run at 79 tok/s fully in VRAM, which helps anchor the headline number in a broader benchmark set instead of an isolated anecdote.
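The 200.33 tok/s figure is an aggregate rate, i.e., total generated tokens divided by wall-clock time. A minimal sketch of how such a number falls out of a raw benchmark record (the field names and values here are hypothetical illustrations, not the repo's actual JSON schema):

```python
import json

# Hypothetical record shaped like a single-concurrency decode benchmark run.
# Field names and values are illustrative; the repo's actual JSON may differ.
raw = json.dumps({
    "model": "Qwen3.5-122B-NVFP4",
    "concurrency": 1,
    "output_tokens": 12020,
    "wall_time_s": 60.0,
})

run = json.loads(raw)
aggregate_toks = run["output_tokens"] / run["wall_time_s"]
print(f"aggregate: {aggregate_toks:.2f} tok/s")  # -> aggregate: 200.33 tok/s
```

The point is only that "aggregate tokens per second" is a wall-clock average over the whole run, which is why individual runs can cluster at 197, 200, and 198 tok/s while the published per-run figure carries two decimal places.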
Why the build is faster than expected
The author’s explanation is that the speedup is not coming from one trick alone. The post credits the PCIe switch PIX topology, SGLang b12x MoE kernels, NEXTN speculative decoding, custom multi-GPU allreduce behavior, and a modelopt_fp4 checkpoint that works with those kernels. The public results file adds a useful comparison: the same repo reports 48.7 GB/s P2P bandwidth on the PLX-based setup versus 27.9 GB/s on a TRX40 path, and summarizes the best 122B result as +68% over the compared TRX40 baseline. In other words, the post is really a topology and software-stack story, not just a new GPU story.
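The relative numbers in the results file are easy to sanity-check. A minimal sketch, using only figures quoted above (the TRX40 baseline throughput is back-derived from the +68% claim, not stated directly in the post):

```python
# Sanity-check the relative numbers quoted from the repo's results.md.

plx_p2p_gbps = 48.7     # P2P bandwidth over the PCIe-switch (PLX) path, per the post
trx40_p2p_gbps = 27.9   # P2P bandwidth over the TRX40 path, per the post
best_122b_toks = 198.0  # best dual-GPU Qwen3.5-122B decode throughput

# P2P bandwidth advantage of the switch topology.
bw_ratio = plx_p2p_gbps / trx40_p2p_gbps
print(f"P2P bandwidth ratio: {bw_ratio:.2f}x")  # ~1.75x

# The repo summarizes the 122B result as +68% over the TRX40 baseline,
# which implies a baseline of roughly 198 / 1.68 tok/s.
implied_baseline = best_122b_toks / 1.68
print(f"Implied TRX40 baseline: {implied_baseline:.0f} tok/s")  # ~118 tok/s
```

Note that the decode speedup (+68%) is smaller than the raw bandwidth ratio (~1.75x), consistent with the author's framing that the gain comes from several stacked optimizations rather than interconnect bandwidth alone.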
What to take away
The caveat is that this is still a community benchmark centered on single-user decode throughput, not a universal application-level score. The thread itself notes that TTFT rises with context length even when decode speed stays close to 198 tok/s, with examples such as 4K context at 1.8 seconds and 150K context at 23.3 seconds. Even so, the post matters because it is more reproducible than most social-media performance claims: the hardware list is explicit, the software stack is named, and raw JSON artifacts are public. For local inference practitioners, that makes the thread useful not just as hype but as a concrete tuning reference for how far dual Blackwell systems can be pushed today.
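The TTFT examples also let readers back out an approximate prefill rate. A rough calculation, assuming the quoted "4K" and "150K" contexts mean 4,096 and 150,000 prompt tokens (the thread does not give exact token counts):

```python
# Back-of-envelope prefill throughput from the thread's TTFT examples.
# Token counts are assumptions: "4K" -> 4096 tokens, "150K" -> 150_000 tokens.

ttft_examples = {
    4096: 1.8,      # seconds to first token at ~4K context
    150_000: 23.3,  # seconds to first token at ~150K context
}

for prompt_tokens, ttft_s in ttft_examples.items():
    prefill_toks = prompt_tokens / ttft_s
    print(f"{prompt_tokens:>7} tokens / {ttft_s:>5.1f} s -> ~{prefill_toks:,.0f} tok/s prefill")
```

Under those assumptions the implied prefill rate actually rises with context (roughly 2,300 tok/s at 4K versus 6,400 tok/s at 150K, likely batching effects), so the growing TTFT reflects the sheer amount of prompt to process rather than the system slowing down.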
Reddit discussion thread · Benchmark results · Raw verification JSON
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.