LocalLLaMA Benchmarks Qwen3.5-122B at 198 tok/s on Dual RTX PRO 6000 Blackwell
Original: Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results
What the Reddit post claimed
A post in r/LocalLLaMA, at 114 upvotes and 185 comments at crawl time, published unusually concrete inference numbers for a local two-GPU setup. The author said they spent a week optimizing a server built around 2x RTX PRO 6000 Blackwell cards with 96GB GDDR7 each, an AMD EPYC 4564P, 128GB DDR5 ECC, and a c-payne PM50100 Gen5 PCIe switch. The headline number was Qwen3.5-122B at 198 tok/s for single-user decode, alongside the claim that the result was verified across three runs at roughly 197, 200, and 198 tok/s.
How much of it is actually documented
This was not a screenshot-only brag post. The Reddit thread linked a public GitHub repository containing methodology notes, launch commands, and raw benchmark JSON files. In the repository’s results.md, the best dual-GPU configuration is listed as Qwen3.5-122B NVFP4 on SGLang b12x+NEXTN at 198 tok/s. One published verification JSON reports 200.33 aggregate tokens per second for the single-concurrency run on April 8, 2026. The same table also places Qwen3.5-27B FP8 at 170 tok/s, MiniMax M2.5 at 148 tok/s, and a Qwen3.5-397B GGUF run at 79 tok/s fully in VRAM, which helps anchor the headline number in a broader benchmark set instead of an isolated anecdote.
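The 200.33 tok/s figure is an aggregate rate, i.e., total generated tokens divided by wall-clock time. A minimal sketch of how such a number falls out of a raw benchmark record (the field names and values here are hypothetical illustrations, not the repo's actual JSON schema):

```python
import json

# Hypothetical record shaped like a single-concurrency decode benchmark run.
# Field names and values are illustrative; the repo's actual JSON may differ.
raw = json.dumps({
    "model": "Qwen3.5-122B-NVFP4",
    "concurrency": 1,
    "output_tokens": 12020,
    "wall_time_s": 60.0,
})

run = json.loads(raw)
aggregate_toks = run["output_tokens"] / run["wall_time_s"]
print(f"aggregate: {aggregate_toks:.2f} tok/s")  # -> aggregate: 200.33 tok/s
```

The point is only that "aggregate tokens per second" is a wall-clock average over the whole run, which is why individual runs can cluster at 197, 200, and 198 tok/s while the published per-run figure carries two decimal places.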
Why the build is faster than expected
The author’s explanation is that the speedup is not coming from one trick alone. The post credits the PCIe switch PIX topology, SGLang b12x MoE kernels, NEXTN speculative decoding, custom multi-GPU allreduce behavior, and a modelopt_fp4 checkpoint that works with those kernels. The public results file adds a useful comparison: the same repo reports 48.7 GB/s P2P bandwidth on the PLX-based setup versus 27.9 GB/s on a TRX40 path, and summarizes the best 122B result as +68% over the compared TRX40 baseline. In other words, the post is really a topology and software-stack story, not just a new GPU story.
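The relative numbers in the results file are easy to sanity-check. A minimal sketch, using only figures quoted above (the TRX40 baseline throughput is back-derived from the +68% claim, not stated directly in the post):

```python
# Sanity-check the relative numbers quoted from the repo's results.md.

plx_p2p_gbps = 48.7     # P2P bandwidth over the PCIe-switch (PLX) path, per the post
trx40_p2p_gbps = 27.9   # P2P bandwidth over the TRX40 path, per the post
best_122b_toks = 198.0  # best dual-GPU Qwen3.5-122B decode throughput

# P2P bandwidth advantage of the switch topology.
bw_ratio = plx_p2p_gbps / trx40_p2p_gbps
print(f"P2P bandwidth ratio: {bw_ratio:.2f}x")  # ~1.75x

# The repo summarizes the 122B result as +68% over the TRX40 baseline,
# which implies a baseline of roughly 198 / 1.68 tok/s.
implied_baseline = best_122b_toks / 1.68
print(f"Implied TRX40 baseline: {implied_baseline:.0f} tok/s")  # ~118 tok/s
```

Note that the decode speedup (+68%) is smaller than the raw bandwidth ratio (~1.75x), consistent with the author's framing that the gain comes from several stacked optimizations rather than interconnect bandwidth alone.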
What to take away
The caveat is that this is still a community benchmark centered on single-user decode throughput, not a universal application-level score. The thread itself notes that TTFT rises with context length even when decode speed stays close to 198 tok/s, with examples such as 4K context at 1.8 seconds and 150K context at 23.3 seconds. Even so, the post matters because it is more reproducible than most social-media performance claims: the hardware list is explicit, the software stack is named, and raw JSON artifacts are public. For local inference practitioners, that makes the thread useful not just as hype but as a concrete tuning reference for how far dual Blackwell systems can be pushed today.
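The TTFT examples also let readers back out an approximate prefill rate. A rough calculation, assuming the quoted "4K" and "150K" contexts mean 4,096 and 150,000 prompt tokens (the thread does not give exact token counts):

```python
# Back-of-envelope prefill throughput from the thread's TTFT examples.
# Token counts are assumptions: "4K" -> 4096 tokens, "150K" -> 150_000 tokens.

ttft_examples = {
    4096: 1.8,      # seconds to first token at ~4K context
    150_000: 23.3,  # seconds to first token at ~150K context
}

for prompt_tokens, ttft_s in ttft_examples.items():
    prefill_toks = prompt_tokens / ttft_s
    print(f"{prompt_tokens:>7} tokens / {ttft_s:>5.1f} s -> ~{prefill_toks:,.0f} tok/s prefill")
```

Under those assumptions the implied prefill rate actually rises with context (roughly 2,300 tok/s at 4K versus 6,400 tok/s at 150K, likely batching effects), so the growing TTFT reflects the sheer amount of prompt to process rather than the system slowing down.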
Reddit discussion thread · Benchmark results · Raw verification JSON
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.