LocalLLaMA Patch Claims Faster Qwen3.5-397B Inference on Blackwell Workstations With a K=64 Kernel Fix
Original post: "55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell"
The bottleneck the community was chasing
A March 14, 2026 post in r/LocalLLaMA described a concrete fix for a hardware-specific inference problem on SM120 Blackwell workstation GPUs such as the RTX PRO 6000. The author argues that block-scaled MoE GEMM paths for NVFP4 models were effectively broken on this class of hardware because the available tile shapes either overflowed shared memory at runtime or failed to compile cleanly. That left large models such as Qwen3.5-397B-A17B-NVFP4 stuck on slower fallback kernels.
The linked FlashInfer PR #2786 is explicit about the proposed remedy: add K=64 tile shapes for SM120 and fix a scale-factor layout mismatch that blocked K=64 compilation. The PR summary says this delivered roughly 2x single-user decode throughput on the submitter’s RTX PRO 6000 setup, while also improving higher-concurrency system throughput.
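Neither the post nor the PR summary spells out the exact CUTLASS tile configurations involved, but the shared-memory arithmetic behind the K=64 fix is easy to illustrate. The sketch below is a hypothetical back-of-the-envelope model, assuming 4-bit NVFP4 operands, one 8-bit scale factor per 16-element block, a 128×256 output tile, and multi-stage software pipelining; none of these specific numbers come from the sources, only the 99KB shared-memory budget does.

```python
def tile_smem_bytes(m, n, k, stages, operand_bits=4, sf_group=16):
    """Rough shared-memory footprint of one block-scaled GEMM tile.

    Each pipeline stage buffers an (m x k) A tile and an (n x k) B tile
    of 4-bit operands, plus one 1-byte scale factor per sf_group elements.
    Real CUTLASS kernels add padding and barriers, so this is a lower bound.
    """
    a_bytes = m * k * operand_bits // 8
    b_bytes = n * k * operand_bits // 8
    a_scales = m * k // sf_group
    b_scales = n * k // sf_group
    return stages * (a_bytes + b_bytes + a_scales + b_scales)

SMEM_BUDGET = 99 * 1024  # SM120 shared-memory limit cited in the PR

# Hypothetical illustration: a deep-K tile blows the budget,
# while a K=64 tile leaves room even for deeper pipelining.
for k, stages in [(256, 2), (128, 2), (64, 4)]:
    used = tile_smem_bytes(128, 256, k, stages)
    fits = "fits" if used <= SMEM_BUDGET else "OVERFLOWS"
    print(f"K={k:3d}, stages={stages}: {used:6d} B -> {fits}")
```

Under these assumed tile shapes, the K=256 configuration overflows the 99KB budget while K=64 fits with twice the pipeline depth, which is the general shape of the constraint the PR describes.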
What the benchmark numbers actually mean
The Reddit write-up walked through the full tuning path, from 55 tok/s under WSL2 to 119 tok/s on native Linux, then 142 tok/s after driver and configuration changes, and finally 283 tok/s after the custom K=64 kernel path. Importantly, the author also added a methodological caveat: the highest 283 tok/s figure was measured with thinking mode enabled on a short prompt, which inflates throughput because Multi-Token Prediction accepts highly predictable <think> tokens. For more realistic prompts with thinking disabled, the same post says usable single-user throughput is closer to about 130-136 tok/s.
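The post's own figures are easy to sanity-check. A short script computing the per-step and cumulative speedups shows that the K=64 kernel path alone accounts for roughly the 2x decode gain the PR summary claims:

```python
# Throughput progression reported in the r/LocalLLaMA post (tok/s).
steps = [
    ("WSL2 baseline", 55),
    ("native Linux", 119),
    ("driver + config tuning", 142),
    ("K=64 kernel path", 283),
]

prev = None
for label, toks in steps:
    if prev is None:
        print(f"{label:24s} {toks:4d} tok/s")
    else:
        print(f"{label:24s} {toks:4d} tok/s  ({toks / prev:.2f}x over previous)")
    prev = toks

print(f"cumulative: {steps[-1][1] / steps[0][1]:.2f}x over WSL2 baseline")
```

The final step, 142 → 283 tok/s, works out to 1.99x, consistent with the "roughly 2x" decode throughput claim in the PR, while the cumulative gain from the WSL2 baseline is about 5.1x.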
That clarification matters. The story is not that workstation Blackwell suddenly matches every datacenter benchmark, but that a community patch may remove an avoidable architectural penalty and recover a meaningful chunk of lost performance on local hardware.
Why LocalLLaMA cared
This is exactly the sort of issue LocalLLaMA values: not a vague “model got faster” claim, but a reproducible explanation tied to shared-memory limits, CUTLASS tile selection, and an upstreamable patch. The PR body says the fix targets SM120 GPUs with 99KB shared memory and adds support for K=64 block-scaled MoE GEMM paths that fit those constraints. If merged and propagated through the stack, that helps workstation users running Qwen3.5-397B, DeepSeek-style MoE models, and other NVFP4 workloads on local Blackwell systems.
Because the numbers are self-reported and the PR remains open, the safest interpretation is directional rather than final. But as a community-sourced engineering story, it is a strong example of how local AI performance is increasingly limited by kernel maturity and system integration, not just model weights.
Primary sources: FlashInfer PR #2786, CUTLASS issue #3096. Community discussion: r/LocalLLaMA.
Related Articles
LocalLLaMA reacted to this post because it brought hard numbers, not vendor marketing: a dual RTX 5060 Ti 16GB setup pushing Qwen3.6 27B to roughly 60 tok/s with a 204k context window.
A March 12, 2026 LocalLLaMA benchmark post claims the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs is 50.5 tok/s with Marlin, because native CUTLASS grouped GEMM paths on SM120 fail or fall back.
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.