LocalLLaMA Patch Claims Faster Qwen3.5-397B Inference on Blackwell Workstations With a K=64 Kernel Fix
Original post: "55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell"
The bottleneck the community was chasing
A March 14, 2026 post in r/LocalLLaMA described a concrete fix for a hardware-specific inference problem on SM120 Blackwell workstation GPUs such as the RTX PRO 6000. The author argues that block-scaled MoE GEMM paths for NVFP4 models were effectively broken on this class of hardware because the available tile shapes either overflowed shared memory at runtime or failed to compile cleanly. That left large models such as Qwen3.5-397B-A17B-NVFP4 stuck on slower fallback kernels.
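To see why tile shape matters here, a rough budget calculation helps. The sketch below is illustrative only: the tile dimensions, pipeline stage count, and the one-FP8-scale-per-16-elements granularity are assumptions, not FlashInfer's actual kernel configuration. It shows how a block-scaled tile's shared-memory footprint scales with the K dimension, and how a smaller K tile can fit a ~99 KB budget that a larger one overflows.

```python
# Illustrative shared-memory budget for a block-scaled GEMM tile on an
# SM120-class GPU. All tile parameters below are hypothetical; only the
# scaling behavior with K is the point.

SMEM_BUDGET = 99 * 1024  # ~99 KB of shared memory per block (per the PR)
SCALE_BLOCK = 16         # NVFP4 elements sharing one FP8 scale factor
STAGES = 4               # hypothetical pipeline stages (multi-buffering)

def tile_smem_bytes(m, n, k):
    """Bytes for one MxK A-tile and one KxN B-tile: packed 4-bit
    operands (2 elements per byte) plus 8-bit block scales, held
    across all pipeline stages."""
    a = m * k // 2 + m * k // SCALE_BLOCK
    b = k * n // 2 + k * n // SCALE_BLOCK
    return STAGES * (a + b)

for k in (128, 64):
    used = tile_smem_bytes(128, 256, k)
    print(f"K={k:3d}: {used:6d} bytes, fits={used <= SMEM_BUDGET}")
```

Under these assumed parameters, the K=128 tile overflows the budget while the K=64 tile fits with room to spare, which is the flavor of constraint the PR describes working around.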
The linked FlashInfer PR #2786 is explicit about the proposed remedy: add K=64 tile shapes for SM120 and fix a scale-factor layout mismatch that blocked K=64 compilation. The PR summary says this delivered roughly 2x single-user decode throughput on the submitter’s RTX PRO 6000 setup, while also improving higher-concurrency system throughput.
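The "scale-factor layout mismatch" part can be illustrated with a toy count. NVFP4 pairs each run of 16 four-bit elements along K with one FP8 scale factor; the sketch below is hypothetical and does not mirror FlashInfer's real layout, but it shows the failure class: a layout whose strides assume K=128 tiles expects twice as many scales per row as a K=64 tile actually consumes.

```python
# Hypothetical illustration of a scale-factor layout mismatch.
# NVFP4 stores one FP8 scale per 16 elements along K; everything else
# here is a made-up example of the general problem.

SCALE_BLOCK = 16  # NVFP4 elements covered by one FP8 scale factor

def scales_per_tile_row(tile_k):
    """Scale factors consumed by one row of a K-tile."""
    assert tile_k % SCALE_BLOCK == 0
    return tile_k // SCALE_BLOCK

# A layout hard-wired for K=128 tiles expects 8 scales per row; a K=64
# tile consumes only 4, so indexing with the K=128 stride would misread
# or skip scale groups unless the layout logic is fixed alongside the
# new tile shape.
print(scales_per_tile_row(128))  # 8
print(scales_per_tile_row(64))   # 4
```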
What the benchmark numbers actually mean
The Reddit write-up walked through the full tuning path: from 55 tok/s under WSL2, to 119 tok/s on native Linux, to 142 tok/s after driver and configuration changes, and finally to 283 tok/s with the custom K=64 kernel path. Importantly, the author added a methodological caveat: the headline 283 tok/s figure was measured with thinking mode enabled on a short prompt, which inflates throughput because Multi-Token Prediction speculates several tokens per step and highly predictable <think> tokens are accepted at a very high rate. For more realistic prompts with thinking disabled, the same post puts usable single-user throughput closer to 130-136 tok/s.
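The inflation effect is easy to see with a toy speculative-decoding model. The numbers below are hypothetical, not taken from the post: assume each decode step drafts a few extra tokens and accepts a prefix of them, with acceptance probability per token depending on how predictable the text is.

```python
# Toy model of why Multi-Token Prediction inflates tok/s on predictable
# text. base_tps, draft_len, and the acceptance rates are hypothetical;
# only the shape of the effect is the point.

def mtp_effective_tps(base_tps, draft_len, accept_rate):
    """Effective tok/s when each decode step emits 1 verified token plus
    an accepted prefix of `draft_len` drafted tokens, each surviving with
    probability `accept_rate` given the previous one survived."""
    expected_per_step = 1 + sum(accept_rate ** i
                                for i in range(1, draft_len + 1))
    return base_tps * expected_per_step

base = 100  # hypothetical single-token decode rate, tok/s
print(round(mtp_effective_tps(base, 2, 0.9)))  # predictable <think> runs: 271
print(round(mtp_effective_tps(base, 2, 0.2)))  # harder, realistic text: 124
```

With a high acceptance rate the same hardware reports more than double the tok/s of the low-acceptance case, which is why a short thinking-mode prompt is a flattering benchmark.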
That clarification matters. The story is not that workstation Blackwell suddenly matches every datacenter benchmark, but that a community patch may remove an avoidable architectural penalty and recover a meaningful chunk of lost performance on local hardware.
Why LocalLLaMA cared
This is exactly the sort of issue LocalLLaMA values: not a vague “model got faster” claim, but a reproducible explanation tied to shared-memory limits, CUTLASS tile selection, and an upstreamable patch. The PR body says the fix targets SM120 GPUs with 99KB shared memory and adds support for K=64 block-scaled MoE GEMM paths that fit those constraints. If merged and propagated through the stack, that helps workstation users running Qwen3.5-397B, DeepSeek-style MoE models, and other NVFP4 workloads on local Blackwell systems.
Because the numbers are self-reported and the PR remains open, the safest interpretation is directional rather than final. But as a community-sourced engineering story, it is a strong example of how local AI performance is increasingly limited by kernel maturity and system integration, not just model weights.
Primary sources: FlashInfer PR #2786, CUTLASS issue #3096. Community discussion: r/LocalLLaMA.
Related Articles
A March 12, 2026 LocalLLaMA benchmark post claimed the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs was 50.5 tok/s using Marlin kernels, because the native CUTLASS grouped GEMM paths on SM120 either fail outright or fall back to slower code.
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, which adds a fused GDN recurrent Metal kernel. The PR shows roughly 12-36% throughput gains on Qwen 3.5 variants, though Reddit commenters noted that the merged change can still trail MLX on some local benchmarks.
An r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput: the author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.