LocalLLaMA Patch Claims Faster Qwen3.5-397B Inference on Blackwell Workstations With a K=64 Kernel Fix
Original post: "55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell"
The bottleneck the community was chasing
A March 14, 2026 post in r/LocalLLaMA described a concrete fix for a hardware-specific inference problem on SM120 Blackwell workstation GPUs such as the RTX PRO 6000. The author argues that block-scaled MoE GEMM paths for NVFP4 models were effectively broken on this class of hardware because the available tile shapes either overflowed shared memory at runtime or failed to compile cleanly. That left large models such as Qwen3.5-397B-A17B-NVFP4 stuck on slower fallback kernels.
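To see why tile shape matters here, a rough budget calculation helps. The sketch below is illustrative only: the tile dimensions, pipeline stage count, and the one-FP8-scale-per-16-elements granularity are assumptions, not FlashInfer's actual kernel configuration. It shows how a block-scaled tile's shared-memory footprint scales with the K dimension, and how a smaller K tile can fit a ~99 KB budget that a larger one overflows.

```python
# Illustrative shared-memory budget for a block-scaled GEMM tile on an
# SM120-class GPU. All tile parameters below are hypothetical; only the
# scaling behavior with K is the point.

SMEM_BUDGET = 99 * 1024  # ~99 KB of shared memory per block (per the PR)
SCALE_BLOCK = 16         # NVFP4 elements sharing one FP8 scale factor
STAGES = 4               # hypothetical pipeline stages (multi-buffering)

def tile_smem_bytes(m, n, k):
    """Bytes for one MxK A-tile and one KxN B-tile: packed 4-bit
    operands (2 elements per byte) plus 8-bit block scales, held
    across all pipeline stages."""
    a = m * k // 2 + m * k // SCALE_BLOCK
    b = k * n // 2 + k * n // SCALE_BLOCK
    return STAGES * (a + b)

for k in (128, 64):
    used = tile_smem_bytes(128, 256, k)
    print(f"K={k:3d}: {used:6d} bytes, fits={used <= SMEM_BUDGET}")
```

Under these assumed parameters, the K=128 tile overflows the budget while the K=64 tile fits with room to spare, which is the flavor of constraint the PR describes working around.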
The linked FlashInfer PR #2786 is explicit about the proposed remedy: add K=64 tile shapes for SM120 and fix a scale-factor layout mismatch that blocked K=64 compilation. The PR summary says this delivered roughly 2x single-user decode throughput on the submitter’s RTX PRO 6000 setup, while also improving higher-concurrency system throughput.
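The "scale-factor layout mismatch" part can be illustrated with a toy count. NVFP4 pairs each run of 16 four-bit elements along K with one FP8 scale factor; the sketch below is hypothetical and does not mirror FlashInfer's real layout, but it shows the failure class: a layout whose strides assume K=128 tiles expects twice as many scales per row as a K=64 tile actually consumes.

```python
# Hypothetical illustration of a scale-factor layout mismatch.
# NVFP4 stores one FP8 scale per 16 elements along K; everything else
# here is a made-up example of the general problem.

SCALE_BLOCK = 16  # NVFP4 elements covered by one FP8 scale factor

def scales_per_tile_row(tile_k):
    """Scale factors consumed by one row of a K-tile."""
    assert tile_k % SCALE_BLOCK == 0
    return tile_k // SCALE_BLOCK

# A layout hard-wired for K=128 tiles expects 8 scales per row; a K=64
# tile consumes only 4, so indexing with the K=128 stride would misread
# or skip scale groups unless the layout logic is fixed alongside the
# new tile shape.
print(scales_per_tile_row(128))  # 8
print(scales_per_tile_row(64))   # 4
```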
What the benchmark numbers actually mean
The Reddit write-up walked through the full tuning path: from 55 tok/s under WSL2, to 119 tok/s on native Linux, to 142 tok/s after driver and configuration changes, and finally to 283 tok/s with the custom K=64 kernel path. Importantly, the author added a methodological caveat: the headline 283 tok/s figure was measured with thinking mode enabled on a short prompt, which inflates throughput because Multi-Token Prediction speculates several tokens per step and highly predictable <think> tokens are accepted at a very high rate. For more realistic prompts with thinking disabled, the same post puts usable single-user throughput closer to 130-136 tok/s.
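The inflation effect is easy to see with a toy speculative-decoding model. The numbers below are hypothetical, not taken from the post: assume each decode step drafts a few extra tokens and accepts a prefix of them, with acceptance probability per token depending on how predictable the text is.

```python
# Toy model of why Multi-Token Prediction inflates tok/s on predictable
# text. base_tps, draft_len, and the acceptance rates are hypothetical;
# only the shape of the effect is the point.

def mtp_effective_tps(base_tps, draft_len, accept_rate):
    """Effective tok/s when each decode step emits 1 verified token plus
    an accepted prefix of `draft_len` drafted tokens, each surviving with
    probability `accept_rate` given the previous one survived."""
    expected_per_step = 1 + sum(accept_rate ** i
                                for i in range(1, draft_len + 1))
    return base_tps * expected_per_step

base = 100  # hypothetical single-token decode rate, tok/s
print(round(mtp_effective_tps(base, 2, 0.9)))  # predictable <think> runs: 271
print(round(mtp_effective_tps(base, 2, 0.2)))  # harder, realistic text: 124
```

With a high acceptance rate the same hardware reports more than double the tok/s of the low-acceptance case, which is why a short thinking-mode prompt is a flattering benchmark.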
That clarification matters. The story is not that workstation Blackwell suddenly matches every datacenter benchmark, but that a community patch may remove an avoidable architectural penalty and recover a meaningful chunk of lost performance on local hardware.
Why LocalLLaMA cared
This is exactly the sort of issue LocalLLaMA values: not a vague “model got faster” claim, but a reproducible explanation tied to shared-memory limits, CUTLASS tile selection, and an upstreamable patch. The PR body says the fix targets SM120 GPUs with 99KB shared memory and adds support for K=64 block-scaled MoE GEMM paths that fit those constraints. If merged and propagated through the stack, that helps workstation users running Qwen3.5-397B, DeepSeek-style MoE models, and other NVFP4 workloads on local Blackwell systems.
Because the numbers are self-reported and the PR remains open, the safest interpretation is directional rather than final. But as a community-sourced engineering story, it is a strong example of how local AI performance is increasingly limited by kernel maturity and system integration, not just model weights.
Primary sources: FlashInfer PR #2786, CUTLASS issue #3096. Community discussion: r/LocalLLaMA.
Related Articles
A March 12, 2026 LocalLLaMA benchmark post claimed the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs was 50.5 tok/s using Marlin kernels, because the native CUTLASS grouped GEMM paths on SM120 either fail outright or fall back to slower code.
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, which adds a fused GDN recurrent Metal kernel. The PR shows roughly 12-36% throughput gains on Qwen 3.5 variants, though Reddit commenters noted that the merged change can still trail MLX on some local benchmarks.
An r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput: the author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.