LocalLLaMA Patch Claims Faster Qwen3.5-397B Inference on Blackwell Workstations With a K=64 Kernel Fix

Original post: "55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell"

LLM · Mar 15, 2026 · By Insights AI (Reddit) · 2 min read

The bottleneck the community was chasing

A March 14, 2026 post in r/LocalLLaMA described a concrete fix for a hardware-specific inference problem on SM120 Blackwell workstation GPUs such as the RTX PRO 6000. The author argues that block-scaled MoE GEMM paths for NVFP4 models were effectively broken on this class of hardware because the available tile shapes either overflowed shared memory at runtime or failed to compile cleanly. That left large models such as Qwen3.5-397B-A17B-NVFP4 stuck on slower fallback kernels.

The linked FlashInfer PR #2786 is explicit about the proposed remedy: add K=64 tile shapes for SM120 and fix a scale-factor layout mismatch that blocked K=64 compilation. The PR summary says this delivered roughly 2x single-user decode throughput on the submitter’s RTX PRO 6000 setup, while also improving higher-concurrency system throughput.

What the benchmark numbers actually mean

The Reddit write-up walked through the full tuning path, from 55 tok/s under WSL2 to 119 tok/s on native Linux, then 142 tok/s after driver and configuration changes, and finally 283 tok/s after the custom K=64 kernel path. Importantly, the author also added a methodological caveat: the highest 283 tok/s figure was measured with thinking mode enabled on a short prompt, which inflates throughput because Multi-Token Prediction accepts highly predictable <think> tokens. For more realistic prompts with thinking disabled, the same post says usable single-user throughput is closer to about 130-136 tok/s.
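The gap between the 283 tok/s headline and the ~130-136 tok/s realistic figure can be illustrated with a simplified speculative-decoding model. The sketch below is not from the post; the acceptance rates, draft length, and independence assumption are all illustrative, but they show why a stream of near-certain `<think>` tokens inflates measured throughput.

```python
# Hedged sketch: why Multi-Token Prediction (MTP) inflates decode throughput
# on highly predictable token streams. All numbers are illustrative.

def mtp_throughput(base_tok_s: float, draft_len: int, accept_rate: float) -> float:
    """Effective tokens/s when each forward pass emits 1 verified token
    plus up to `draft_len` speculative tokens, each accepted with
    probability `accept_rate` (a simplified, independent-acceptance model)."""
    expected = 1.0  # the guaranteed token per forward pass
    p = 1.0
    for _ in range(draft_len):
        p *= accept_rate           # a draft counts only if all earlier drafts held
        expected += p
    return base_tok_s * expected

base = 133.0  # midpoint of the post's ~130-136 tok/s "realistic" range

# Predictable <think> filler: near-certain acceptance inflates throughput.
print(f"high acceptance: {mtp_throughput(base, draft_len=1, accept_rate=0.95):.0f} tok/s")
# Ordinary prose: lower acceptance, far smaller gain.
print(f"low acceptance:  {mtp_throughput(base, draft_len=1, accept_rate=0.40):.0f} tok/s")
```

Under these assumed rates, the same base decode speed reads very differently depending on how guessable the output is, which is exactly the caveat the author flagged.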

That clarification matters. The story is not that workstation Blackwell suddenly matches every datacenter benchmark, but that a community patch may remove an avoidable architectural penalty and recover a meaningful chunk of lost performance on local hardware.

Why LocalLLaMA cared

This is exactly the sort of issue LocalLLaMA values: not a vague “model got faster” claim, but a reproducible explanation tied to shared-memory limits, CUTLASS tile selection, and an upstreamable patch. The PR body says the fix targets SM120 GPUs with 99KB shared memory and adds support for K=64 block-scaled MoE GEMM paths that fit those constraints. If merged and propagated through the stack, that helps workstation users running Qwen3.5-397B, DeepSeek-style MoE models, and other NVFP4 workloads on local Blackwell systems.
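The shared-memory argument can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough fit-check, not the PR's actual tile configuration: the tile shapes, pipeline stage count, and the 16-element NVFP4 scale-block size are all assumptions chosen to illustrate why halving the K dimension can bring a pipelined tile under a ~99KB budget.

```python
# Hedged sketch: does a block-scaled GEMM tile fit SM120's ~99KB
# shared-memory budget? Tile shapes and stage counts are hypothetical.

SMEM_BUDGET = 99 * 1024   # bytes per threadblock on SM120, per the PR
FP4_BYTES = 0.5           # NVFP4 packs two 4-bit values per byte
SCALE_BLOCK = 16          # assumed: one FP8 scale per 16 NVFP4 elements

def tile_smem_bytes(m: int, n: int, k: int, stages: int) -> int:
    """Approximate shared memory for a pipelined GEMM mainloop:
    A (m x k) and B (n x k) operand tiles plus their FP8 scale factors,
    buffered across `stages` pipeline stages."""
    operands = (m * k + n * k) * FP4_BYTES
    scales = (m * k + n * k) / SCALE_BLOCK  # 1 byte per FP8 scale
    return int((operands + scales) * stages)

for k in (128, 64):
    usage = tile_smem_bytes(m=128, n=256, k=k, stages=4)
    fits = "fits" if usage <= SMEM_BUDGET else "overflows"
    print(f"K={k}: {usage / 1024:.1f} KB -> {fits} the 99KB budget")
```

With these assumed parameters, the K=128 tile overflows the budget while the K=64 variant fits comfortably, mirroring the failure mode the PR describes, where the only available tile shapes either overflowed shared memory or failed to compile.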

Because the numbers are self-reported and the PR remains open, the safest interpretation is directional rather than final. But as a community-sourced engineering story, it is a strong example of how local AI performance is increasingly limited by kernel maturity and system integration, not just model weights.

Primary sources: FlashInfer PR #2786, CUTLASS issue #3096. Community discussion: r/LocalLLaMA.




© 2026 Insights. All rights reserved.