LocalLLaMA Benchmark Argues RTX PRO 6000 SM120 Is Being Held Back by Broken CUTLASS NVFP4 MoE Kernels
Original post: "I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found."
The benchmark claim in one sentence
On March 12, 2026, a detailed r/LocalLLaMA post argued that workstation-class Blackwell users are hitting a software ceiling, not a hardware one. The author tested 16 configurations for nvidia/Qwen3.5-397B-A17B-NVFP4 on a 4x RTX PRO 6000 setup with 96 GB per GPU, PCIe Gen5, no NVLink, and WSL2. Their reported best sustained decode result was 50.5 tok/s using Marlin W4A16 with tensor parallel size 4 and Multi-Token Prediction disabled.
That figure matters because the post is explicitly pushing back on much larger numbers circulating elsewhere for the same class of hardware. The author’s case is that those higher numbers either rely on unstable paths or count speculative tokens in a way that overstates delivered throughput.
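The accounting dispute can be made concrete. Under the standard speculative-decoding model (an i.i.d. per-token acceptance rate, with accepted drafts forming a prefix of each verification step), the expected number of tokens actually committed per step is much smaller than the number proposed when acceptance is low. This is a hedged illustration of that arithmetic, not a claim about how the disputed numbers were measured:

```python
def expected_tokens_per_step(k: int, alpha: float) -> float:
    """Expected tokens committed per verification step with k draft tokens.

    Accepted drafts form a prefix, and one extra token always comes from the
    target model's own sample, giving (1 - alpha**(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def delivered_tok_s(step_rate_hz: float, k: int, alpha: float) -> float:
    """Throughput counting only tokens the target model actually commits."""
    return step_rate_hz * expected_tokens_per_step(k, alpha)

def naive_tok_s(step_rate_hz: float, k: int) -> float:
    """Inflated figure if every proposed draft token is counted as output."""
    return step_rate_hz * (k + 1)
```

With illustrative numbers (3 draft tokens, 30% acceptance), the naive count overstates delivered throughput by nearly 3x: `naive_tok_s(10.0, 3)` is 40.0 tok/s while `delivered_tok_s(10.0, 3, 0.3)` is about 14.2 tok/s, which is the shape of discrepancy the author is alleging.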
Why the native FP4 path is the problem
The technical claim is that CUTLASS grouped GEMM kernels for NVFP4 MoE inference are effectively broken on SM120, the desktop/workstation Blackwell variant used by RTX PRO 6000 cards. In the post, native CUTLASS and FlashInfer-backed paths either produced garbage output, skipped large sets of fast tactics, or fell back to slower routes. The author says dense FP4 works, but the grouped GEMM path used for MoE experts does not behave correctly on this architecture. They link that diagnosis to CUTLASS issue #3096, which documents failures and corrupted output on SM120 NVFP4 MoE runs.
That is an important distinction for local inference builders. If the kernel path for expert routing is immature, raw hardware capability alone does not translate into usable MoE throughput.
What worked and what did not
The post’s configuration table is unusually concrete. Marlin without MTP won at 50.5 tok/s, while Marlin with MTP reportedly dropped to around 39.6 tok/s because speculative-decoding acceptance was too low to offset the extra verification overhead. Expert parallel over PCIe was effectively unusable at 1.4 to 2.6 tok/s. Some CUTLASS Docker runs reached the 20s to low 40s tok/s, but only after skipping large sets of fast kernels. The practical recommendation was simple: force Marlin, disable MTP, keep CUDA graphs enabled, and avoid expert parallel on PCIe.
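That recommendation can be sketched as a vLLM launch command. The post does not include the author's exact invocation, so treat this as an assumption-laden sketch of current vLLM CLI conventions; in particular, whether `--quantization marlin` is how a given vLLM version selects the Marlin path for an NVFP4 checkpoint may vary:

```shell
# Hypothetical sketch, not the author's verbatim command.
# TP=4 across the four GPUs; expert parallel over PCIe is deliberately
# NOT enabled (no --enable-expert-parallel), per the post's findings.
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4 \
  --quantization marlin
# Deliberately omitted: any --speculative-config (so MTP stays off) and
# --enforce-eager (so CUDA graphs stay enabled, vLLM's default behavior).
```

The notable design choice is that two of the four recommendations are expressed by leaving flags out: CUDA graphs and non-speculative decoding are defaults, so the "fast" configuration here is mostly about not opting into the broken or counterproductive paths.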
The same write-up also says getting to that point required multiple patches across FlashInfer and vLLM, with upstream work linked in FlashInfer PR #2725 and vLLM PR #36453. Whether or not every interpretation in the thread holds, the benchmark is useful because it exposes a real implementation gap between vendor marketing around FP4 inference and what local workstation users can currently achieve.
Why the thread matters
The broader lesson is that large-model local inference is now bottlenecked as much by kernel readiness and architecture-specific support as by model weights or memory size. For teams evaluating Blackwell workstations, the post suggests that “can load the model” and “can use the intended fast path” are no longer the same question.
Primary references: CUTLASS issue #3096, FlashInfer PR #2725, vLLM PR #36453. Community discussion: r/LocalLLaMA.
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A high-scoring LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
Alibaba launched Qwen3.5, a 397B-parameter open-weight multimodal model supporting 201 languages. The company claims it outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 on benchmarks, while costing 60% less than its predecessor.