LocalLLaMA Benchmark Argues RTX PRO 6000 SM120 Is Being Held Back by Broken CUTLASS NVFP4 MoE Kernels

Original post: "I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found."

LLM · Mar 16, 2026 · By Insights AI (Reddit) · 2 min read

The benchmark claim in one sentence

On March 12, 2026, a detailed r/LocalLLaMA post argued that workstation-class Blackwell users are hitting a software ceiling, not a hardware one. The author tested 16 configurations for nvidia/Qwen3.5-397B-A17B-NVFP4 on a 4x RTX PRO 6000 setup with 96 GB per GPU, PCIe Gen5, no NVLink, and WSL2. Their reported best sustained decode result was 50.5 tok/s using Marlin W4A16 with tensor parallel size 4 and Multi-Token Prediction disabled.

That figure matters because the post is explicitly pushing back on much larger numbers circulating elsewhere for the same class of hardware. The author’s case is that those higher numbers either rely on unstable paths or count speculative tokens in a way that overstates delivered throughput.
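The counting issue the author raises can be made concrete with a little arithmetic. The sketch below uses a standard simplified model of speculative decoding (per-token acceptance probability, geometric accepted-prefix length); the numbers are hypothetical and are not from the benchmark post, but they show how counting every drafted token inflates the reported rate.

```python
# Illustrative model of how counting drafted tokens inflates reported
# throughput. All numbers here are hypothetical, not from the post.

def spec_decode_throughput(steps_per_s: float, k: int, accept_rate: float):
    """Return (delivered, naive) tok/s for speculative decoding.

    Each verification step emits 1 target-model token plus the accepted
    prefix of k drafted tokens. With per-token acceptance probability
    `accept_rate`, the expected accepted-prefix length is geometric,
    capped at k: sum_{i=1..k} accept_rate**i.
    """
    expected_accepted = sum(accept_rate ** i for i in range(1, k + 1))
    delivered = steps_per_s * (1 + expected_accepted)
    naive = steps_per_s * (1 + k)  # counts every drafted token as output
    return delivered, naive

delivered, naive = spec_decode_throughput(steps_per_s=20, k=3, accept_rate=0.4)
print(f"delivered ~ {delivered:.1f} tok/s, naive count ~ {naive:.1f} tok/s")
# At 40% acceptance the naive count is ~2.5x the delivered rate.
```

With low acceptance rates, the gap between delivered and naively counted throughput widens, which is exactly the author's argument about overstated numbers.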

Why the native FP4 path is the problem

The technical claim is that CUTLASS grouped GEMM kernels for NVFP4 MoE inference are effectively broken on SM120, the desktop/workstation Blackwell variant used by RTX PRO 6000 cards. In the post, native CUTLASS and FlashInfer-backed paths either produced garbage output, skipped large sets of fast tactics, or fell back to slower routes. The author says dense FP4 works, but the grouped GEMM path used for MoE experts does not behave correctly on this architecture. They link that diagnosis to CUTLASS issue #3096, which documents failures and corrupted output on SM120 NVFP4 MoE runs.
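For readers unfamiliar with the term, "grouped GEMM" here means one kernel launch that runs a separate matmul per expert, each over the subset of tokens routed to it. The NumPy loop below is a minimal reference for those semantics (shapes and names are illustrative); a CUTLASS grouped GEMM kernel must reproduce this result, just fused into a single launch.

```python
import numpy as np

def moe_grouped_gemm_ref(tokens, expert_weights, expert_ids):
    """Reference semantics for the grouped GEMM in an MoE expert layer.

    tokens:         (n_tokens, d_in) activations
    expert_weights: (n_experts, d_in, d_out), one weight matrix per expert
    expert_ids:     (n_tokens,) expert index each token was routed to

    A grouped GEMM kernel fuses the per-expert matmuls below into one
    launch; this loop is the result it must reproduce.
    """
    n_tokens = tokens.shape[0]
    d_out = expert_weights.shape[2]
    out = np.zeros((n_tokens, d_out), dtype=tokens.dtype)
    for e in range(expert_weights.shape[0]):
        mask = expert_ids == e
        if mask.any():
            out[mask] = tokens[mask] @ expert_weights[e]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
w = rng.standard_normal((3, 4, 5)).astype(np.float32)
ids = rng.integers(0, 3, size=8)
y = moe_grouped_gemm_ref(x, w, ids)
print(y.shape)  # (8, 5)
```

If this path miscompiles or selects bad tactics on a given architecture, every MoE expert layer is affected, which is why a broken grouped GEMM is fatal for MoE throughput even when dense FP4 GEMM works.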

That is an important distinction for local inference builders. If the kernel path for expert routing is immature, raw hardware capability alone does not translate into usable MoE throughput.

What worked and what did not

The post’s configuration table is unusually concrete. Marlin without MTP won at 50.5 tok/s, while Marlin with MTP reportedly dropped to around 39.6 tok/s because speculative decoding acceptance was too low to offset its overhead. Expert parallel over PCIe was effectively unusable at 1.4 to 2.6 tok/s. Some CUTLASS Docker runs reached the 20s to low 40s tok/s, but only after skipping large sets of fast kernels. The practical recommendation was simple: force Marlin, disable MTP, keep CUDA graphs enabled, and avoid expert parallel on PCIe.
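Those recommendations map onto vLLM engine arguments roughly as sketched below. The parameter names follow vLLM's `EngineArgs` as commonly documented, but the exact mechanism for forcing the Marlin kernel path is version-dependent (the post relied on patched builds), so treat this as an assumed sketch, not a verified recipe.

```python
# Sketch of the post's recommended configuration as vLLM engine arguments.
# Parameter names follow vLLM's EngineArgs; the model name is from the post.
# Forcing the Marlin kernel path is version-dependent and not shown here.
engine_args = {
    "model": "nvidia/Qwen3.5-397B-A17B-NVFP4",
    "tensor_parallel_size": 4,        # TP across the 4x RTX PRO 6000 cards
    "enable_expert_parallel": False,  # EP over PCIe measured 1.4-2.6 tok/s
    "enforce_eager": False,           # keep CUDA graphs enabled
    "speculative_config": None,       # MTP disabled: acceptance too low
}

# With vLLM installed, this would be passed as: LLM(**engine_args)
print(engine_args["tensor_parallel_size"])  # 4
```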

The same write-up also says getting to that point required multiple patches across FlashInfer and vLLM, with upstream work linked in FlashInfer PR #2725 and vLLM PR #36453. Whether or not every interpretation in the thread holds, the benchmark is useful because it exposes a real implementation gap between vendor marketing around FP4 inference and what local workstation users can currently achieve.

Why the thread matters

The broader lesson is that large-model local inference is now bottlenecked as much by kernel readiness and architecture-specific support as by model weights or memory size. For teams evaluating Blackwell workstations, the post suggests that “can load the model” and “can use the intended fast path” are no longer the same question.

Primary references: CUTLASS issue #3096, FlashInfer PR #2725, vLLM PR #36453. Community discussion: r/LocalLLaMA.
