Reddit post flags a likely FP32 cuBLAS dispatch problem on RTX 5090
Original: [D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
A community benchmark that caught attention
A 2026-04-10 post on r/MachineLearning argued that batched FP32 SGEMM on the RTX 5090 may be hitting a badly chosen cuBLAS kernel path. At the time of review, the thread had a score of 93 and 6 comments. The Reddit post summarized the measurements, while a linked Medium article expanded on the profiling details. The author listed the test stack as CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03.
The core claim is straightforward. A custom TMA-based SGEMM kernel reportedly outperformed the default batched cuBLAS path on RTX 5090 by roughly 1.4x to 1.7x, and ncu profiling suggested that the 5090 was being pinned to the same small simt_sgemm_128x32_8x5 kernel across a very wide workload range. In the linked writeup, that path is described as running at roughly 33% to 42% FMA pipe utilization. By comparison, the same analysis said an RTX PRO 6000 reached about 73% and an H200 about 82% with different kernel families.
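The utilization percentages above can be related to raw throughput with some simple arithmetic. The sketch below is illustrative only: the ~105 TFLOP/s FP32 peak is an assumed ballpark figure for the RTX 5090, and the batch size and timing are hypothetical inputs, none of them taken from the post.

```python
# Hedged sketch: convert a measured batched-SGEMM wall-clock time into
# achieved TFLOP/s and a rough utilization fraction against an ASSUMED
# FP32 peak. None of the inputs below are measurements from the post.

def gemm_flops(batch, m, n, k):
    """Total floating-point ops for a batched GEMM (2 ops per FMA)."""
    return 2 * batch * m * n * k

def utilization(batch, m, n, k, seconds, peak_tflops):
    """Return (achieved TFLOP/s, fraction of the assumed peak)."""
    achieved = gemm_flops(batch, m, n, k) / seconds / 1e12
    return achieved, achieved / peak_tflops

# Hypothetical example: 8 batches of 4096x4096x4096 FP32 GEMM in 25 ms,
# scored against an assumed ~105 TFLOP/s FP32 peak.
achieved, frac = utilization(8, 4096, 4096, 4096, 25e-3, 105.0)
print(f"{achieved:.1f} TFLOP/s, {frac:.0%} of assumed peak")
# → 44.0 TFLOP/s, 42% of assumed peak
```

With these made-up numbers the workload lands at the top of the 33% to 42% utilization band the writeup attributed to the SIMT path, which is the kind of back-of-envelope check readers can apply to their own timings.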
- The posted benchmark showed 46% to 70% gains for the custom kernel on batched sizes from 1024 to 8192.
- The author argued that this looks less like a bad threshold and more like missing escalation logic in the RTX 5090 batched FP32 dispatcher.
- The Medium article also said cuBLASLt stays on a SIMT-oriented path for strict FP32, while FAST_TF32 and BF16 are faster only because they accept lower input precision.
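The precision trade-off behind that last point can be made concrete. TF32 keeps the 8-bit FP32 exponent but only 10 explicit mantissa bits, so switching a strict-FP32 workload to FAST_TF32 silently drops the low 13 mantissa bits of each input. The pure-Python sketch below emulates that rounding on the FP32 bit pattern; it ignores NaN/infinity edge cases and is a simplified model, not NVIDIA's exact hardware behavior.

```python
import struct

def to_tf32(x):
    """Round a float to TF32 input precision (simplified model).

    TF32 shares FP32's 8-bit exponent but keeps only 10 explicit
    mantissa bits, so the low 13 mantissa bits of the FP32 encoding
    are rounded away. NaN/inf handling is deliberately omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest on the 13 discarded bits, then clear them.
    bits = (bits + (1 << 12)) & ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# 1 + 2^-10 survives (it fits in 10 mantissa bits);
# 1 + 2^-12 does not, and rounds back to exactly 1.0.
print(to_tf32(1 + 2**-10) == 1 + 2**-10)  # → True
print(to_tf32(1 + 2**-12) == 1.0)         # → True
```

This is why the author treats FAST_TF32 and BF16 speedups as a different category from the dispatch question: they change the numerics of the inputs, whereas a better strict-FP32 kernel path would not.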
This matters well beyond a single benchmark screenshot. Matrix multiplication is a central primitive for modern AI workloads, so a weaker dispatch path on consumer RTX cards can translate into avoidable cost and latency for local training, inference, and benchmarking. The post also carried a second message for systems engineers: the author used TMA to build a relatively compact kernel that still landed close to efficient vendor implementations on better-tuned hardware, which suggests that the gap is not only about silicon but also about software routing.
At the same time, this is still a community finding rather than an NVIDIA-confirmed bug bulletin. The evidence is the author’s benchmark, profiler traces, and linked writeup. Reddit discussion was still limited when reviewed. The top comment asked why the results were posted on Reddit instead of NVIDIA’s own forums, while another simply praised the investigation. So the safest interpretation is that the community has surfaced a technically credible performance question that now deserves broader reproduction.
Source links: Reddit thread, Medium benchmark writeup, DeploDock repository.
Related Articles
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
A DGX Spark owner on LocalLLaMA argues that NVFP4 remains far from production-ready, prompting a broader debate about whether NVIDIA's premium local AI box still justifies its price.
NVIDIA and Thinking Machines Lab said on March 10, 2026 that they will deploy at least one gigawatt of next-generation NVIDIA Vera Rubin systems under a multiyear partnership. The agreement also covers co-design of training and serving systems plus an NVIDIA investment in Thinking Machines Lab.