Reddit post flags a likely FP32 cuBLAS dispatch problem on RTX 5090

A community benchmark that caught attention

A 2026-04-10 post on r/MachineLearning argued that batched FP32 SGEMM on the RTX 5090 may be hitting a badly chosen cuBLAS kernel path. When reviewed, the thread had a score of 93 and 6 comments. The Reddit post summarized the measurements, while a linked Medium article expanded the profiling details. The author listed the test stack as CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03.

The core claim is straightforward. A custom TMA-based SGEMM kernel reportedly outperformed the default batched cuBLAS path on RTX 5090 by roughly 1.4x to 1.7x, and ncu profiling suggested that the 5090 was being pinned to the same small simt_sgemm_128x32_8x5 kernel across a very wide workload range. In the linked writeup, that path is described as running at roughly 33% to 42% FMA pipe utilization. By comparison, the same analysis said an RTX PRO 6000 reached about 73% and an H200 about 82% with different kernel families.
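For anyone trying to reproduce the comparison, a measured kernel time can be turned into achieved throughput with the standard 2·M·N·K operation count for matrix multiplication. The sketch below is illustrative only: the function names, the matrix shape, the batch size, and the timing are hypothetical placeholders, not values taken from the post or the Medium writeup.

```python
def batched_sgemm_flops(m: int, n: int, k: int, batch: int) -> int:
    """FLOPs for `batch` independent (m x k) @ (k x n) products.

    Each of the m*n output elements needs k multiply-adds,
    and one multiply-add is counted as 2 floating-point operations.
    """
    return 2 * m * n * k * batch


def achieved_tflops(m: int, n: int, k: int, batch: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for one timed batched SGEMM call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return batched_sgemm_flops(m, n, k, batch) / seconds / 1e12


# Hypothetical example: batch of 8 square 4096 matrices (2**40 FLOPs in total).
# The 0.05 s timing is a made-up placeholder, not a measurement.
example_flops = batched_sgemm_flops(4096, 4096, 4096, 8)
example_tflops = achieved_tflops(4096, 4096, 4096, 8, seconds=0.05)
print(f"{example_flops} FLOPs, {example_tflops:.2f} TFLOP/s")
```

Comparing this figure for the default cuBLAS path and for a custom kernel on the same shapes gives a throughput ratio that can be set against the profiler's pipe-utilization numbers.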

The posted benchmark showed 46% to 70% gains for the custom kernel on batched sizes from 1024 to 8192.
The author argued that this looks less like a bad threshold and more like missing escalation logic in the RTX 5090 batched FP32 dispatcher.
The Medium article also said cuBLASLt stays on a SIMT-oriented path for strict FP32, while FAST_TF32 and BF16 are faster only because they accept lower input precision.
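The 46% to 70% gains and the "roughly 1.4x to 1.7x" figure describe the same result in two forms. A small sketch of the conversion follows, assuming that "gain" means a throughput increase over the cuBLAS baseline (the post's exact definition is not restated here); the helper names are hypothetical.

```python
def gain_to_speedup(gain_percent: float) -> float:
    """Convert a throughput gain in percent into a speedup factor.

    Assumption: gain is measured relative to the baseline throughput,
    so a 46% gain means 1.46x the baseline.
    """
    return 1.0 + gain_percent / 100.0


def time_saved_fraction(gain_percent: float) -> float:
    """Fraction of baseline run time saved at a given throughput gain."""
    return 1.0 - 1.0 / gain_to_speedup(gain_percent)


for gain in (46.0, 70.0):
    print(
        f"{gain:.0f}% gain -> {gain_to_speedup(gain):.2f}x speedup, "
        f"{time_saved_fraction(gain):.1%} less run time"
    )
```

Under that assumption, the reported range corresponds to 1.46x and 1.70x speedups, or roughly a third less run time at the low end.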

This matters well beyond a single benchmark screenshot. Matrix multiplication is a central primitive for modern AI workloads, so a weaker dispatch path on consumer RTX cards can translate into avoidable cost and latency for local training, inference, and benchmarking. The post also carried a second message for systems engineers: the author used TMA to build a relatively compact kernel that still landed close to efficient vendor implementations on better-tuned hardware, which suggests that the gap is not only about silicon but also about software routing.

At the same time, this is still a community finding rather than an NVIDIA-confirmed bug bulletin. The evidence is the author’s benchmark, profiler traces, and linked writeup. Reddit discussion was still limited when reviewed. The top comment asked why the results were posted on Reddit instead of NVIDIA’s own forums, while another simply praised the investigation. So the safest interpretation is that the community has surfaced a technically credible performance question that now deserves broader reproduction.

Source links: Reddit thread, Medium benchmark writeup, DeploDock repository.
