Reddit Flags a Possible cuBLAS Regression on RTX 5090 Batched FP32 Workloads
Original post: [D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
What Reddit is claiming
A MachineLearning post on April 9, 2026 argues that cuBLAS is dispatching a poor kernel for batched FP32 matrix multiplication on RTX 5090 hardware. The author says the issue appears across workloads from 256x256 up to 8192x8192x8, and that the affected path uses only about 40% of the available compute on tested RTX cards. The report says the problem was reproduced with CUDA 13.2.51, cuBLAS 13.3.0, and NVIDIA driver 595.58.03.
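To make the "about 40% of available compute" claim concrete, here is a minimal sketch of the underlying arithmetic. The peak FP32 figure is an assumption for illustration (it is not taken from the post; check your card's spec sheet), and the helper name is ours:

```python
# Hypothetical sketch: what ~40% utilization means for a batched SGEMM.
# PEAK_FP32_TFLOPS is an ASSUMED placeholder, not a number from the report.

def batched_sgemm_flops(m: int, n: int, k: int, batch: int) -> int:
    """FLOPs for batched C = A @ B: 2*m*n*k per batch entry (multiply + add)."""
    return 2 * m * n * k * batch

PEAK_FP32_TFLOPS = 105.0  # assumed peak for illustration; adjust for your GPU

flops = batched_sgemm_flops(4096, 4096, 4096, 8)
# At the ~40% utilization the post reports, achieved throughput would be:
achieved_tflops = 0.40 * PEAK_FP32_TFLOPS
time_s = flops / (achieved_tflops * 1e12)
print(f"{flops / 1e12:.2f} TFLOP at ~{achieved_tflops:.0f} TFLOP/s "
      f"-> {time_s * 1e3:.1f} ms per call")
```

Under these assumptions, closing the gap to the ~73% utilization reported for the RTX Pro 6000 would nearly halve the wall time of the same call.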
The evidence in the post is unusually concrete for a forum thread. The author compares cuBLAS against a custom kernel built around the Tensor Memory Accelerator (TMA) with double buffering, and reports that the custom path outperforms cuBLAS by roughly 20% to 70% on many batched workloads. At the 2048 and 4096 sizes, for example, the Reddit tables show the custom kernel reaching around 155% to 170% of cuBLAS throughput, depending on batch size.
Why the comparison matters
The more important claim is that the baseline itself may be mis-selected. The same post says cuBLAS picks more appropriate kernels on other NVIDIA products. According to the published profiling summary, an RTX Pro 6000 reaches about 73% FMA utilization and an H200 about 82%, while the RTX 5090 path falls far lower. The author notes that the custom kernel still trails a properly chosen Pro or H200 kernel in some cases, which is what makes the bug report notable: the issue may be kernel selection on RTX, not an optimization that cuBLAS is inherently unable to match.
The thread also points to a Medium deep dive, full Nsight Compute profiling data, and a GitHub repository with reproduction scripts and benchmark reports. That makes the post more useful than a generic complaint about vendor libraries. Other engineers can inspect the workload mix, compare batch sizes, and test whether the same regression shows up on additional non-Pro RTX GPUs.
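The measurement pattern behind such reports is straightforward: time the batched matmul and convert wall time to achieved FLOP/s. A minimal sketch of that methodology, run here with NumPy on the CPU purely for illustration (the actual repro scripts target cuBLAS on the GPU; the function name is ours):

```python
# Minimal sketch of the benchmark methodology: time a batched FP32 matmul
# and report achieved GFLOP/s. CPU/NumPy stand-in, not the GPU reproduction.
import time
import numpy as np

def achieved_gflops(m: int, n: int, k: int, batch: int, reps: int = 3) -> float:
    a = np.random.rand(batch, m, k).astype(np.float32)
    b = np.random.rand(batch, k, n).astype(np.float32)
    np.matmul(a, b)  # warm-up pass so timing excludes one-time setup
    t0 = time.perf_counter()
    for _ in range(reps):
        np.matmul(a, b)
    dt = (time.perf_counter() - t0) / reps
    return 2 * m * n * k * batch / dt / 1e9  # 2*m*n*k FLOPs per batch entry

print(f"{achieved_gflops(256, 256, 256, 8):.1f} GFLOP/s (CPU baseline)")
```

On a GPU the timing would instead bracket the cuBLAS call with CUDA events and synchronize before reading the clock, but the FLOP accounting is identical.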
Why AI infrastructure teams should care
For AI and inference teams, batched GEMM behavior matters far beyond synthetic benchmarks. The difference between a library picking the right kernel and the wrong one can ripple into throughput, latency, and hardware purchasing decisions. If the Reddit report holds up, the practical lesson is uncomfortable: consumer RTX cards may leave a surprising amount of FP32 batched performance on the table unless developers profile the exact kernel path they are getting.
That is why this post is interesting even before NVIDIA responds. It turns a vague my-GPU-feels-slow complaint into a reproducible systems question about dispatch logic, architecture segmentation, and whether library defaults are actually aligned with the workloads local AI builders are running.
Source links: Reddit thread, Linked article, Benchmark report.