What Reddit is claiming

An r/MachineLearning post on April 9, 2026 argues that cuBLAS is dispatching a poor kernel for batched FP32 matrix multiplication on RTX 5090 hardware. The author says the issue appears across workloads from 256x256 up to 8192x8192x8, and that the affected path uses only about 40% of the available compute on the RTX cards tested. According to the report, the problem was reproduced with CUDA 13.2.51, cuBLAS 13.3.0, and NVIDIA driver 595.58.03.
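To make a figure like "about 40% of the available compute" concrete, the following is a minimal sketch of how a throughput-based utilization estimate is usually derived for batched GEMM. It is not the author's method, and it is not the same thing as the Nsight Compute FMA pipe metric. The function names, the elapsed time, and the peak throughput value are hypothetical placeholders, not measurements from the post.

```python
def batched_gemm_flops(m: int, n: int, k: int, batch: int) -> int:
    """FLOPs for `batch` independent (m x k) @ (k x n) products.

    Each output element is conventionally counted as k multiplies plus
    k adds, i.e. 2*k floating-point operations.
    """
    return 2 * m * n * k * batch


def utilization(m: int, n: int, k: int, batch: int,
                elapsed_s: float, peak_flops_per_s: float) -> float:
    """Fraction of a stated peak throughput that a timed run achieved."""
    achieved = batched_gemm_flops(m, n, k, batch) / elapsed_s
    return achieved / peak_flops_per_s


if __name__ == "__main__":
    # Hypothetical inputs: a 1024^3 problem with batch 4, a made-up 1 ms
    # runtime, and a made-up 20 TFLOP/s FP32 peak. Substitute the measured
    # time and the vendor-stated peak for the GPU being tested.
    u = utilization(1024, 1024, 1024, 4, elapsed_s=1e-3, peak_flops_per_s=20e12)
    print(f"estimated utilization: {u:.1%}")
```

An estimate of this kind only says how far a run is from the stated peak. Whether the gap comes from kernel selection, as the post argues, still has to be confirmed with a profiler.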

The evidence in the post is unusually concrete for a forum thread. The author compares cuBLAS with a custom double-buffered kernel built on the Tensor Memory Accelerator (TMA) and reports that the custom path outperforms cuBLAS by roughly 20% to 70% on many batched workloads. At sizes 2048 and 4096, for example, the Reddit tables show the custom kernel reaching around 155% to 170% of cuBLAS throughput depending on batch size, which is 55% to 70% faster.
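The two ways of quoting the gap ("155% of cuBLAS throughput" versus "55% faster") are easy to mix up. The short sketch below shows the conversion, assuming both kernels are timed on the same workload. The helper names are illustrative, and the only figures used are the percentages quoted above.

```python
def relative_throughput(baseline_time_s: float, candidate_time_s: float) -> float:
    """Candidate throughput as a multiple of the baseline on identical work.

    Throughput is work / time, so with equal work the ratio reduces to
    baseline_time / candidate_time.
    """
    return baseline_time_s / candidate_time_s


def percent_faster(rel_throughput: float) -> float:
    """Convert a relative throughput (1.55 means 155%) to 'percent faster'."""
    return (rel_throughput - 1.0) * 100.0


if __name__ == "__main__":
    # The 155% and 170% relative-throughput figures quoted from the thread.
    for rel in (1.55, 1.70):
        print(f"{rel:.0%} of baseline throughput = {percent_faster(rel):.0f}% faster")
```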

Why the comparison matters

The more important claim is that the baseline itself may be mis-selected. The same post says cuBLAS uses more appropriate kernels on other NVIDIA products. According to the published profiling summary, an RTX Pro 6000 reaches about 73% FMA utilization and an H200 reaches about 82%, while the RTX path falls much lower, in line with the roughly 40% compute figure cited in the post. The author says the custom kernel still trails a properly chosen Pro or H200 kernel in some cases, which is what makes the bug report notable: the issue may be kernel selection on RTX, not an optimization that cuBLAS could not match.

The thread also points to a Medium deep dive, full Nsight Compute profiling data, and a GitHub repository with reproduction scripts and benchmark reports. That makes the post more useful than a generic complaint about vendor libraries. Other engineers can inspect the workload mix, compare batch sizes, and test whether the same regression shows up on additional non-Pro RTX GPUs.

Why AI infrastructure teams should care

For AI and inference teams, batched GEMM behavior matters far beyond synthetic benchmarks. The difference between a library picking the right kernel and the wrong one can ripple into throughput, latency, and hardware purchasing decisions. If the Reddit report holds up, the practical lesson is uncomfortable: consumer RTX cards may leave a surprising amount of FP32 batched performance on the table unless developers profile the exact kernel path they are getting.

That is why this post is interesting even before NVIDIA responds. It turns a vague my-GPU-feels-slow complaint into a reproducible systems question about dispatch logic, architecture segmentation, and whether library defaults are actually aligned with the workloads local AI builders are running.

Source links: Reddit thread, Linked article, Benchmark report.


Reddit Flags a Possible cuBLAS Regression on RTX 5090 Batched FP32 Workloads