Coding agents are being tested on GPU performance work, not just app scaffolding. Cursor says its NVIDIA collaboration produced a 38% geomean speedup across 235 CUDA kernel problems in three weeks.
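A geomean speedup like the one quoted is the n-th root of the product of per-problem speedup ratios, which keeps one huge outlier from dominating the average. A minimal sketch (the sample ratios below are hypothetical, not Cursor's data):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios (>1 means faster)."""
    # Sum logs instead of multiplying, to avoid overflow on long products.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-kernel ratios; a 38% geomean would correspond to ~1.38.
example = [1.10, 1.65, 1.25, 1.55]
print(round(geomean_speedup(example), 3))
```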
#cuda
A front-page Hacker News discussion resurfaced an EE Times interview outlining how AMD wants ROCm, Triton, OneROCm, and an open-source release model to chip away at CUDA dependence. The real test is not a headline compatibility claim, but whether stacks like vLLM and SGLang work in a boring, dependable way.
A r/MachineLearning thread and linked benchmark writeup argue that cuBLAS may be choosing an inefficient kernel for batched FP32 SGEMM on RTX 5090, leaving much of the GPU idle. The significance is not just the claimed slowdown, but that the post includes reproducible benchmark tables, profiling notes, and linked repro material.
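One way to sanity-check a claim like this is to convert a measured batched-GEMM time into achieved TFLOP/s and compare against the card's peak. A sketch of that arithmetic (the timing and peak figure below are placeholders, not measurements from the post or the 5090 spec sheet):

```python
def batched_sgemm_flops(batch, m, n, k):
    # Each GEMM costs 2*m*n*k FLOPs (one multiply + one add per inner-product term).
    return 2.0 * batch * m * n * k

def achieved_tflops(batch, m, n, k, seconds):
    return batched_sgemm_flops(batch, m, n, k) / seconds / 1e12

# Hypothetical run: 256 batched 512x512x512 FP32 GEMMs in 5 ms.
tflops = achieved_tflops(256, 512, 512, 512, 5e-3)
peak_fp32_tflops = 100.0  # placeholder peak, not an actual 5090 figure
print(f"{tflops:.1f} TFLOP/s, {100 * tflops / peak_fp32_tflops:.0f}% of assumed peak")
```

A large gap between achieved and peak throughput on a compute-bound shape is exactly the kind of signal the post's profiling tables surface.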
A high-signal r/LocalLLaMA benchmark post said moving Qwen 3.5 27B from mainline llama.cpp to ik_llama.cpp raised prompt evaluation from about 43 tok/sec to 1,122 tok/sec on a Blackwell RTX PRO 4000, with generation climbing from 7.5 tok/sec to 26 tok/sec.
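The headline ratios implied by those numbers are straightforward to check:

```python
def speedup(new_tok_s, old_tok_s):
    """Ratio of new to old throughput in tokens/sec."""
    return new_tok_s / old_tok_s

# Figures reported in the post.
print(f"prompt eval: {speedup(1122, 43):.1f}x")
print(f"generation:  {speedup(26, 7.5):.2f}x")
```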
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
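Speed-of-Light bounds are typically derived from a roofline-style model: a kernel can run no faster than the larger of its compute time and its memory-traffic time at the hardware's peaks. A minimal sketch of that scoring idea (the peak figures below are placeholders, not B200 specs, and SOL-ExecBench's exact methodology may differ):

```python
def speed_of_light_seconds(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    # Roofline-style lower bound: the kernel is limited by whichever
    # resource (compute or memory traffic) it saturates first.
    return max(flops / peak_flops_per_s, bytes_moved / peak_bytes_per_s)

def sol_fraction(measured_seconds, flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    # 1.0 means the kernel runs exactly at the hardware bound.
    return speed_of_light_seconds(
        flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s) / measured_seconds

# Hypothetical memory-bound kernel: 1 GFLOP and 1 GB moved on a device with
# 1 PFLOP/s compute and 1 TB/s bandwidth (placeholder figures).
bound = speed_of_light_seconds(1e9, 1e9, 1e15, 1e12)  # memory-limited: 1 ms
print(sol_fraction(2e-3, 1e9, 1e9, 1e15, 1e12))       # runs at half of SOL
```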
A LocalLLaMA thread amplified Phoronix coverage of GreenBoost, an experimental GPLv2 Linux module that adds a multi-tier memory path for NVIDIA GPUs. The design pairs a kernel module with a CUDA shim so large allocations can spill from limited on-card vRAM into pinned system RAM and NVMe-backed storage without modifying CUDA applications.
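The spill behavior described can be illustrated with a toy tiering policy: place each allocation in the fastest tier with room, falling back down the hierarchy otherwise. This is an invented sketch, not GreenBoost's actual allocator logic, and the capacities are made up:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity: int  # bytes
    used: int = 0

    def free(self):
        return self.capacity - self.used

def allocate(tiers, size):
    """Place an allocation in the first tier with room, spilling down otherwise."""
    for tier in tiers:  # tiers ordered fastest-first: vRAM, pinned RAM, NVMe
        if tier.free() >= size:
            tier.used += size
            return tier.name
    raise MemoryError("all tiers exhausted")

tiers = [Tier("vram", 16 << 30), Tier("pinned_ram", 64 << 30), Tier("nvme", 512 << 30)]
print(allocate(tiers, 12 << 30))  # fits in on-card vRAM
print(allocate(tiers, 8 << 30))   # vRAM has only 4 GB free, spills to pinned RAM
```

The interesting part of the real design is that this placement happens under a CUDA shim, so applications see ordinary device allocations.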
A Reddit post in r/singularity highlighted CUDA Agent, a ByteDance Seed and Tsinghua AIR project that reports high pass rates and speedups over torch.compile on KernelBench.
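KernelBench-style results combine two things: whether a generated kernel is numerically correct, and how fast it is relative to a baseline such as torch.compile. A sketch of that kind of metric (modeled on KernelBench's fast_p idea, though the benchmark's exact definition may differ):

```python
def fast_p(results, p=1.0):
    """Fraction of problems where the generated kernel is both correct
    and more than p times faster than the baseline.

    results: list of (correct: bool, speedup_over_baseline: float).
    """
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Hypothetical per-problem outcomes: (correctness, speedup vs torch.compile).
runs = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
print(fast_p(runs, p=1.0))  # correct-and-faster fraction
```

Note that an incorrect kernel never counts, no matter how fast it is, which is why pass rate and speedup have to be read together.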
A high-scoring Hacker News post highlighted BarraCUDA, an open-source C99 compiler that translates CUDA `.cu` code directly into AMD GFX11 `.hsaco` binaries with no LLVM dependency.