AI-generated CUDA kernels passed the benchmark, then broke real training

An r/MachineLearning post described a failure mode that should make anyone who relies on performance-engineering benchmarks uneasy. Several AI-generated CUDA kernels ranked well on NVIDIA’s SOL-ExecBench, but when the authors tried using top submissions inside production-like workloads, some broke in ways that were difficult to diagnose. One fused embedding-gradient plus RMSNorm backward kernel passed the benchmark verifier, then made a small transformer’s training loss diverge.

The bug was not a simple wrong answer on a test case. The embedding-gradient part accumulated in bf16 instead of fp32. With uniformly sampled tokens, each embedding row received few enough gradient contributions that bf16 precision appeared acceptable. With real text, frequent token IDs received thousands of contributions. bf16 carries only 8 bits of significand precision, so once the running sum is a few hundred times larger than a new contribution, the addition rounds back to the old value. Those smaller values rounded away against the growing accumulator, causing high-frequency embedding rows to drift. AdamW masked the issue through its per-parameter normalization, so the same kernel looked fine when the optimizer was AdamW.
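The rounding effect can be reproduced without a GPU. The following is a minimal sketch, not the kernel from the post: it emulates bf16 rounding in pure Python and sums many identical contributions into one embedding row, once with a bf16 accumulator and once with an fp32 accumulator. The helper names (`to_bf16`, `to_fp32`, `accumulate`, `relative_error`), the contribution size of 0.001, and the counts of 8 and 4096 are all illustrative assumptions rather than values from the original report.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 is the top 16 bits of an fp32 word; add a rounding bias, then truncate.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(contribution: float, count: int, round_acc) -> float:
    """Add `count` copies of a bf16 gradient value into one accumulator.

    `round_acc` chooses the accumulator precision (to_bf16 or to_fp32).
    """
    g = to_bf16(contribution)  # inputs are bf16 in both variants
    acc = 0.0
    for _ in range(count):
        acc = round_acc(acc + g)
    return acc


def relative_error(count: int, contribution: float = 0.001) -> float:
    """Relative error of bf16 accumulation against the exact sum."""
    exact = to_bf16(contribution) * count
    got = accumulate(contribution, count, to_bf16)
    return abs(got - exact) / exact


if __name__ == "__main__":
    # Hypothetical counts: a rarely seen row versus a frequent-token row.
    for count in (8, 4096):
        bf16_sum = accumulate(0.001, count, to_bf16)
        fp32_sum = accumulate(0.001, count, to_fp32)
        print(count, bf16_sum, fp32_sum, relative_error(count))
```

With only a handful of contributions per row the two accumulators agree closely, which mirrors why a uniformly sampled test input would not expose the problem. With thousands of contributions the bf16 accumulator stops growing once each addend falls below half a unit in the last place of the running sum, so the frequent-token row ends up far below the fp32 total.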

The community discussion focused on the limits of “passes the verifier.” A top comment noted that bf16 is common enough that many practitioners might gloss over the detail. Another argued that optimizer and dataset sensitivity should be part of kernel testing. The dangerous part is how easily the failure is misattributed: the symptom looks like a failed research idea, a bad dataset, or a weak architecture before it looks like a kernel bug.

AI-generated performance code is getting fast enough to matter. The next problem is whether it is correct under the messy distributions where models are actually trained and served. Benchmarks that reward speed will need broader verification, especially for kernels that sit inside repeated training steps and can quietly bias results.

Reddit discussion · Related research post
