AI-generated CUDA kernels passed the benchmark, then broke real training

Original: AI-generated CUDA kernels silently break training and inference [R]

LLM · May 28, 2026 · By Insights AI (Reddit)

An r/MachineLearning post described a failure mode that should make anyone who relies on performance-engineering benchmarks uneasy. Several AI-generated CUDA kernels ranked well on NVIDIA’s SOL-ExecBench, but when the authors tried the top submissions inside production-like workloads, some broke in ways that were difficult to diagnose. One fused embedding-gradient plus RMSNorm backward kernel passed the benchmark verifier, then made a small transformer’s training loss diverge.

The bug was not a simple wrong answer on a test case. The embedding-gradient part accumulated in bf16 instead of fp32. With uniformly sampled tokens, gradient contributions were spread across enough rows that bf16 precision appeared acceptable. With real text, frequent token IDs received thousands of contributions. bf16 keeps only about 8 significant bits, so once a row's accumulator grew large enough, each new small contribution fell below its rounding step and was rounded away, causing high-frequency embedding rows to drift. AdamW masked the issue through per-parameter normalization, so the same kernel could look fine or break training depending on which optimizer was used.
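The rounding effect described above can be reproduced without a GPU. The following is a minimal sketch, not the kernel from the post: it emulates bf16 and fp32 rounding in plain Python and adds the same small gradient to one embedding row many times. The helper names, the gradient value `1e-3`, and the counts 8 and 4096 are illustrative assumptions.

```python
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to bf16 (8 significant bits), round-to-nearest-even.

    Only intended for the small finite values used in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # keep the top 16 bits (the bf16 pattern)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(grad: float, count: int, round_fn) -> float:
    """Add the same bf16 gradient `count` times, rounding the sum each step."""
    g = round_to_bf16(grad)  # incoming gradients are bf16 in both cases
    acc = 0.0
    for _ in range(count):
        acc = round_fn(acc + g)  # round_fn models the accumulator's dtype
    return acc


def relative_error(count: int, grad: float = 1e-3) -> float:
    """Relative error of a bf16 accumulator against the exact sum."""
    exact = count * round_to_bf16(grad)
    return abs(accumulate(grad, count, round_to_bf16) - exact) / exact


if __name__ == "__main__":
    # Hypothetical counts: a rare token vs. a frequent token in one batch.
    for label, count in [("rare token", 8), ("frequent token", 4096)]:
        bf16_sum = accumulate(1e-3, count, round_to_bf16)
        fp32_sum = accumulate(1e-3, count, round_to_fp32)
        print(f"{label}: bf16={bf16_sum:.6f} fp32={fp32_sum:.6f}")
```

In this toy setup the rare row's bf16 sum stays close to exact, while the frequent row's bf16 accumulator stops growing long before all 4096 contributions have been added. The fp32 accumulator does not stall, which is why a uniform-token test can pass while real text fails.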

The community discussion focused on the limits of “passes the verifier.” A top comment noted that bf16 is common enough that many practitioners might gloss over the detail. Another argued that optimizer and dataset sensitivity should be part of kernel testing. The dangerous part is how the failure presents: the symptom looks like a failed research idea, a bad dataset, or a weak architecture before it looks like a kernel bug.
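Optimizer sensitivity can be illustrated with a small sketch. It assumes, purely for illustration, that the accumulation error acts like a constant shrink factor on one row's gradient; the real error is not that regular. The gradient values, the factor 0.25, and the function names are hypothetical, and AdamW's weight-decay term is omitted because it does not depend on the gradient.

```python
import math


def sgd_total(grads, lr=0.1):
    """Total parameter change from plain SGD over a sequence of gradients."""
    return sum(-lr * g for g in grads)


def adam_total(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Total parameter change from Adam-style updates (no weight decay)."""
    m = v = 0.0
    total = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        total += -lr * m_hat / (math.sqrt(v_hat) + eps)
    return total


true_grads = [4.0, 3.5, 4.2]                  # hypothetical correct gradients
shrunk_grads = [0.25 * g for g in true_grads]  # assumed effect of lost contributions

if __name__ == "__main__":
    print("SGD :", sgd_total(true_grads), sgd_total(shrunk_grads))
    print("Adam:", adam_total(true_grads), adam_total(shrunk_grads))
```

Because the Adam-style update divides by the running gradient magnitude, a uniformly shrunk gradient produces nearly the same step, while the SGD step shrinks by the same factor as the gradient. A kernel test that runs only one optimizer can therefore miss the error.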

AI-generated performance code is getting fast enough to matter. The next problem is whether it is correct under the messy distributions where models are actually trained and served. Benchmarks that reward speed will need broader verification, especially for kernels that sit inside repeated training steps and can quietly bias results.

