Coding agents are being tested on GPU performance work, not just app scaffolding. Cursor says its NVIDIA collaboration produced a 38% geomean speedup across 235 CUDA kernel problems in three weeks.
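A geomean speedup like the one quoted is the n-th root of the product of per-problem speedup ratios, which keeps one huge outlier from dominating the average. A minimal sketch (the sample ratios below are hypothetical, not Cursor's data):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios (>1 means faster)."""
    # Sum logs instead of multiplying, to avoid overflow on long products.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-kernel ratios; a 38% geomean would correspond to ~1.38.
example = [1.10, 1.65, 1.25, 1.55]
print(round(geomean_speedup(example), 3))
```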
#cuda
A front-page Hacker News discussion resurfaced an EE Times interview outlining how AMD wants ROCm, Triton, OneROCm, and an open-source release model to chip away at CUDA dependence. The real test is not a headline compatibility claim, but whether stacks like vLLM and SGLang work in a boring, dependable way.
A r/MachineLearning thread and linked benchmark writeup argue that cuBLAS may be choosing an inefficient kernel for batched FP32 SGEMM on RTX 5090, leaving much of the GPU idle. The significance is not just the claimed slowdown, but that the post includes reproducible benchmark tables, profiling notes, and linked repro material.
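One way to sanity-check a claim like this is to convert a measured batched-GEMM time into achieved TFLOP/s and compare against the card's peak. A sketch of that arithmetic (the timing and peak figure below are placeholders, not measurements from the post or the 5090 spec sheet):

```python
def batched_sgemm_flops(batch, m, n, k):
    # Each GEMM costs 2*m*n*k FLOPs (one multiply + one add per inner-product term).
    return 2.0 * batch * m * n * k

def achieved_tflops(batch, m, n, k, seconds):
    return batched_sgemm_flops(batch, m, n, k) / seconds / 1e12

# Hypothetical run: 256 batched 512x512x512 FP32 GEMMs in 5 ms.
tflops = achieved_tflops(256, 512, 512, 512, 5e-3)
peak_fp32_tflops = 100.0  # placeholder peak, not an actual 5090 figure
print(f"{tflops:.1f} TFLOP/s, {100 * tflops / peak_fp32_tflops:.0f}% of assumed peak")
```

A large gap between achieved and peak throughput on a compute-bound shape is exactly the kind of signal the post's profiling tables surface.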
A high-signal r/LocalLLaMA benchmark post said moving Qwen 3.5 27B from mainline llama.cpp to ik_llama.cpp raised prompt evaluation from about 43 tok/sec to 1,122 tok/sec on a Blackwell RTX PRO 4000, with generation climbing from 7.5 tok/sec to 26 tok/sec.
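The headline ratios implied by those numbers are straightforward to check:

```python
def speedup(new_tok_s, old_tok_s):
    """Ratio of new to old throughput in tokens/sec."""
    return new_tok_s / old_tok_s

# Figures reported in the post.
print(f"prompt eval: {speedup(1122, 43):.1f}x")
print(f"generation:  {speedup(26, 7.5):.2f}x")
```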
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
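Speed-of-Light bounds are typically derived from a roofline-style model: a kernel can run no faster than the larger of its compute time and its memory-traffic time at the hardware's peaks. A minimal sketch of that scoring idea (the peak figures below are placeholders, not B200 specs, and SOL-ExecBench's exact methodology may differ):

```python
def speed_of_light_seconds(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    # Roofline-style lower bound: the kernel is limited by whichever
    # resource (compute or memory traffic) it saturates first.
    return max(flops / peak_flops_per_s, bytes_moved / peak_bytes_per_s)

def sol_fraction(measured_seconds, flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    # 1.0 means the kernel runs exactly at the hardware bound.
    return speed_of_light_seconds(
        flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s) / measured_seconds

# Hypothetical memory-bound kernel: 1 GFLOP and 1 GB moved on a device with
# 1 PFLOP/s compute and 1 TB/s bandwidth (placeholder figures).
bound = speed_of_light_seconds(1e9, 1e9, 1e15, 1e12)  # memory-limited: 1 ms
print(sol_fraction(2e-3, 1e9, 1e9, 1e15, 1e12))       # runs at half of SOL
```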
A LocalLLaMA thread amplified Phoronix coverage of GreenBoost, an experimental GPLv2 Linux module that adds a multi-tier memory path for NVIDIA GPUs. The design pairs a kernel module with a CUDA shim so large allocations can spill from limited on-card vRAM into pinned system RAM and NVMe-backed storage without modifying CUDA applications.
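The spill behavior described can be illustrated with a toy tiering policy: place each allocation in the fastest tier with room, falling back down the hierarchy otherwise. This is an invented sketch, not GreenBoost's actual allocator logic, and the capacities are made up:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity: int  # bytes
    used: int = 0

    def free(self):
        return self.capacity - self.used

def allocate(tiers, size):
    """Place an allocation in the first tier with room, spilling down otherwise."""
    for tier in tiers:  # tiers ordered fastest-first: vRAM, pinned RAM, NVMe
        if tier.free() >= size:
            tier.used += size
            return tier.name
    raise MemoryError("all tiers exhausted")

tiers = [Tier("vram", 16 << 30), Tier("pinned_ram", 64 << 30), Tier("nvme", 512 << 30)]
print(allocate(tiers, 12 << 30))  # fits in on-card vRAM
print(allocate(tiers, 8 << 30))   # vRAM has only 4 GB free, spills to pinned RAM
```

The interesting part of the real design is that this placement happens under a CUDA shim, so applications see ordinary device allocations.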
A Reddit post in r/singularity highlighted CUDA Agent, a ByteDance Seed and Tsinghua AIR project that reports high pass rates and speedups over torch.compile on KernelBench.
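KernelBench-style results combine two things: whether a generated kernel is numerically correct, and how fast it is relative to a baseline such as torch.compile. A sketch of that kind of metric (modeled on KernelBench's fast_p idea, though the benchmark's exact definition may differ):

```python
def fast_p(results, p=1.0):
    """Fraction of problems where the generated kernel is both correct
    and more than p times faster than the baseline.

    results: list of (correct: bool, speedup_over_baseline: float).
    """
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Hypothetical per-problem outcomes: (correctness, speedup vs torch.compile).
runs = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
print(fast_p(runs, p=1.0))  # correct-and-faster fraction
```

Note that an incorrect kernel never counts, no matter how fast it is, which is why pass rate and speedup have to be read together.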
A high-scoring Hacker News post highlighted BarraCUDA, an open-source C99 compiler that translates CUDA `.cu` code directly into AMD GFX11 `.hsaco` binaries with no LLVM dependency.