AgentPerf reframes AI infra: GB300 serves 20x more coding agents per MW
Original: NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
AI infrastructure is getting a benchmark built around the thing agent products actually stress: many long-running sessions with tool calls, variable context, and strict latency targets. NVIDIA’s June 12, 2026 AA-AgentPerf write-up puts concurrent coding agents per megawatt at the center of the comparison.
AA-AgentPerf, created by Artificial Analysis, is described in the NVIDIA Technical Blog as an open multi-vendor hardware benchmark for agentic workloads. Instead of asking only how fast a system emits tokens, it measures how many concurrent AI agents an inference system can support while meeting model-specific service-level objectives for output speed and time to first token.
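To make the metric concrete, here is a minimal Python sketch of "highest concurrency that still meets the SLOs." It is not AA-AgentPerf's code. The `LoadPoint` type, the `max_agents_within_slo` helper, the thresholds, and the measurements are all hypothetical, and the stop-at-first-violation rule is an assumption about how such a sweep might be scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPoint:
    concurrency: int      # simultaneous agent sessions in this test run
    output_speed: float   # per-session output tokens per second
    ttft_s: float         # time to first token, in seconds


def max_agents_within_slo(points, min_output_speed, max_ttft_s):
    """Highest tested concurrency at which this and every lower level meets both SLOs."""
    best = 0
    for p in sorted(points, key=lambda p: p.concurrency):
        if p.output_speed < min_output_speed or p.ttft_s > max_ttft_s:
            break  # first violation: higher loads are not counted
        best = p.concurrency
    return best


# Made-up measurements, for illustration only.
runs = [
    LoadPoint(8, 55.0, 0.4),
    LoadPoint(16, 42.0, 0.7),
    LoadPoint(32, 31.0, 1.5),
    LoadPoint(64, 18.0, 3.9),
]
print(max_agents_within_slo(runs, min_output_speed=30.0, max_ttft_s=2.0))
```

With these invented numbers the 64-session run misses both targets, so the score is 32. Tightening the time-to-first-token limit to one second would drop it to 16, which is why the SLO settings matter as much as the hardware.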
The benchmark uses prerecorded agentic coding trajectories that mix LLM calls and tool calls across public-repository tasks. Request lengths range from 5K to 131K tokens, with a mean of roughly 27K. Tool-call latency is simulated with a representative CPU-side baseline using a one-second median delay, and the test set remains private to reduce benchmark-specific tuning.
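The simulated tool-call delay can be pictured with a short sketch. The write-up gives only the one-second median, so the log-normal shape, the `sigma` spread, and the `sample_tool_delays` helper below are assumptions chosen for illustration, not the benchmark's actual latency model.

```python
import math
import random
import statistics


def sample_tool_delays(n, median_s=1.0, sigma=0.5, seed=0):
    """Draw n simulated tool-call delays (seconds) with the given median."""
    rng = random.Random(seed)
    mu = math.log(median_s)  # a log-normal's median is exp(mu)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


delays = sample_tool_delays(20_001)
print(f"median delay: {statistics.median(delays):.2f} s")
```

A skewed distribution like this keeps the median at one second while allowing occasional slow tool calls, which is the kind of variability that makes agent sessions harder to schedule than uniform chat requests.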
The launch result NVIDIA highlights is the GB300 NVL72 versus H200 comparison on DeepSeek-V4-Pro. For the SLO=30 configuration, NVIDIA lists 61.4K concurrent agents per megawatt and 57.5 concurrent agents per GPU on GB300 NVL72. H200 is listed at 2.6K concurrent agents per megawatt and 1.4 per GPU. NVIDIA summarizes that gap as up to 20x higher agentic coding performance than the previous generation.
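The size of the gap can be checked directly from the figures quoted above. The sketch below only divides the rounded numbers as listed. The variable names and the `ratio` helper are illustrative, and because the inputs are rounded to two or three significant figures, the results are approximate.

```python
# Figures as quoted for the SLO=30 configuration (rounded in the source).
gb300_agents_per_mw = 61_400
gb300_agents_per_gpu = 57.5
h200_agents_per_mw = 2_600
h200_agents_per_gpu = 1.4


def ratio(new, old):
    """How many times larger `new` is than `old`."""
    return new / old


per_mw_ratio = ratio(gb300_agents_per_mw, h200_agents_per_mw)
per_gpu_ratio = ratio(gb300_agents_per_gpu, h200_agents_per_gpu)

print(f"per megawatt: {per_mw_ratio:.1f}x")
print(f"per GPU:      {per_gpu_ratio:.1f}x")
```

On these rounded inputs the per-megawatt ratio comes out near 23.6x and the per-GPU ratio near 41.1x. Both sit above the "up to 20x" summary, and the write-up as excerpted here does not say which configuration or rounding the headline number reflects.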
The broader signal is that agent infrastructure will be judged differently from chat infrastructure. A coding agent may call tools repeatedly, branch unpredictably, and keep a user waiting across many turns. For data-center buyers, the useful question becomes how many sessions stay inside the promised service level for a fixed power budget, not just which accelerator tops a raw token-throughput chart.