AgentPerf reframes AI infra: GB300 serves 20x more coding agents per MW

AI infrastructure is getting a benchmark built around the thing agent products actually stress: many long-running sessions with tool calls, variable context, and strict latency targets. NVIDIA’s June 12, 2026 AA-AgentPerf write-up puts concurrent coding agents per megawatt at the center of the comparison.

AA-AgentPerf, created by Artificial Analysis, is described in the NVIDIA Technical Blog as an open multi-vendor hardware benchmark for agentic workloads. Instead of asking only how fast a system emits tokens, it measures how many concurrent AI agents an inference system can support while meeting model-specific service-level objectives for output speed and time to first token.
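To make the metric concrete, here is a minimal sketch of picking the highest concurrency level that still meets both service-level objectives. The `Measurement` shape, the function names, the thresholds, and the sweep numbers are all hypothetical and chosen for illustration; this is not the benchmark's actual harness or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Per-agent metrics observed at one concurrency level (hypothetical shape)."""
    output_tokens_per_s: float  # output speed seen by each agent
    ttft_s: float               # time to first token, in seconds


def meets_slo(m: Measurement, min_tokens_per_s: float, max_ttft_s: float) -> bool:
    # Both objectives must hold: fast enough output and a short enough first-token wait.
    return m.output_tokens_per_s >= min_tokens_per_s and m.ttft_s <= max_ttft_s


def max_agents_within_slo(results: dict[int, Measurement],
                          min_tokens_per_s: float,
                          max_ttft_s: float) -> int:
    # Largest tested concurrency that satisfies both SLOs; 0 if none does.
    passing = [n for n, m in results.items()
               if meets_slo(m, min_tokens_per_s, max_ttft_s)]
    return max(passing, default=0)


# Made-up sweep for illustration only; these are not benchmark results.
sweep = {
    8: Measurement(output_tokens_per_s=55.0, ttft_s=0.8),
    16: Measurement(output_tokens_per_s=42.0, ttft_s=1.2),
    32: Measurement(output_tokens_per_s=31.0, ttft_s=1.9),
    64: Measurement(output_tokens_per_s=18.0, ttft_s=4.5),
}
best = max_agents_within_slo(sweep, min_tokens_per_s=30.0, max_ttft_s=2.0)
print(best)  # 32
```

In this toy sweep, 64 agents fails both objectives, so the reported capacity is 32 even though the system still emits tokens at the higher load.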

The benchmark uses prerecorded agentic coding trajectories that mix LLM calls and tool calls across public-repository tasks. Request lengths range from 5K to 131K tokens, with a mean of roughly 27K. Tool-call latency is simulated with a representative CPU-side baseline using a one-second median delay, and the test set remains private to reduce benchmark-specific tuning.
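As a rough illustration of simulating tool-call latency around a one-second median, the sketch below draws delays from a lognormal distribution. The one-second median comes from the description above; the lognormal shape, the `SIGMA` spread, and the function name are assumptions, since the write-up summarized here does not state which distribution the benchmark uses.

```python
import math
import random
import statistics

MEDIAN_DELAY_S = 1.0  # one-second median, as described in the article
SIGMA = 0.5           # assumed spread; not specified in the source


def sample_tool_delay(rng: random.Random,
                      median_s: float = MEDIAN_DELAY_S,
                      sigma: float = SIGMA) -> float:
    # A lognormal with parameter mu has median exp(mu), so mu = ln(median).
    return rng.lognormvariate(math.log(median_s), sigma)


rng = random.Random(0)  # seeded so the run is repeatable
delays = [sample_tool_delay(rng) for _ in range(10_001)]
observed_median = statistics.median(delays)
print(f"observed median delay: {observed_median:.2f} s")
```

A replay harness could sleep for each sampled delay between LLM calls, so that sessions hold their context while waiting on tools, as real agents do.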

The launch result NVIDIA highlights is the GB300 NVL72 versus H200 comparison on DeepSeek-V4-Pro. For the SLO=30 configuration, NVIDIA lists GB300 NVL72 at 61.4K concurrent agents per megawatt and 57.5 concurrent agents per GPU. H200 is listed at 2.6K concurrent agents per megawatt and 1.4 per GPU. NVIDIA summarizes that gap as up to 20x higher agentic coding performance than the previous generation.
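The listed figures can be turned into ratios with a few lines of arithmetic. This sketch only divides the rounded numbers quoted above, so the results are approximate; the dictionary keys are illustrative names, and how NVIDIA derives its "up to 20x" summary is not stated here.

```python
# Figures as quoted in the article (rounded by the source).
gb300 = {"agents_per_mw": 61_400, "agents_per_gpu": 57.5}
h200 = {"agents_per_mw": 2_600, "agents_per_gpu": 1.4}

# Ratio of GB300 NVL72 to H200 for each metric.
ratios = {metric: gb300[metric] / h200[metric] for metric in gb300}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
# agents_per_mw: 23.6x
# agents_per_gpu: 41.1x
```

On these rounded inputs, the per-megawatt gap is narrower than the per-GPU gap, which is why the choice of denominator matters when comparing systems.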

The broader signal is that agent infrastructure will be judged differently from chat infrastructure. A coding agent may call tools repeatedly, branch unpredictably, and keep a user waiting across many turns. For data-center buyers, the useful question becomes how many sessions stay inside the promised service level for a fixed power budget, not just which accelerator wins a short-token throughput chart.
