NVIDIA Claims Up to 50x Throughput per Megawatt and 35x Lower Token Costs With Blackwell Ultra for Agentic AI
Original: New SemiAnalysis InferenceX Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI
What NVIDIA Reported
In a February 16, 2026 post, NVIDIA said new SemiAnalysis InferenceX data shows substantial inference efficiency gains from its Blackwell Ultra generation. For GB300 NVL72, the headline claims are up to 50x higher throughput per megawatt and up to 35x lower token cost versus the Hopper platform in low-latency agentic AI scenarios.
The post frames these gains around a specific demand pattern: software-oriented AI workloads. NVIDIA cites OpenRouter data indicating that the share of programming-related AI queries grew from about 11% to roughly 50% over the past year, making low latency and long-context handling commercially more important for coding assistants and AI agents.
Why NVIDIA Says Performance Improved
NVIDIA attributes the results to hardware-software co-design rather than to silicon alone, highlighting continuous optimizations across TensorRT-LLM, Dynamo, Mooncake, and SGLang. According to the company, these changes have materially improved Blackwell NVL72 MoE inference throughput across latency targets, including up to 5x gains on GB200 in low-latency workloads versus four months earlier. Three mechanisms get particular credit:
- Kernel-level optimization for higher low-latency throughput.
- NVLink Symmetric Memory for more efficient multi-GPU memory access.
- Programmatic dependent launch to reduce idle gaps between kernels (see the sketch after this list).
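
To make the last bullet concrete, here is a minimal CUDA sketch of programmatic dependent launch (PDL). The kernel bodies, names, and sizes are hypothetical and not taken from TensorRT-LLM; the PDL mechanism itself (cudaLaunchKernelEx with the programmatic stream serialization attribute, plus the two device-side intrinsics) is standard CUDA, available since CUDA 11.8 on sm_90 (Hopper) and newer GPUs.

```cuda
// Minimal PDL sketch, assuming CUDA 11.8+ and an sm_90 (Hopper) or newer GPU.
// Kernels and sizes are illustrative; error checking is omitted for brevity.
// Build: nvcc -arch=sm_90 pdl_sketch.cu
#include <cuda_runtime.h>

__global__ void primary(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;
    // Signal that this grid's results are produced, so the dependent grid's
    // launch may begin overlapping with this kernel's tail.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void secondary(const float* buf, float* out, int n) {
    // Work that does not read `buf` could run here, overlapped with `primary`.
    // Block until `primary`'s writes are visible before consuming them.
    cudaGridDependencySynchronize();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    primary<<<grid, block>>>(buf, n);  // ordinary launch

    // Launch `secondary` with the PDL attribute: the runtime may start its
    // launch before `primary` finishes, shrinking the inter-kernel gap.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim = grid;
    cfg.blockDim = block;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;  // same (default) stream as `primary`, as PDL requires

    cudaLaunchKernelEx(&cfg, secondary, (const float*)buf, out, n);
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

Low-latency decode loops are built from many short kernels, so the per-launch gap PDL removes can be a visible fraction of step time; this is the kind of kernel-scheduling change NVIDIA groups under these optimizations.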
NVIDIA also said GB300 NVL72 improves long-context economics. In a workload example with 128,000-token input and 8,000-token output, the company reports up to 1.5x lower token cost versus GB200 NVL72.
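
A back-of-the-envelope reading of that claim, treating input and output tokens at one blended price for simplicity (the price below is a hypothetical assumption, not an NVIDIA figure):

$$
128{,}000\ \text{input tokens} + 8{,}000\ \text{output tokens} = 136{,}000\ \text{tokens per request}
$$

$$
\text{cost per request} = 136{,}000 \times \frac{c}{10^{6}}
$$

At a hypothetical blended price of c = \$1.00 per million tokens on GB200 NVL72, that is about \$0.136 per request; a 1.5x lower token cost on GB300 NVL72 brings it to roughly \$0.091.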
Deployment Signals and Next Step
The post names Microsoft, CoreWeave, and Oracle Cloud Infrastructure as deploying GB300 NVL72 for low-latency and long-context use cases, including agentic coding and coding assistants. NVIDIA positions this as early proof that lower token costs can expand real-time, multi-step AI interactions to larger user bases.
Looking forward, NVIDIA points to Rubin as the next platform jump, claiming up to 10x higher throughput per megawatt for MoE inference versus Blackwell and lower training GPU requirements for large MoE models. As with all vendor benchmarks, realized production gains will depend on model mix, serving strategy, and workload constraints.
Related Articles
NVIDIA said on March 16, 2026 that Dynamo 1.0 is entering production as open source software for generative and agentic inference at scale. The company says the stack can raise Blackwell inference performance by up to 7x and is already supported across major cloud providers, inference platforms, and AI-native companies.
On March 17, 2026, NVIDIA's NVIDIADC account on X described Groq 3 LPX as a new rack-scale, low-latency inference accelerator for the Vera Rubin platform. NVIDIA's March 16 press release and technical blog say LPX brings 256 LPUs, 128 GB of on-chip SRAM, and 640 TB/s of scale-up bandwidth into a heterogeneous inference path with Vera Rubin NVL72 for agentic AI workloads.