NVIDIA Claims Up to 50x Throughput/Watt and 35x Lower Token Costs With Blackwell Ultra for Agentic AI
Original: New SemiAnalysis InferenceX Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI
What NVIDIA Reported
In a February 16, 2026 post, NVIDIA said new SemiAnalysis InferenceX data shows substantial inference efficiency gains from its Blackwell Ultra generation. For GB300 NVL72, the headline claims are up to 50x higher throughput per megawatt and up to 35x lower token cost versus the Hopper platform in low-latency agentic AI scenarios.
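To put the "up to Nx" multipliers in more familiar terms, the short sketch below converts a claimed improvement factor into a fractional reduction. The helper name `reduction_from_factor` is made up for illustration, and the conversion assumes the vendor's "35x lower" means the new cost is the old cost divided by 35. It is plain arithmetic on the headline numbers, not additional benchmark data.

```python
def reduction_from_factor(factor: float) -> float:
    """Fractional reduction implied by an 'N times lower' claim.

    Assumes 'N times lower' means new_value = old_value / N.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - 1.0 / factor


# Headline claims from the post (upper bounds, "up to").
token_cost_reduction = reduction_from_factor(35)        # 35x lower token cost
energy_per_token_reduction = reduction_from_factor(50)  # 50x throughput per megawatt

print(f"Token cost reduction: {token_cost_reduction:.1%}")
print(f"Energy per token reduction: {energy_per_token_reduction:.1%}")
```

Read this way, a 35x lower token cost is a reduction of roughly 97%, and 50x higher throughput per megawatt corresponds to about 98% less energy per token at the same power draw, both as best-case figures.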
The post frames these gains around a specific demand pattern: coding-oriented AI workloads. NVIDIA cites OpenRouter data indicating that the share of AI queries related to software programming rose from about 11% to roughly 50% over the last year, which makes low latency and long-context handling more commercially important for coding assistants and AI agents.
Why NVIDIA Says Performance Improved
NVIDIA attributes the results to hardware-software co-design rather than silicon alone. It highlights continuous optimizations across TensorRT-LLM, Dynamo, Mooncake, and SGLang. According to the company, these changes have materially improved Blackwell NVL72 mixture-of-experts (MoE) inference throughput across latency targets, including up to 5x gains on GB200 in low-latency workloads versus four months earlier. Among the techniques it names:
- Kernel-level optimization for higher low-latency throughput.
- NVLink Symmetric Memory for more efficient multi-GPU memory access.
- Programmatic dependent launch to reduce idle gaps between kernels.
NVIDIA also said GB300 NVL72 improves long-context economics. In a workload example with a 128,000-token input and an 8,000-token output, the company reports up to 1.5x lower token cost versus GB200 NVL72.
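As a rough illustration of what a 1.5x factor means for one long-context request, the sketch below prices the 128,000-token input and 8,000-token output example. The baseline price of $2.00 per million tokens is a hypothetical placeholder, not a figure from NVIDIA or SemiAnalysis. Applying a single blended price to input and output tokens alike is also an assumption, since the post does not say how token cost is split.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_per_million: float) -> float:
    """Cost of one request under a single blended per-token price (assumption)."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * price_per_million


# Workload shape from the post.
INPUT_TOKENS = 128_000
OUTPUT_TOKENS = 8_000

# Hypothetical baseline price on GB200 NVL72, in dollars per million tokens.
GB200_PRICE = 2.00
# "Up to 1.5x lower token cost" is read here as price / 1.5.
GB300_PRICE = GB200_PRICE / 1.5

gb200_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, GB200_PRICE)
gb300_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, GB300_PRICE)

print(f"GB200 NVL72 (hypothetical): ${gb200_cost:.4f} per request")
print(f"GB300 NVL72 (best case):    ${gb300_cost:.4f} per request")
```

Whatever baseline price is plugged in, the ratio between the two results stays at 1.5, so the per-request saving in the best case is about one third.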
Deployment Signals and Next Step
The post names Microsoft, CoreWeave, and Oracle Cloud Infrastructure as deploying GB300 NVL72 for low-latency and long-context use cases, including agentic coding and coding assistants. NVIDIA positions this as early proof that lower token costs can expand real-time, multi-step AI interactions to larger user bases.
Looking forward, NVIDIA points to Rubin as the next platform jump, claiming up to 10x higher throughput per megawatt for MoE inference versus Blackwell and lower training GPU requirements for large MoE models. As with all vendor benchmarks, realized production gains will depend on model mix, serving strategy, and workload constraints.