NVIDIA Claims Up to 50x Throughput per Megawatt and 35x Lower Token Costs With Blackwell Ultra for Agentic AI
Original: New SemiAnalysis InferenceX Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI
What NVIDIA Reported
In a February 16, 2026 post, NVIDIA said new SemiAnalysis InferenceX data shows substantial inference efficiency gains from its Blackwell Ultra generation. For GB300 NVL72, the headline claims are up to 50x higher throughput per megawatt and up to 35x lower token cost versus the Hopper platform in low-latency agentic AI scenarios.
The post frames these gains around a specific demand pattern: software-oriented AI workloads. NVIDIA cites OpenRouter data indicating that the share of programming-related AI queries grew from about 11% to roughly 50% over the past year, making low latency and long-context handling commercially more important for coding assistants and AI agents.
Why NVIDIA Says Performance Improved
NVIDIA attributes the results to hardware-software co-design rather than to silicon alone, highlighting continuous optimizations across TensorRT-LLM, Dynamo, Mooncake, and SGLang. According to the company, these changes have materially improved Blackwell NVL72 MoE inference throughput across latency targets, including up to 5x gains on GB200 in low-latency workloads versus four months earlier. Three mechanisms get particular credit:
- Kernel-level optimization for higher low-latency throughput.
- NVLink Symmetric Memory for more efficient multi-GPU memory access.
- Programmatic dependent launch to reduce idle gaps between kernels (see the sketch after this list).
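
To make the last bullet concrete, here is a minimal CUDA sketch of programmatic dependent launch (PDL). The kernel bodies, names, and sizes are hypothetical and not taken from TensorRT-LLM; the PDL mechanism itself (cudaLaunchKernelEx with the programmatic stream serialization attribute, plus the two device-side intrinsics) is standard CUDA, available since CUDA 11.8 on sm_90 (Hopper) and newer GPUs.

```cuda
// Minimal PDL sketch, assuming CUDA 11.8+ and an sm_90 (Hopper) or newer GPU.
// Kernels and sizes are illustrative; error checking is omitted for brevity.
// Build: nvcc -arch=sm_90 pdl_sketch.cu
#include <cuda_runtime.h>

__global__ void primary(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;
    // Signal that this grid's results are produced, so the dependent grid's
    // launch may begin overlapping with this kernel's tail.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void secondary(const float* buf, float* out, int n) {
    // Work that does not read `buf` could run here, overlapped with `primary`.
    // Block until `primary`'s writes are visible before consuming them.
    cudaGridDependencySynchronize();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    primary<<<grid, block>>>(buf, n);  // ordinary launch

    // Launch `secondary` with the PDL attribute: the runtime may start its
    // launch before `primary` finishes, shrinking the inter-kernel gap.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim = grid;
    cfg.blockDim = block;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;  // same (default) stream as `primary`, as PDL requires

    cudaLaunchKernelEx(&cfg, secondary, (const float*)buf, out, n);
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

Low-latency decode loops are built from many short kernels, so the per-launch gap PDL removes can be a visible fraction of step time; this is the kind of kernel-scheduling change NVIDIA groups under these optimizations.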
NVIDIA also said GB300 NVL72 improves long-context economics. In a workload example with 128,000-token input and 8,000-token output, the company reports up to 1.5x lower token cost versus GB200 NVL72.
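
A back-of-the-envelope reading of that claim, treating input and output tokens at one blended price for simplicity (the price below is a hypothetical assumption, not an NVIDIA figure):

$$
128{,}000\ \text{input tokens} + 8{,}000\ \text{output tokens} = 136{,}000\ \text{tokens per request}
$$

$$
\text{cost per request} = 136{,}000 \times \frac{c}{10^{6}}
$$

At a hypothetical blended price of c = \$1.00 per million tokens on GB200 NVL72, that is about \$0.136 per request; a 1.5x lower token cost on GB300 NVL72 brings it to roughly \$0.091.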
Deployment Signals and Next Step
The post names Microsoft, CoreWeave, and Oracle Cloud Infrastructure as deploying GB300 NVL72 for low-latency and long-context use cases, including agentic coding and coding assistants. NVIDIA positions this as early proof that lower token costs can expand real-time, multi-step AI interactions to larger user bases.
Looking forward, NVIDIA points to Rubin as the next platform jump, claiming up to 10x higher throughput per megawatt for MoE inference versus Blackwell and lower training GPU requirements for large MoE models. As with all vendor benchmarks, realized production gains will depend on model mix, serving strategy, and workload constraints.
Related Articles
NVIDIA said on March 16, 2026 that Dynamo 1.0 is entering production as open source software for generative and agentic inference at scale. The company says the stack can raise Blackwell inference performance by up to 7x and is already supported across major cloud providers, inference platforms, and AI-native companies.
On March 17, 2026, NVIDIA's NVIDIADC account on X described Groq 3 LPX as a new rack-scale, low-latency inference accelerator for the Vera Rubin platform. NVIDIA's March 16 press release and technical blog say LPX brings 256 LPUs, 128 GB of on-chip SRAM, and 640 TB/s of scale-up bandwidth into a heterogeneous inference path with Vera Rubin NVL72 for agentic AI workloads.