NVIDIA Claims Up to 50x Throughput/Watt and 35x Lower Token Costs With Blackwell Ultra for Agentic AI

Original: New SemiAnalysis InferenceX Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI

AI · Feb 17, 2026 · By Insights AI

What NVIDIA Reported

In a February 16, 2026 post, NVIDIA said new SemiAnalysis InferenceX data shows substantial inference efficiency gains from its Blackwell Ultra generation. For GB300 NVL72, the headline claims are up to 50x higher throughput per megawatt and up to 35x lower token cost versus the Hopper platform in low-latency agentic AI scenarios.
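To make the two headline metrics concrete, the sketch below computes throughput-per-megawatt and cost-per-token ratios from per-rack figures. All numbers here are invented placeholders, not NVIDIA or SemiAnalysis data; only the ratio arithmetic is the point.

```python
# Hypothetical illustration of the two headline metrics. The rack figures
# below are invented placeholders, NOT measurements from NVIDIA or
# SemiAnalysis; they are chosen only to show how the ratios are formed.

def throughput_per_mw(tokens_per_sec: float, power_mw: float) -> float:
    """Aggregate tokens/sec normalized by rack power in megawatts."""
    return tokens_per_sec / power_mw

def cost_per_million_tokens(dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Serving cost in dollars per one million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1e6

# Placeholder "Hopper-class" baseline vs. "Blackwell Ultra-class" rack.
hopper = {"tps": 10_000,  "mw": 0.040, "cost_hr": 300.0}
ultra  = {"tps": 400_000, "mw": 0.032, "cost_hr": 350.0}

tpmw_gain = (throughput_per_mw(ultra["tps"], ultra["mw"])
             / throughput_per_mw(hopper["tps"], hopper["mw"]))
cost_drop = (cost_per_million_tokens(hopper["cost_hr"], hopper["tps"])
             / cost_per_million_tokens(ultra["cost_hr"], ultra["tps"]))

print(f"throughput/MW gain: {tpmw_gain:.0f}x")
print(f"token cost reduction: {cost_drop:.0f}x")
```

Note that a higher raw dollar cost per rack-hour can still yield a much lower cost per token when throughput rises faster, which is the shape of the claim NVIDIA is making.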

The post frames these gains around a specific demand pattern: software-oriented AI workloads. NVIDIA cites OpenRouter data indicating that software-programming-related AI queries increased from about 11% to roughly 50% over the last year, making low latency and long-context handling more commercially important for coding assistants and AI agents.

Why NVIDIA Says Performance Improved

NVIDIA attributes the results to hardware-software co-design rather than silicon alone, highlighting continuous optimizations across TensorRT-LLM, Dynamo, Mooncake, and SGLang. According to the company, these changes have materially improved Blackwell NVL72 MoE inference throughput across latency targets, including up to 5x gains on GB200 in low-latency workloads versus four months earlier. The cited optimizations include:

  • Kernel-level optimization for higher low-latency throughput.
  • NVLink Symmetric Memory for more efficient multi-GPU memory access.
  • Programmatic dependent launch to reduce idle gaps between kernels.

NVIDIA also said GB300 NVL72 improves long-context economics. In a workload example with 128,000-token input and 8,000-token output, the company reports up to 1.5x lower token cost versus GB200 NVL72.
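For the long-context example NVIDIA cites (128,000 input tokens, 8,000 output tokens), a blended per-request cost can be sketched as below. The per-million-token prices are invented placeholders, not published rates; only the structure of the calculation and the claimed up-to-1.5x reduction come from the post.

```python
# Hypothetical per-request cost for the cited long-context workload:
# 128,000 input tokens and 8,000 output tokens. Prices are invented
# placeholders; only the blended-cost arithmetic is illustrated.

INPUT_TOKENS = 128_000
OUTPUT_TOKENS = 8_000

def request_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    """Blended dollar cost of one request, given $-per-1M-token prices."""
    return (INPUT_TOKENS * price_in_per_m + OUTPUT_TOKENS * price_out_per_m) / 1e6

# Placeholder GB200 NVL72-class prices ($ per 1M tokens, assumed).
gb200_cost = request_cost(price_in_per_m=2.00, price_out_per_m=6.00)

# Applying the claimed up-to-1.5x token-cost reduction for GB300 NVL72:
gb300_cost = gb200_cost / 1.5

print(f"GB200-class request: ${gb200_cost:.3f}")
print(f"GB300-class request: ${gb300_cost:.3f}")
```

Because input tokens dominate at this ratio (16:1), even a modest per-token reduction compounds into a noticeable per-request saving for long-context serving.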

Deployment Signals and Next Step

The post names Microsoft, CoreWeave, and Oracle Cloud Infrastructure as deploying GB300 NVL72 for low-latency and long-context use cases, including agentic coding and coding assistants. NVIDIA positions this as early proof that lower token costs can expand real-time, multi-step AI interactions to larger user bases.

Looking forward, NVIDIA points to Rubin as the next platform jump, claiming up to 10x higher throughput per megawatt for MoE inference versus Blackwell and lower training GPU requirements for large MoE models. As with all vendor benchmarks, realized production gains will depend on model mix, serving strategy, and workload constraints.


© 2026 Insights. All rights reserved.