NVIDIA and SGLang Claim Major DeepSeek R1 Inference Speedups
Original: NVIDIA and SGLang report 25x DeepSeek R1 inference gain on GB300 NVL72 versus H200
Performance claims in the post
In an X post on March 3, 2026, NVIDIA AI Developer said its latest collaboration with SGLang delivered major DeepSeek R1 inference gains: up to 25x faster throughput on GB300 NVL72 versus H200, plus an 8x performance increase on GB200 NVL72 in less than four months. The post also states that the optimizations lower cost per token while improving large-scale MoE serving performance.
Technical levers cited by NVIDIA and SGLang
The announcement names three technical factors: NVFP4 precision, NVIDIA Dynamo-powered disaggregation, and improved computation-communication overlap. A quoted LMSYS post presents the same directional result and frames it as InferenceXv2 progress on Blackwell-class systems. The broader implication is that system-level serving design, not only model architecture, is now a primary lever for deployment economics in production MoE workloads.
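To make the first of those levers more concrete, here is a minimal sketch of block-scaled 4-bit floating-point quantization, the general idea behind formats such as NVFP4. The 16-element block size and the E2M1 value grid follow NVIDIA's public description of the format; everything else is an assumption for illustration. In particular, the per-block scale is kept as an ordinary Python float rather than an FP8 value, the function names are hypothetical, and none of this reflects the actual kernels used by SGLang or NVIDIA.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantize_block(values):
    """Quantize one block: return (scale, codes), where codes are signed grid values."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and codes."""
    return [c * scale for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


if __name__ == "__main__":
    weights = [0.02 * i - 0.3 for i in range(32)]  # made-up data, two blocks
    for scale, codes in quantize(weights):
        restored = dequantize_block(scale, codes)
        print(f"scale={scale:.4f} first values={restored[:4]}")
```

Because each block carries its own scale, a single outlier only degrades the precision of its own 16 values, which is the usual argument for fine-grained block scaling at very low bit widths.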
How to interpret the numbers
The reported multipliers are significant and relevant for operators planning hardware refresh cycles, but they are still vendor- and workload-specific claims. Throughput deltas can vary heavily by token-rate targets, sequence profiles, scheduling strategy, and kernel maturity. Even with that caveat, the disclosure is notable because it combines architecture-level upgrades with concrete serving-engine methods and ties them directly to real deployment cost signals.
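One way to sanity-check a throughput multiplier against cost is a quick back-of-the-envelope calculation. The sketch below is illustrative only: the 25x figure comes from the post, while the hourly prices and baseline token rate are made-up placeholders, and both throughput numbers are assumed to be measured on the same basis (for example, per GPU).

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def cost_ratio(speedup, price_ratio):
    """New cost per token divided by old cost per token.

    speedup:     new throughput / old throughput
    price_ratio: new hourly price / old hourly price
    """
    return price_ratio / speedup


# Hypothetical baseline: $10/hour at 1,000 tokens/s.
baseline = cost_per_million_tokens(hourly_cost_usd=10.0, tokens_per_second=1_000)

# Hypothetical upgrade: 25x the throughput at 3x the hourly price.
upgraded = cost_per_million_tokens(hourly_cost_usd=30.0, tokens_per_second=25_000)

if __name__ == "__main__":
    print(f"baseline: ${baseline:.2f} per 1M tokens")
    print(f"upgraded: ${upgraded:.2f} per 1M tokens")
    print(f"cost ratio: {cost_ratio(speedup=25, price_ratio=3):.2f}")
```

Under these placeholder numbers a 25x throughput gain yields roughly an 8x cost reduction rather than 25x, because the newer hardware is assumed to cost more per hour. The actual ratio depends entirely on real pricing and on whether the workload matches the one benchmarked.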
Sources: NVIDIA AI Developer X post, LMSYS quoted X post, LMSYS blog index