NVIDIA and SGLang Claim Major DeepSeek R1 Inference Speedups

Performance claims in the post

In an X post on March 3, 2026, NVIDIA AI Developer said its latest collaboration with SGLang delivered major DeepSeek R1 inference gains: up to 25x faster throughput on GB300 NVL72 versus H200, plus an 8x performance increase on GB200 NVL72 in less than four months. The post also states that the optimizations lower cost per token while improving large-scale MoE serving performance.

Technical levers cited by NVIDIA and SGLang

The announcement names three technical factors: NVFP4 precision, NVIDIA Dynamo-powered disaggregation, and improved computation-communication overlap. A quoted LMSYS post reports the same directional result and frames it as InferenceXv2 progress on Blackwell-class systems. The broader implication is that system-level serving design, and not only model architecture, is now a primary lever for deployment economics in production MoE workloads.

How to interpret the numbers

The reported multipliers are large and relevant for operators planning hardware refresh cycles, but they remain vendor-reported, workload-specific claims. Throughput deltas can vary widely with token-rate targets, sequence profiles, scheduling strategy, and kernel maturity. Even with that caveat, the disclosure is notable because it pairs hardware-generation upgrades with concrete serving-engine methods and ties both directly to cost per token.
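As a rough illustration of why a throughput multiplier does not map one-to-one onto cost per token, the sketch below converts a throughput figure and an hourly system cost into a cost per million tokens. The function names and every number in the usage example are hypothetical placeholders, not figures from the NVIDIA or LMSYS posts; only the 25x multiplier echoes the claim above.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def cost_ratio(speedup: float, hourly_price_ratio: float) -> float:
    """New cost per token divided by old cost per token.

    speedup: new throughput / old throughput (e.g. a claimed 25x).
    hourly_price_ratio: new hourly system cost / old hourly system cost.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return hourly_price_ratio / speedup


if __name__ == "__main__":
    # Hypothetical inputs: the baseline costs 10 USD/hour at 1,000 tokens/s;
    # the newer system is assumed to cost 3x as much per hour.
    old = cost_per_million_tokens(hourly_cost_usd=10.0, tokens_per_second=1_000)
    new = cost_per_million_tokens(hourly_cost_usd=30.0, tokens_per_second=25_000)
    print(f"old: {old:.2f} USD/M tokens, new: {new:.2f} USD/M tokens")
    print(f"cost ratio: {cost_ratio(speedup=25, hourly_price_ratio=3):.2f}")
```

Under these assumed inputs, the cost advantage is smaller than the throughput multiplier because the newer system is also assumed to cost more per hour. Operators would need their own measured throughput and pricing to draw any conclusion.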

Sources: NVIDIA AI Developer X post, LMSYS quoted X post, LMSYS blog index

