LMSYS posts Day-0 DeepSeek-V4 speeds up to 266 tok/s on H200
Original: LMSYS published Day-0 DeepSeek-V4 inference and RL support results
Why this tweet mattered more than another benchmark brag
LMSYS published the kind of systems post that determines whether a model is usable on day one. The account announced Day-0 support for DeepSeek-V4 through SGLang and Miles, attaching concrete throughput numbers rather than broad superlatives: 199 tok/s on B200 for the 1.6T Pro model and 266 tok/s on H200 for the 284B Flash model, both at 4K context, with the claim that throughput holds up at 900K context with only modest drop-off.
“199 tok/s on B200… 266 tok/s on H200… throughput stays strong at 900K context.”
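Headline tok/s figures like these are straightforward to sanity-check against a local deployment. Below is a minimal sketch that times a single generation against an SGLang server's OpenAI-compatible endpoint (SGLang exposes one by default, typically on port 30000). The model id is a hypothetical placeholder, and end-to-end timing folds prefill into the figure, so it will understate steady-state decode speed; streaming timestamps would be needed for a per-token measurement.

```python
# Minimal sketch: rough decode-throughput check against a local SGLang
# server via its OpenAI-compatible API. The port and the model id
# "deepseek-ai/DeepSeek-V4-Flash" are assumptions, not values confirmed
# by the LMSYS post.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize attention in one paragraph."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s (end-to-end, incl. prefill)")
```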
LMSYS is not a random hype account. It is one of the more closely watched accounts for model evaluation and serving-stack work, and its linked blog is a dense technical breakdown rather than a short promo page. The article, dated April 25, 2026, describes Day-0 inference and RL support for DeepSeek-V4 and spells out what had to be built around hybrid sparse attention, manifold-constrained hyper-connections, and FP4 expert weights. It also notes a 1M-token context window and claims up to 3x throughput improvement for long-context serving via HiSparse.
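Of those pieces, the FP4 expert weights are the easiest to picture concretely. The sketch below illustrates generic block-scaled e2m1 quantization, the format family behind most FP4 schemes, in PyTorch. It is an illustration of the idea only, not LMSYS's or DeepSeek's actual kernel: the block size of 16 and absmax scaling are assumptions, and a real deployment would store packed 4-bit codes plus scales rather than the dequantized view returned here.

```python
# Minimal sketch of block-scaled FP4 (e2m1) quantization for expert
# weights. Generic illustration only; block size and scaling are assumptions.
import torch

# The eight non-negative magnitudes representable in e2m1.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(w: torch.Tensor, block: int = 16):
    """Quantize a flat weight tensor to e2m1 values with one scale per block."""
    w = w.reshape(-1, block)
    scale = w.abs().amax(dim=1, keepdim=True) / E2M1_GRID.max()  # absmax scaling
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    normed = (w / scale).abs()
    # Snap each normalized magnitude to the nearest representable e2m1 value.
    idx = (normed.unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * w.sign()
    return q, scale  # dequantized-view codes and per-block scales

w = torch.randn(4096)
q, scale = quantize_fp4_blockwise(w)
print("max abs error:", (w.reshape(-1, 16) - q * scale).abs().max().item())
```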
Why systems support is the real product here
Benchmark headlines often flatten the hard part of model launches. A new checkpoint matters only if inference kernels, cache strategies, expert routing, and training stacks can keep up. LMSYS’s post is interesting because it treats DeepSeek-V4 as a deployment problem, not just a model artifact. The linked write-up also claims that a fused compression path can reach up to 80% of peak memory bandwidth on H200 and run more than 10x faster than a naive PyTorch pipeline in that stage.
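The "80% of peak memory bandwidth" framing is the standard metric for memory-bound kernels: bytes moved divided by elapsed time, compared against the part's rated HBM bandwidth. Here is a minimal sketch of that measurement, using a stand-in elementwise copy rather than the actual fused compression path from the post; the 4.8 TB/s figure is H200's published HBM3e bandwidth.

```python
# Minimal sketch: estimating achieved memory bandwidth for a memory-bound
# stage. The kernel is a stand-in copy, not the fused compression path
# described by LMSYS; 4.8e12 B/s is H200's rated HBM3e bandwidth.
import torch

def achieved_bandwidth(fn, bytes_moved, iters=50):
    # Warm up, then time with CUDA events to exclude host-side overhead.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    return bytes_moved / seconds

x = torch.empty(1 << 28, dtype=torch.bfloat16, device="cuda")  # 512 MiB
y = torch.empty_like(x)
# One read of x plus one write of y per call.
bw = achieved_bandwidth(lambda: y.copy_(x),
                        bytes_moved=2 * x.numel() * x.element_size())
print(f"{bw / 1e12:.2f} TB/s, {100 * bw / 4.8e12:.0f}% of H200 peak")
```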
What to watch next is whether other open-source serving stacks reproduce these numbers and whether launch-day support turns into stable support once real user traffic arrives. If the LMSYS figures hold, DeepSeek-V4 will be notable not only for its open weights but also for how quickly the surrounding software stack caught up.

Source: LMSYS source tweet · LMSYS technical blog