Xiaomi’s 1T MiMo speed claim puts DFlash and GPU codesign under LocalLLaMA scrutiny

Original: Xiaomi is now serving MiMo V2.5 at 1000-3000tps using DFlash & Persistent kernel. DFLash model is out, open-source release promised coming soon

LLM Jun 14, 2026 By Insights AI (Reddit)

A LocalLLaMA post about Xiaomi’s MiMo-V2.5-Pro-UltraSpeed attracted attention because the headline number is unusually aggressive: more than 1000 tokens per second from a 1-trillion-parameter model. The more interesting question is how much of that result is a general technique and how much depends on Xiaomi’s hosted stack.

Xiaomi’s blog says the system was built with TileRT through model-system codesign. Rather than relying on specialized inference hardware, the company claims 1000+ tokens/s output on a single standard 8-GPU commodity node. The core ingredients are selective FP4 quantization for MoE experts and DFlash speculative decoding with block-level masked parallel prediction.
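A rough way to see why weight precision matters for a tokens-per-second headline is a bandwidth ceiling: if each decoded token has to read every active weight once, throughput cannot exceed node bandwidth divided by bytes read per token. The sketch below is illustrative arithmetic only. The active-parameter count and per-GPU bandwidth are placeholder assumptions, not published MiMo or hardware figures, and the function name `decode_ceiling_tps` is hypothetical. It also treats all active weights as one precision, which is simpler than the selective scheme Xiaomi describes.

```python
def decode_ceiling_tps(active_params, bytes_per_param, node_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Ignores KV-cache traffic, quantization scale overhead, compute time,
    and inter-GPU communication, so real throughput will be lower.
    """
    bytes_per_token = active_params * bytes_per_param
    return node_bandwidth_bytes_per_s / bytes_per_token


# Placeholder inputs chosen for round numbers (assumptions, not MiMo specs).
ACTIVE_PARAMS = 40e9         # parameters touched per token in a sparse MoE
NODE_BANDWIDTH = 8 * 3.0e12  # 8 GPUs at an assumed 3 TB/s each

for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    tps = decode_ceiling_tps(ACTIVE_PARAMS, bytes_per_param, NODE_BANDWIDTH)
    print(f"{name}: {tps:.0f} tokens/s ceiling")
```

With these placeholder inputs the ceiling scales inversely with bytes per parameter, so halving precision doubles the bound. That says nothing about Xiaomi's actual numbers; it only shows why lower-precision weights and speculative decoding are natural levers when weight reads are the bottleneck.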

The FP4 piece targets memory bandwidth, one of the hard limits for trillion-parameter inference. Xiaomi says it does not quantize the whole model naively; it focuses on expert modules that tolerate lower precision while preserving other components. DFlash is presented as a way to reduce the serial bottleneck in traditional speculative decoding by filling masked token blocks in parallel.
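The draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a generic greedy speculative decoding loop with toy stand-in models, not DFlash itself: the `draft_block` callable is a hypothetical placeholder for a drafter that proposes a whole block in one parallel pass, and the verification step calls the target once per position for readability, where a real system would score the block in a single forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_decode(
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop. Returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        rounds += 1
        proposal = draft_block(out, block_size)  # whole block proposed at once
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # Every drafted token matched, so the target adds one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], rounds


# Toy models (assumptions for illustration): the target counts upward mod 10.
def toy_target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


# The toy drafter follows the same rule but is deliberately wrong on token 7.
def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    block, last = [], ctx[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(last if last != 7 else 0)
    return block


tokens, rounds = speculative_decode(toy_target_next, toy_draft_block, [0], 12)
print(tokens, rounds)
```

Because every accepted token is checked against the target, the output is identical to plain greedy decoding with the target alone. The gain comes from needing fewer verification rounds than generated tokens, which is why drafter accuracy and the cost of drafting a block decide how much speedup survives in practice.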

That is exactly the kind of claim LocalLLaMA tends to pressure-test, and independent testing is limited for now: Xiaomi’s hosted trial is application-based, time-limited, and resource-constrained, running from June 9 to June 23, 2026. The Reddit post says a DFlash model is already available and that an open-source release has been promised, but practical verification will require code, kernels, weights, benchmark prompts, and measurement conditions.
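Measurement conditions matter because a single tokens-per-second figure can hide very different things. As a minimal sketch, assuming a tester records the request start time and the arrival time of each streamed output token, the helper below separates time to first token from steady decode throughput. The function name `throughput_report` and the synthetic timestamps are hypothetical, and nothing here reflects how Xiaomi measured its own numbers.

```python
def throughput_report(request_start, token_times):
    """Summarize a streamed response.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each output token, in seconds,
                   taken from the same monotonic clock.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens")
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    return {
        "ttft_s": ttft,
        # Tokens after the first, divided by the time spent producing them.
        "decode_tps": (len(token_times) - 1) / decode_time,
        # All tokens divided by total wall time, including prefill and queueing.
        "end_to_end_tps": len(token_times) / (token_times[-1] - request_start),
    }


# Synthetic example: 0.5 s to first token, then one token per millisecond.
times = [0.5 + i * 0.001 for i in range(1001)]
report = throughput_report(0.0, times)
print(report)
```

In this synthetic case the decode rate is about 1000 tokens/s while the end-to-end rate is noticeably lower, which is why prompt length, output length, batch size, and whether the first-token delay is counted all need to be stated alongside any benchmark.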

If the approach holds up, it matters beyond a speed demo. Local and self-hosted LLM users increasingly care about latency, throughput, and long-context economics as much as benchmark scores. Fast enough trillion-scale inference could change how agent loops, coding assistants, and multi-sample reasoning systems are designed.

Source: Xiaomi MiMo blog. Reddit discussion: r/LocalLLaMA.
