Xiaomi’s 1T MiMo speed claim puts DFlash and GPU codesign under LocalLLaMA scrutiny

A LocalLLaMA post about Xiaomi’s MiMo-V2.5-Pro-UltraSpeed attracted attention because the headline number is unusually aggressive: more than 1000 tokens per second from a 1-trillion-parameter model. The more interesting question is how much of that result is a general technique and how much depends on Xiaomi’s hosted stack.

Xiaomi’s blog says the system was built with TileRT through model-system codesign. The company claims 1000+ tokens/s of output on a single standard 8-GPU commodity node rather than on specialized inference hardware. The core ingredients are selective FP4 quantization for MoE experts and DFlash speculative decoding with block-level masked parallel prediction.
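A rough budget calculation shows why the number is aggressive. The sketch below assumes decoding is limited by how fast weights can be read from GPU memory. The per-GPU bandwidth and the active-parameter count are placeholders chosen for illustration, not figures from Xiaomi or measurements.

```python
def bytes_budget_per_step(node_bandwidth_bytes_per_s, tokens_per_s, tokens_per_step=1.0):
    """Upper bound on bytes the node can read from memory in one decoding step.

    tokens_per_step > 1 models a decoder that emits several tokens per
    forward pass (for example, speculative decoding).
    """
    steps_per_s = tokens_per_s / tokens_per_step
    return node_bandwidth_bytes_per_s / steps_per_s


def weight_bytes(params, bits_per_weight):
    """Bytes needed to read `params` weights stored at the given bit width."""
    return params * bits_per_weight / 8


# Hypothetical inputs, for illustration only.
NODE_BANDWIDTH = 8 * 3.0e12   # assumed: 8 GPUs at 3 TB/s each
TARGET_TOKENS_PER_S = 1000    # the headline claim
ACTIVE_PARAMS = 40e9          # assumed active parameters per token (MoE)

budget = bytes_budget_per_step(NODE_BANDWIDTH, TARGET_TOKENS_PER_S)
fp16_read = weight_bytes(ACTIVE_PARAMS, 16)
fp4_read = weight_bytes(ACTIVE_PARAMS, 4)

print(f"budget per step: {budget / 1e9:.1f} GB")
print(f"16-bit weights:  {fp16_read / 1e9:.1f} GB")
print(f"4-bit weights:   {fp4_read / 1e9:.1f} GB")
```

Under these made-up inputs, one token per step leaves a budget of about 24 GB of weight reads. The arithmetic only illustrates why lower-precision experts and more tokens per step are the two obvious levers; it says nothing about what Xiaomi's system actually achieves.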

The FP4 piece targets memory bandwidth, one of the hard limits for trillion-parameter inference. Xiaomi says it does not quantize the whole model naively; it applies FP4 to expert modules that tolerate lower precision while preserving the precision of other components. DFlash is presented as a way to reduce the serial bottleneck in traditional speculative decoding by filling masked token blocks in parallel.
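For readers unfamiliar with speculative decoding, the verification step can be sketched in a few lines. This is the generic greedy accept-longest-prefix rule, not DFlash's algorithm, whose details are not described here. The function name and the token lists are hypothetical; a real system would get `target_tokens` from one forward pass of the large model over the drafted block.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy speculative verification (generic sketch, not DFlash itself).

    draft_tokens:  block of tokens proposed by a cheap drafter.
    target_tokens: the large model's greedy choice at each draft position,
                   plus one extra token after the block, so
                   len(target_tokens) == len(draft_tokens) + 1.
    Returns the tokens actually emitted this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)

    # Always emit one token from the target model: either the correction at
    # the first mismatch or the bonus token after a fully accepted block.
    accepted.append(target_tokens[len(accepted)])
    return accepted


# Draft agrees on the first two tokens, then diverges.
print(accept_draft([5, 7, 9, 2], [5, 7, 4, 8, 1]))  # [5, 7, 4]
# Draft fully accepted: block plus one bonus token.
print(accept_draft([5, 7], [5, 7, 3]))              # [5, 7, 3]
```

The output is the same as decoding with the large model alone, so the speedup depends entirely on how many drafted tokens survive per step. That is why a drafter that can propose a whole block at once, rather than token by token, is attractive.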

That is exactly the kind of claim LocalLLaMA tends to pressure-test, but independent testing is limited for now. The trial is application-based, time-limited, and resource-constrained, running from June 9 to June 23, 2026. The Reddit post notes that a DFlash model is available and that an open-source release has been promised, but practical verification will require code, kernels, weights, benchmark prompts, and measurement conditions.
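Measurement conditions matter because "tokens per second" can mean several things. The sketch below separates time to first token from steady-state decode rate for any iterable of streamed tokens. The `measure_stream` helper and the fake stream are hypothetical; they are not tied to Xiaomi's API or to any particular client library.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token stream and report basic latency and throughput figures."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")

    decode_time = last - first
    return {
        "tokens": count,
        "ttft_s": first - start,    # time to first token (includes prefill)
        "total_s": last - start,
        # Rate over tokens after the first, so prefill time is excluded.
        "decode_tokens_per_s": (count - 1) / decode_time if decode_time > 0 else None,
    }


def fake_stream(n_tokens, delay_s=0.001):
    """Stand-in for a real streaming response; yields placeholder tokens."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(20)))
```

A comparable report would also record prompt length, output length, batch size, and concurrency, since a single-stream decode rate and an aggregate rate across many requests are very different numbers.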

If the approach holds up, it matters beyond a speed demo. Local and self-hosted LLM users increasingly care about latency, throughput, and long-context economics as much as benchmark scores. Fast enough trillion-scale inference could change how agent loops, coding assistants, and multi-sample reasoning systems are designed.

Source: Xiaomi MiMo blog. Reddit discussion: r/LocalLLaMA.
