Articles

All AI LLM Humanoid Robots Sciences Gaming Finance

Source:

From To

LLM Reddit Jun 14, 2026 1 min read

Xiaomi’s 1T MiMo speed claim puts DFlash and GPU codesign under LocalLLaMA scrutiny

The LocalLLaMA angle is not just the 1000+ tps headline, but whether FP4, DFlash, and commodity GPU kernels can be reproduced outside Xiaomi’s hosted trial.

#xiaomi #mimo #inference

LLM Hacker News May 16, 2026 1 min read

Orthrus-Qwen3 Delivers 7.8× Faster Inference With Identical Output

The Orthrus framework achieves up to 7.8× tokens per forward pass on Qwen3 models while maintaining a provably identical output distribution to the original. Its dual-view architecture shares a single KV cache between autoregressive and diffusion pathways.

#inference #qwen3 #speculative-decoding

LLM Reddit May 6, 2026 1 min read

Qwen 3.6 27B Achieves 2.5x Faster Local Inference via MTP With 262k Context on 48GB

A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving 2.5x inference speedup and 262k context on 48GB of memory.

#qwen #mtp #local-llm

LLM Reddit May 6, 2026 1 min read

Google Releases Multi-Token Prediction Drafters for Gemma 4: Up to 3x Speedup

Google has released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, achieving up to 3x inference speedup through speculative decoding without any loss in output quality.

#gemma #google #mtp

LLM Reddit Apr 28, 2026 2 min read

LocalLLaMA likes Luce DFlash because the 3090 speedup looks practical

LocalLLaMA did not treat Luce DFlash as another benchmark screenshot. The post took off because it promised almost 2x mean throughput for Qwen3.6-27B on a single RTX 3090, with no retraining and enough memory engineering to keep long-context local inference practical.

#qwen #speculative-decoding #gguf

LLM Reddit Apr 14, 2026 2 min read

Reddit Spots an Open-Source DFlash Runtime That Pushes Qwen3.5 to 4x Speeds on Apple Silicon

LocalLLaMA paid attention to this post because it looked like real engineering cleanup instead of another inflated speed screenshot. On April 13, 2026, the author said a stock-MLX baseline for Qwen3.5-9B at 2048 tokens improved from 30.96 tok/s to 127.07 tok/s, with 89.36% acceptance and the full runtime released as open source.

#dflash #speculative-decoding #mlx

LLM Reddit Apr 13, 2026 2 min read

r/LocalLLaMA tests lossless speculative decoding on Apple Silicon with DFlash and MLX

A fresh r/LocalLLaMA post published DFlash benchmarking on M5 Max with MLX 0.31.1 and reported 127.07 tok/s and a 4.13x speedup on Qwen3.5-9B. The most useful part is not the headline number but the post’s clear reproduction setup and bandwidth-bound interpretation.

#mlx #apple-silicon #speculative-decoding

LLM Reddit Apr 12, 2026 2 min read

LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup

A new r/LocalLLaMA benchmark reports that Gemma 4 31B paired with an E2B draft model can gain about 29% average throughput, with code generation improving by roughly 50%.

#gemma-4 #speculative-decoding #llama-cpp

LLM Reddit Apr 11, 2026 2 min read

LocalLLaMA Tests DFlash on Apple Silicon and Reports 2x-3x Faster Qwen Inference

A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.

#apple-silicon #mlx #speculative-decoding

LLM Reddit Apr 7, 2026 2 min read

LocalLLaMA Flags DFlash as an Open-Source Route to Faster Speculative Decoding

A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.

#speculative-decoding #inference #vllm

LLM X/Twitter Apr 1, 2026 2 min read

Together Research releases Aurora for RL-based adaptive speculative decoding

Together Research said on March 31, 2026 that Aurora is an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together’s blog and paper say Aurora reframes the problem as asynchronous RL and can deliver 1.25x additional speedup over a strong static speculator as traffic shifts.

#together-ai #aurora #speculative-decoding

LLM Reddit Mar 21, 2026 3 min read

r/LocalLLaMA Spots Native MTP for Qwen3.5 in mlx-lm and Faster Single-Stream Inference

A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (~1.5x throughput boost) with ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.

#mlx-lm #qwen3.5 #mtp