LocalLLaMA Tests DFlash on Apple Silicon and Reports 2x-3x Faster Qwen Inference

Original: DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

LLM · Apr 11, 2026 · By Insights AI (Reddit)

What LocalLLaMA tested

A LocalLLaMA post on April 11, 2026 described a native MLX implementation of DFlash, a speculative decoding method based on block diffusion, running on an M5 Max with 64GB of memory. The author says a small draft model generates 16 tokens in parallel, the target model verifies them in one forward pass, and the final output remains bit-for-bit identical to greedy baseline decoding. For local inference users, that exact-output claim matters because it positions the speedup as a systems optimization rather than a quality tradeoff.
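
The exactness guarantee comes from the verification rule itself, not from the drafter. Under greedy decoding, the target only ever keeps drafted tokens that match its own argmax, so the emitted sequence is what the target alone would have produced. A minimal sketch of that rule (generic greedy verification, not DFlash's block-diffusion drafter; the toy vocabulary and logits are invented for illustration):

```python
import numpy as np

def verify_greedy(target_logits, draft_tokens):
    """Accept the longest prefix of draft_tokens matching the target
    model's greedy (argmax) choice at each drafted position.

    target_logits: (k, vocab) logits from one target forward pass
                   over the k drafted positions.
    draft_tokens:  k token ids proposed by the draft model.
    Returns the accepted tokens; on the first mismatch the target's
    own choice is emitted instead, so every cycle makes progress.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(target_logits[i]))
        if tok == target_choice:
            accepted.append(tok)            # draft agreed with greedy target
        else:
            accepted.append(target_choice)  # correct the first mismatch
            return accepted
    return accepted  # all k drafted tokens accepted

# Toy check: vocab of 5, k = 3 drafted positions.
logits = np.array([[0, 9, 0, 0, 0],    # target argmax = 1
                   [0, 0, 9, 0, 0],    # target argmax = 2
                   [9, 0, 0, 0, 0]])   # target argmax = 0
print(verify_greedy(logits, [1, 2, 4]))  # → [1, 2, 0]
```

Because the output path runs entirely through the target's argmax, the draft model can only change how fast tokens appear, never which tokens appear.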

The posted numbers are strong enough to attract attention. On Qwen3.5-9B bf16, the report shows 85 tokens per second versus 26 tokens per second for baseline generation at 1024 tokens, and 80 versus 26 at 2048 tokens. On Qwen3.5-4B bf16, the author reports 109 versus 41 at 1024 and 133 versus 42 at 2048. Even on quantized Qwen3.5-27B, the post claims roughly 1.7x to 2.5x speedups depending on whether the target is 4-bit or 8-bit.
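
The headline multiples follow directly from those figures. A quick check of the ratios, using only the tok/s numbers quoted above:

```python
# Speedup implied by each reported pair (speculative tok/s, baseline tok/s).
reported = {
    "Qwen3.5-9B bf16 @1024": (85, 26),
    "Qwen3.5-9B bf16 @2048": (80, 26),
    "Qwen3.5-4B bf16 @1024": (109, 41),
    "Qwen3.5-4B bf16 @2048": (133, 42),
}
for config, (spec, base) in reported.items():
    print(f"{config}: {spec / base:.2f}x")
```

The 9B numbers work out to roughly 3.1x to 3.3x and the 4B numbers to roughly 2.7x to 3.2x, consistent with the 3.3x figure in the post title.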

What changed under the hood

The thread is also useful because it explains what actually moved the numbers. The author says MLX needed a small head_dim=256 patch to unlock a faster attention path for Qwen3.5-9B, the runtime was restructured to cut GPU-to-CPU synchronization points per cycle from two to one, and separate QKV projections were packed into a single matmul plus split. Acceptance rates were reported around 80% to 87%.
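
The QKV packing is a standard fusion: concatenating the three projection weights lets one large matmul replace three smaller kernel launches, and a cheap split recovers Q, K, and V unchanged. A toy NumPy sketch of the idea (the real change lives in MLX, and these dimensions are illustrative, not Qwen's actual config):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim = 64, 4, 16   # toy sizes for illustration
x  = rng.standard_normal((8, d_model))   # 8 tokens of hidden states
Wq = rng.standard_normal((d_model, n_heads * head_dim))
Wk = rng.standard_normal((d_model, n_heads * head_dim))
Wv = rng.standard_normal((d_model, n_heads * head_dim))

# Separate projections: three matmuls, three kernel launches.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Packed: concatenate the weights once, then one matmul plus a split.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)
q2, k2, v2 = np.split(x @ W_qkv, 3, axis=1)

# The fusion is exact: same Q, K, V, fewer launches.
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

On launch-latency-sensitive runtimes, cutting per-cycle kernel launches and host synchronization points this way can matter more than raw FLOP counts.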

Just as interesting are the negative results. According to the post, custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA all ended up slower than stock MLX kernels on unified-memory Apple hardware. The author also says verification cost stayed nearly flat when the block size grew from 4 to 16 tokens, suggesting that loading the target model's weights, not the token count, dominates verification time in this environment. On quantized targets, the draft model can become the bottleneck rather than the verifier, flipping the usual speculative decoding intuition.
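
The reported 80% to 87% acceptance rates can be turned into a rough tokens-per-cycle estimate. This is a back-of-the-envelope sketch under a simplifying assumption (each drafted token is accepted independently with a fixed probability), not a model of DFlash's actual block-diffusion behavior:

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per draft/verify cycle, assuming each of
    the k drafted tokens is accepted independently with probability
    alpha. A rejection still emits the target's correction, so the
    geometric sum runs to k + 1 terms: (1 - alpha**(k+1)) / (1 - alpha).
    """
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.80, 0.87):
    print(f"alpha={alpha}: {expected_tokens_per_cycle(alpha, 16):.1f} tokens/cycle")
```

Under this toy model, 16-token blocks at those acceptance rates yield on the order of 5 to 7 tokens per verification pass, which is the regime where near-flat verification cost translates into the observed multiples.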

Why this matters for local inference

The broader takeaway is that Apple Silicon optimization is starting to look like its own discipline rather than a smaller version of CUDA tuning. Techniques that make intuitive sense on discrete GPU stacks do not necessarily win once unified memory bandwidth, MLX kernels, and quantized verification paths shape the bottleneck. That makes firsthand community reports unusually valuable.

The post is still a work in progress and the author says the implementation is not open-sourced yet. Even so, it offers a concrete picture of where the next wave of local LLM speedups may come from: exact speculative decoding, targeted runtime surgery, and a better understanding of how the draft-versus-verify balance shifts across model sizes and quantization levels.

Source links: Reddit thread, DFlash paper.
