Reddit Spots an Open-Source DFlash Runtime That Pushes Qwen3.5 to 4x Speeds on Apple Silicon

Original: DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

LLM · Apr 14, 2026 · By Insights AI (Reddit) · 2 min read

Why Reddit took this seriously

This did not explode into a giant meme thread, but it landed well with the part of LocalLLaMA that is usually skeptical of speedup claims. The author explicitly says they rewrote the benchmark methodology, fixed numerical issues, and then open-sourced the whole implementation instead of leaning on earlier, more flattering numbers. That matters in this community. Local LLM users are used to screenshots that compare against weak baselines or custom loops nobody else actually runs. Here, the post says the rewritten benchmark measures against stock mlx_lm.stream_generate. At crawl time, the thread had 105 points and 36 comments, and the early response framed it as one of the more credible speculative-decoding implementations currently circulating for dense Qwen3.5 on Apple Silicon.

What the runtime actually does

According to the repo, dflash-mlx implements the 2026 DFlash speculative-decoding approach on top of MLX. A draft model generates 16 tokens in parallel using block diffusion, and the target model verifies the whole block in a single forward pass. The project describes the output as lossless: no token is emitted unless the target model has verified it before commit. The reported hardware is an Apple M5 Max with 64GB unified memory on MLX 0.31.1. The headline number for Qwen3.5-9B at 2048 generated tokens is a jump from 30.96 tok/s to 127.07 tok/s, a 4.1x speedup at 89.36% acceptance.
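The mechanics described here follow the standard lossless speculative-decoding contract. Below is a minimal toy sketch of that contract, not the dflash-mlx API: `verify_block` and `target_argmax` are hypothetical names, and the real system verifies against the target's sampling distribution inside MLX rather than greedy argmax over Python lists.

```python
# Toy sketch of a lossless speculative verify step (my assumption of the
# general scheme, NOT code from dflash-mlx). A draft proposes a block of
# K tokens; the target scores context + block in one forward pass; only
# the longest prefix the target agrees with is committed.

from typing import Callable, List

def verify_block(
    draft_tokens: List[int],
    target_argmax: Callable[[List[int]], List[int]],
    context: List[int],
) -> List[int]:
    """Return the tokens committed in one speculative cycle.

    target_argmax(seq) is assumed to return, for each position of seq,
    the token the target model would emit next at that position
    (i.e. one forward pass over the whole block). context is non-empty.
    """
    preds = target_argmax(context + draft_tokens)
    committed: List[int] = []
    for i, tok in enumerate(draft_tokens):
        # The target's prediction for this slot comes from the previous position.
        target_tok = preds[len(context) + i - 1]
        if target_tok == tok:
            committed.append(tok)          # target agrees with the draft
        else:
            committed.append(target_tok)   # first mismatch: take the target's
            break                          # own token, discard the rest
    else:
        # Entire block verified; also take the target's bonus token.
        committed.append(preds[-1])
    return committed
```

The lossless guarantee lives in the comparison: every emitted token is either a draft token the target agreed with, or the target's own prediction at the first disagreement, so the output matches what the target alone would have produced.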

Where the gains come from

One of the strongest details in both the post and the README is that the author does not claim magic from custom kernels alone. In fact, they say many obvious low-level attempts came back slower than stock MLX because Apple Silicon local inference is heavily bandwidth-bound. The important wins came from tape-replay rollback, a JIT 2-pass SDPA path for longer contexts, and the numerical work needed to keep speculative verify cycles stable. That is why the acceptance rate stays near 89% over long generations instead of collapsing. This is more interesting than a benchmark screenshot because it identifies the practical bottleneck: keeping verification and rollback coherent enough that speculative decoding stays useful at real sequence lengths.
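The acceptance figure can be sanity-checked with simple arithmetic. Assuming (my reading, not stated explicitly in the post) that the 89.36% acceptance is the fraction of each 16-token draft block the target verifies, each verify pass commits roughly:

```python
# Back-of-envelope model (an assumption, not from the repo): each target
# forward pass commits about acceptance * block_size draft tokens, plus
# one token of the target's own at the cut point, versus exactly one
# token per pass in plain autoregressive decoding.

def expected_tokens_per_verify(block_size: int, acceptance: float) -> float:
    """Mean tokens committed per target forward pass under this model."""
    return acceptance * block_size + 1.0

per_pass = expected_tokens_per_verify(16, 0.8936)
print(f"~{per_pass:.1f} tokens committed per verify pass")
```

Since ~15.3 tokens per pass would be the ceiling if drafting and verification were free, the reported ~4.1x implies the block-diffusion draft plus verify overhead costs on the order of 3.7 stock forward passes per cycle, consistent with the author's point that the workload is bandwidth-bound rather than kernel-bound.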

Why the thread saw it as useful, not just clever

The practical appeal is obvious for Mac-based local inference. If the numbers hold, a stock-MLX workflow around Qwen models becomes much more usable without a forked runtime or a proprietary serving stack. The repo is also careful about where the gains shrink. For example, the README says Qwen3.5-27B-4bit sees smaller speedups because the quantized target is already fast enough that the bf16 draft becomes part of the bottleneck. That kind of caveat helped the post feel more believable. LocalLLaMA reacted to this as an engineering improvement that could actually change day-to-day throughput on Apple hardware, not as another overfitted benchmark brag.
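The 27B-4bit caveat also falls out of a toy cost model. In the sketch below, all timings and the committed-tokens figure are illustrative assumptions, not measurements from the repo: speedup is committed tokens per cycle times the stock per-token cost, divided by the draft-plus-verify cost of one cycle, so shrinking the target's cost while the bf16 draft's cost stays fixed erodes the win.

```python
# Toy cost model (my assumption) for why a fast 4-bit target shrinks
# the speculative-decoding win. All numbers are made up for illustration.

def spec_speedup(committed: float, t_target: float,
                 t_draft_block: float, t_verify: float) -> float:
    """Speculative tok/s over stock tok/s under this simplified model.

    t_target:      stock per-token forward-pass time of the target.
    t_draft_block: time for the draft to propose one 16-token block.
    t_verify:      time for the target to verify the block in one pass.
    """
    return committed * t_target / (t_draft_block + t_verify)

# committed=15.3 assumes nearly the full 16-token block is accepted.
# bf16 9B-style target: draft overhead is small relative to the target pass.
bf16_case = spec_speedup(committed=15.3, t_target=1.00,
                         t_draft_block=2.65, t_verify=1.05)

# 4-bit 27B-style target: the target pass gets ~3x cheaper, but the bf16
# draft block costs the same, so it now dominates the cycle.
q4_case = spec_speedup(committed=15.3, t_target=0.35,
                       t_draft_block=2.65, t_verify=0.37)
```

Under these invented timings the bf16 case lands near 4x while the quantized case drops below 2x, which matches the direction of the README's caveat even though the exact figures here are illustrative.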

Sources: dflash-mlx GitHub · DFlash paper · Reddit discussion

