Reddit Spots an Open-Source DFlash Runtime That Pushes Qwen3.5 to 4x Speeds on Apple Silicon
Original: DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Why Reddit took this seriously
This did not explode into a giant meme thread, but it landed well with the part of LocalLLaMA that is usually skeptical of speedup claims. The author explicitly says they rewrote the benchmark methodology, fixed numerical issues, and then open-sourced the whole implementation instead of leaning on earlier, more flattering numbers. That matters in this community: local LLM users are used to screenshots that compare against weak baselines or custom loops nobody else actually runs. Here, the post says the baseline is plain, stock mlx_lm.stream_generate rather than a custom loop. At crawl time, the thread had 105 points and 36 comments, and the early response framed it as one of the more credible speculative-decoding implementations currently circulating for dense Qwen3.5 on Apple Silicon.
What the runtime actually does
According to the repo, dflash-mlx implements the 2026 DFlash speculative-decoding approach on top of MLX. A draft model generates 16 tokens in parallel using block diffusion, and the target verifies those tokens in a single forward pass. The project describes the output as lossless, meaning no token is emitted unless it has been verified by the target model before commit. The reported hardware setup is Apple M5 Max with 64GB unified memory on MLX 0.31.1. The headline number for Qwen3.5-9B at 2048 generated tokens is a jump from 30.96 tok/s to 127.07 tok/s, a 4.13x speedup with 89.36% acceptance.
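The draft-then-verify cycle described above can be sketched in a toy form. This is not code from dflash-mlx, and the "models" here are deterministic stand-ins: a draft proposes a block of K tokens, the target checks them (in a real runtime, one batched forward pass), and only the longest verified prefix plus one corrected target token is committed, which is what makes the output identical to plain greedy decoding.

```python
K = 16  # draft block size, as reported in the post

def target_greedy_next(context):
    # Hypothetical stand-in for the target model's greedy next token:
    # a deterministic toy rule over the running context.
    return (sum(context) * 31 + 7) % 100

def draft_block(context, k=K):
    # Hypothetical draft model: it matches the target except at every
    # 5th position in the block, to force some rejections.
    ctx, block = list(context), []
    for i in range(k):
        tok = target_greedy_next(ctx)
        if i % 5 == 4:
            tok = (tok + 1) % 100  # deliberate disagreement
        block.append(tok)
        ctx.append(tok)
    return block

def verify_and_commit(context, block):
    # Commit the longest prefix the target agrees with; at the first
    # mismatch, commit the target's own token instead and stop. No
    # unverified token ever reaches the output, so decoding is lossless.
    ctx, committed = list(context), []
    for tok in block:
        expected = target_greedy_next(ctx)
        if tok != expected:
            committed.append(expected)
            ctx.append(expected)
            break
        committed.append(tok)
        ctx.append(tok)
    return ctx, committed

def speculative_generate(prompt, n_tokens):
    ctx, out = list(prompt), []
    while len(out) < n_tokens:
        ctx, committed = verify_and_commit(ctx, draft_block(ctx))
        out.extend(committed)
    return out[:n_tokens]
```

Running the toy loop against plain one-token-at-a-time greedy decoding produces the same token sequence, which is the "lossless" property the repo claims; the speedup comes from committing several tokens per target pass.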
Where the gains come from
One of the strongest details in both the post and the README is that the author does not attribute the speedup to custom kernels alone. In fact, they say many obvious low-level optimization attempts came back slower than stock MLX, because local inference on Apple Silicon is heavily bandwidth-bound. The important wins came from tape-replay rollback, a JIT 2-pass SDPA path for longer contexts, and the numerical work needed to keep speculative verify cycles stable. That is why the acceptance rate stays near 89% over long generations instead of collapsing. This is more interesting than a benchmark screenshot because it identifies the practical bottleneck: keeping verification and rollback coherent enough that speculative decoding stays useful at real sequence lengths.
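A back-of-envelope model shows why that acceptance rate is the whole game. This is not the paper's analysis: it assumes (hypothetically) that each drafted token is accepted independently with probability p, and that a verify cycle commits the longest accepted prefix plus one corrected token on a miss.

```python
def expected_committed(p: float, k: int) -> float:
    """Expected tokens committed per verify cycle of block size k,
    under the (assumed) independent-acceptance model."""
    total = 0.0
    for j in range(k):                   # first rejection at position j
        total += (j + 1) * (p ** j) * (1 - p)
    total += k * (p ** k)                # whole block accepted
    return total

# Plugging in the post's reported figures (one possible interpretation
# of "89.36% acceptance" -- the repo may define it differently):
p, k = 0.8936, 16
print(f"~{expected_committed(p, k):.1f} tokens per verify pass")
```

Under this assumption the model predicts roughly 8 committed tokens per target forward pass, versus 1 for plain decoding; the realized 4.13x is lower because the draft's block-diffusion passes and verification overhead are not free, and it degrades quickly as p drops, which is why the numerical work to hold acceptance near 89% matters more than kernel tricks.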
Why the thread saw it as useful, not just clever
The practical appeal is obvious for Mac-based local inference. If the numbers hold, a stock-MLX workflow around Qwen models becomes much more usable without a forked runtime or a proprietary serving stack. The repo is also careful about where the gains shrink. For example, the README says Qwen3.5-27B-4bit sees smaller speedups because the quantized target is already fast enough that the bf16 draft becomes part of the bottleneck. That kind of caveat helped the post feel more believable. LocalLLaMA reacted to this as an engineering improvement that could actually change day-to-day throughput on Apple hardware, not as another overfitted benchmark brag.
Sources: dflash-mlx GitHub · DFlash paper · Reddit discussion
Related Articles
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
A fresh r/LocalLLaMA post published DFlash benchmarking on M5 Max with MLX 0.31.1 and reported 127.07 tok/s and a 4.13x speedup on Qwen3.5-9B. The most useful part is not the headline number but the post’s clear reproduction setup and bandwidth-bound interpretation.
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.