r/LocalLLaMA tests lossless speculative decoding on Apple Silicon with DFlash and MLX
Original: DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
A fresh r/LocalLLaMA post, published on April 14, 2026 (KST), reports a native MLX implementation of DFlash for Apple Silicon. The author describes a lossless speculative decoding flow: a small draft model generates 16 tokens in parallel, and the target model verifies them in a single forward pass before committing them. The post also says earlier numerical issues were fixed, the benchmark methodology was rewritten, and the code is now open source in the dflash-mlx repository.
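The draft-then-verify loop described above can be sketched in plain Python. This is a toy illustration of lossless greedy speculative decoding, not the dflash-mlx API: the `draft_next` and `target_next` callables stand in for real model forward passes, and real implementations verify all positions in one batched pass rather than replaying them.

```python
def speculative_step(draft_next, target_next, context, k=16):
    """One draft-then-verify round of greedy speculative decoding.

    draft_next / target_next: callables mapping a token list to the
    greedy next token (toy stand-ins for model forward passes).
    Returns the tokens committed this round.
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Verify greedily: in practice the target scores every proposal
    #    position in a single batched forward pass; here we replay them.
    committed, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            committed.append(expected)  # first mismatch: commit the
            return committed            # target's correction and stop
        committed.append(tok)
        ctx.append(tok)
    committed.append(target_next(ctx))  # bonus token: all k accepted
    return committed
```

Because every committed token is exactly what the target would have chosen greedily, the output stream is bit-for-bit identical to plain greedy decoding, which is what "lossless" means here.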
The useful part of the write-up is that it does not stop at the headline speedup. The benchmark setup is explicit: M5 Max, 64GB, MLX 0.31.1, stock mlx_lm.stream_generate as the baseline, three runs, median reported, and a 10-second cooldown between runs. At 2048 output tokens, the reported numbers are concrete enough to evaluate rather than just admire.
| Model | Baseline | DFlash | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |
The interpretation is at least as important as the table. According to the post, Apple Silicon's unified memory makes the workload bandwidth-bound more than compute-bound. Attempts to outperform stock MLX with custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA came back slower, so the claimed gain mostly comes from numerical precision choices rather than exotic compute tricks. That is a much more useful engineering claim than a generic “4x faster” banner.
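The bandwidth-bound claim is easy to sanity-check with a roofline-style estimate: in single-stream decoding, every generated token streams the full weight set through memory once, so throughput is capped near bandwidth divided by model size. The bandwidth figure below is an illustrative assumption for the exercise, not a published M5 Max spec:

```python
# Roofline ceiling for bandwidth-bound autoregressive decoding:
#   tok/s <= memory_bandwidth / bytes_of_weights_read_per_token
# The 550 GB/s figure is an assumed, illustrative value.

def decode_ceiling(params_billion: float, bytes_per_param: float,
                   bandwidth_gb_s: float) -> float:
    weights_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weights_gb

# Hypothetical 9B bf16 model (~18 GB of weights) at an assumed ~550 GB/s:
print(round(decode_ceiling(9, 2.0, 550), 1))  # ceiling in tok/s
```

If the assumed figures are roughly right, a baseline around 31 tok/s for the bf16 9B model sits essentially at the bandwidth ceiling, which is consistent with the post's finding that faster compute kernels could not move the needle.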
The lower 1.90x result on Qwen3.5-27B-4bit is explained as a structural limit: once the quantized target is already fast, the bf16 draft model becomes the new bottleneck. The implementation is also tuned for Qwen3.5's hybrid GatedDeltaNet + attention architecture, while pure attention models such as Qwen3 and Gemma are said to work without the same tape-replay advantage. For local-LLM builders on Apple hardware, the post matters because it frames speculative decoding as an engineering trade-off problem about baselines, architecture fit, quantization, and memory bandwidth, not as a marketing slogan.
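The draft-as-bottleneck argument can be made concrete with a standard speculative decoding cost model. The formula and the relative cost ratios below are illustrative assumptions, not numbers taken from the post:

```python
# Expected speedup of draft-then-verify decoding over plain decoding,
# assuming each drafted token is accepted independently with rate a.
# E[tokens committed per round] = sum_{i=0..k} a^i (the accepted run,
# plus one corrected-or-bonus token).

def expected_speedup(k: int, a: float, c_draft: float, c_target: float) -> float:
    committed = (1 - a ** (k + 1)) / (1 - a)   # expected tokens per round
    round_cost = k * c_draft + c_target        # k draft steps + 1 verify pass
    baseline_cost = committed * c_target       # plain decoding of same tokens
    return baseline_cost / round_cost

# Expensive bf16 target: drafting is nearly free relative to verification.
fast_gain = expected_speedup(k=16, a=0.89, c_draft=0.02, c_target=1.0)
# Quantized target at ~30% of the cost: the same draft model now eats a
# much larger share of each round, and the payoff shrinks.
slow_gain = expected_speedup(k=16, a=0.89, c_draft=0.02, c_target=0.3)
```

At the roughly 89% acceptance rates in the table, shrinking only the target's per-token cost (as 4-bit quantization does) shrinks the speedup even though the draft and acceptance rate are unchanged, which is the shape of the 27B-4bit result.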
Related Articles
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.