An r/LocalLLaMA post published on April 14, 2026 (KST) reported a native MLX implementation of DFlash for Apple Silicon. The author describes a lossless speculative decoding flow in which a small draft model generates 16 tokens in parallel and the target model verifies them in a single forward pass, committing only the tokens that pass verification. The post also says earlier numerical issues were fixed, the benchmark methodology was rewritten, and the code is now open source in the dflash-mlx repository.
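
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the dflash-mlx code: the "models" are hypothetical pure functions that return a greedy next token, drafting is sequential rather than parallel, and verification is done position by position instead of in one forward pass. It only shows why the output is identical to plain greedy decoding with the target.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, block_size: int = 16) -> List[int]:
    """Run one draft-then-verify cycle and return the tokens committed."""
    # Draft proposes block_size tokens (sequentially here, for simplicity).
    draft: List[int] = []
    for _ in range(block_size):
        draft.append(draft_next(prefix + draft))

    # Verify: keep drafted tokens while they match the target's own choice.
    committed: List[int] = []
    for tok in draft:
        expected = target_next(prefix + committed)
        if tok != expected:
            # First mismatch: commit the target's token and end the cycle.
            committed.append(expected)
            return committed
        committed.append(tok)
    return committed


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new_tokens: int, block_size: int = 16) -> List[int]:
    """Generate max_new_tokens tokens using repeated speculative cycles."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, block_size))
    return out[len(prompt):len(prompt) + max_new_tokens]


def greedy(prompt: List[int], target_next: NextToken, n: int) -> List[int]:
    """Reference: plain greedy decoding with the target alone."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]
```

Every committed token is the one the target would have chosen for that prefix, so a weaker draft model lowers throughput but never changes the output.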

The useful part of the write-up is that it does not stop at the headline speedup. The benchmark setup is explicit: M5 Max, 64GB, MLX 0.31.1, stock mlx_lm.stream_generate as the baseline, three runs, median reported, and a 10-second cooldown between runs. At 2048 output tokens, the reported numbers are concrete enough to evaluate rather than just admire.
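
The measurement protocol (three runs, median, cooldown between runs) can be sketched as a small harness. This is an illustration only, not the author's benchmark script: `generate` is a hypothetical callable that produces up to `max_tokens` tokens and returns how many it produced, and the `sleep` and `clock` parameters exist only so the harness can be exercised without real waiting.

```python
import statistics
import time
from typing import Callable


def median_tokens_per_second(generate: Callable[[int], int],
                             max_tokens: int = 2048,
                             runs: int = 3,
                             cooldown_s: float = 10.0,
                             sleep: Callable[[float], None] = time.sleep,
                             clock: Callable[[], float] = time.perf_counter) -> float:
    """Time several generation runs and return the median tokens/second."""
    rates = []
    for i in range(runs):
        start = clock()
        produced = generate(max_tokens)  # hypothetical: returns token count
        elapsed = clock() - start
        rates.append(produced / elapsed)
        if i < runs - 1:
            sleep(cooldown_s)  # cooldown between runs, skipped after the last one
    return statistics.median(rates)
```

Reporting the median rather than the mean keeps a single slow or fast run from skewing the result, and the cooldown gives the machine time to settle between runs.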

Model	Baseline	DFlash	Speedup	Acceptance
Qwen3.5-4B	53.74 tok/s	219.83 tok/s	4.10x	89.3%
Qwen3.5-9B	30.96 tok/s	127.07 tok/s	4.13x	89.4%
Qwen3.5-27B-4bit	32.35 tok/s	62.78 tok/s	1.90x	89.1%
Qwen3.5-35B-A3B-4bit	142.12 tok/s	240.21 tok/s	1.69x	88.7%
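
As a quick sanity check, the speedup can be recomputed from the two throughput columns. The sketch below uses only the numbers in the table. The recomputed ratios land within about 0.05x of the reported Speedup column but do not match it exactly in every row; one possible explanation, which is an assumption and not something the post states, is that speedup was computed per run before taking the median.

```python
# (model, baseline tok/s, DFlash tok/s, reported speedup) copied from the table.
ROWS = [
    ("Qwen3.5-4B", 53.74, 219.83, 4.10),
    ("Qwen3.5-9B", 30.96, 127.07, 4.13),
    ("Qwen3.5-27B-4bit", 32.35, 62.78, 1.90),
    ("Qwen3.5-35B-A3B-4bit", 142.12, 240.21, 1.69),
]


def throughput_ratio(baseline: float, dflash: float) -> float:
    """Ratio of the two reported median throughputs."""
    return dflash / baseline


for name, baseline, dflash, reported in ROWS:
    ratio = throughput_ratio(baseline, dflash)
    print(f"{name}: computed {ratio:.2f}x, reported {reported:.2f}x")
```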

The interpretation is at least as important as the table. According to the post, Apple Silicon's unified memory makes the workload memory-bandwidth-bound rather than compute-bound. Attempts to outperform stock MLX with custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA came back slower, and the post attributes its gains mostly to numerical precision choices rather than exotic compute tricks. That is a much more useful engineering claim than a generic “4x faster” banner.

The lower 1.90x result on Qwen3.5-27B-4bit is explained as a structural limit: once the quantized target is already fast, the bf16 draft model becomes the new bottleneck. The implementation is also tuned for Qwen3.5's hybrid GatedDeltaNet + attention architecture, while pure attention models such as Qwen3 and Gemma are said to work without the same tape-replay advantage. For local-LLM builders on Apple hardware, the post matters because it frames speculative decoding as an engineering trade-off problem about baselines, architecture fit, quantization, and memory bandwidth, not as a marketing slogan.
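
The bottleneck argument can be made concrete with a back-of-envelope cost model. This model is not from the post, and every number in it is hypothetical. It assumes that one verification pass costs about as much as one ordinary target decode step (plausible if decoding is bandwidth-bound) and that each cycle commits a fixed average number of tokens.

```python
def estimated_speedup(tokens_per_cycle: float,
                      draft_ms: float,
                      target_step_ms: float) -> float:
    """Speedup of draft-then-verify over plain decoding under a toy cost model.

    Assumptions (hypothetical, not measured):
      - plain decoding costs target_step_ms per token;
      - one cycle costs draft_ms + target_step_ms and yields tokens_per_cycle tokens.
    """
    baseline_ms = tokens_per_cycle * target_step_ms
    speculative_ms = draft_ms + target_step_ms
    return baseline_ms / speculative_ms


# Same draft cost, two hypothetical targets: a slower one and a faster (quantized) one.
slow_target = estimated_speedup(tokens_per_cycle=8, draft_ms=10.0, target_step_ms=30.0)
fast_target = estimated_speedup(tokens_per_cycle=8, draft_ms=10.0, target_step_ms=10.0)
print(f"slow target: {slow_target:.1f}x, fast target: {fast_target:.1f}x")
```

With the draft cost held fixed, making the target step cheaper shrinks the estimated speedup, because the draft accounts for a larger share of each cycle. That is the shape of the limit the post describes for the 4-bit target, though the real figures depend on details this toy model ignores.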

r/LocalLLaMA tests lossless speculative decoding on Apple Silicon with DFlash and MLX