r/LocalLLaMA tests lossless speculative decoding on Apple Silicon with DFlash and MLX

Original: DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

LLM · Apr 13, 2026 · By Insights AI (Reddit) · 2 min read

A fresh r/LocalLLaMA post, published on April 14, 2026 KST, reported a native MLX implementation of DFlash for Apple Silicon. The author describes a lossless speculative decoding flow in which a small draft model generates 16 tokens in parallel and the target model verifies them in a single forward pass before committing them. The post also says earlier numerical issues were fixed, the benchmark methodology was rewritten, and the code is now open source in the dflash-mlx repository.
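The draft-and-verify loop described above can be sketched in a few lines. This is an illustrative greedy-decoding version, not the dflash-mlx code; `draft_model`, `target_model`, and their signatures are stand-ins invented for the sketch:

```python
# Illustrative sketch of lossless (greedy) speculative decoding.
# Not the dflash-mlx implementation; the model callables are stand-ins.

def speculative_step(draft_model, target_model, prefix, block=16):
    """Draft `block` tokens, then verify them with one target pass.

    Assumed signatures (for this sketch only):
      draft_model(prefix, n) -> list of n proposed token ids
      target_model(seq)      -> greedy next-token id for every position
                                in `seq`, from one forward pass
    """
    proposal = draft_model(prefix, block)
    # One target forward pass over prefix + proposal scores every draft
    # position at once; only the last block+1 predictions are needed.
    target_next = target_model(prefix + proposal)[-(block + 1):]

    accepted = []
    for i, tok in enumerate(proposal):
        if target_next[i] == tok:            # target agrees: keep draft token
            accepted.append(tok)
        else:                                # first disagreement: commit the
            accepted.append(target_next[i])  # target's token and stop
            return prefix + accepted
    # Whole block accepted: the same target pass also yields one bonus token.
    accepted.append(target_next[block])
    return prefix + accepted
```

Because every committed token is either one the target agreed with or one the target itself produced, the output matches plain greedy decoding token for token, which is what makes the scheme lossless.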

The useful part of the write-up is that it does not stop at the headline speedup. The benchmark setup is explicit: M5 Max, 64GB, MLX 0.31.1, stock mlx_lm.stream_generate as the baseline, three runs, median reported, and a 10-second cooldown between runs. At 2048 output tokens, the reported numbers are concrete enough to evaluate rather than just admire.

| Model | Baseline | DFlash | Speedup | Acceptance |
| --- | --- | --- | --- | --- |
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |
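As a sanity check on the table, a simple geometric model estimates how many tokens one cycle commits if each of the 16 draft tokens is accepted independently with probability p. The independence assumption is ours, not the post's, and the post does not say exactly how its acceptance rate is defined:

```python
# Back-of-envelope check (our simplifying assumption, not from the post):
# each draft token is accepted independently with probability p.

def expected_tokens_per_cycle(p, block=16):
    # Probability the first i drafts are accepted and draft i is rejected
    # is p**i * (1 - p); on rejection the target still contributes a token,
    # and a fully accepted block yields one bonus token from the same pass.
    expected_accepted = sum(p**i * (1 - p) * i for i in range(block))
    expected_accepted += p**block * block      # whole block accepted
    return expected_accepted + 1               # + the target's own token

print(expected_tokens_per_cycle(0.893))
```

At roughly 89% acceptance this comes out near 8 tokens committed per target pass, which makes the ~4x wall-clock speedup plausible once draft-model overhead is paid.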

The interpretation is at least as important as the table. According to the post, Apple Silicon's unified memory makes the workload bandwidth-bound more than compute-bound. Attempts to outperform stock MLX with custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA came back slower, so the claimed gain mostly comes from numerical precision choices rather than exotic compute tricks. That is a much more useful engineering claim than a generic “4x faster” banner.

The lower 1.90x result on Qwen3.5-27B-4bit is explained as a structural limit: once the quantized target is already fast, the bf16 draft model becomes the new bottleneck. The implementation is also tuned for Qwen3.5's hybrid GatedDeltaNet + attention architecture, while pure attention models such as Qwen3 and Gemma are said to work without the same tape-replay advantage. For local-LLM builders on Apple hardware, the post matters because it frames speculative decoding as an engineering trade-off problem about baselines, architecture fit, quantization, and memory bandwidth, not as a marketing slogan.
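The 27B result can be made concrete with a toy cost model (ours, not the post's): each speculative cycle pays for `block` draft steps plus one target verify pass and commits some number of tokens, so as quantization makes the target cheaper, the fixed bf16 draft cost consumes a growing share of the cycle. All timing numbers below are made up for illustration:

```python
# Toy cost model, not from the post. Times are arbitrary illustrative
# units: t_target = one target forward pass, t_draft = one draft step.

def speedup(t_target, t_draft, n_commit, block=16):
    # Baseline commits 1 token per target pass; speculative decoding
    # commits n_commit tokens per (block drafts + one verify) cycle.
    baseline_time_per_token = t_target
    spec_time_per_token = (block * t_draft + t_target) / n_commit
    return baseline_time_per_token / spec_time_per_token

# Expensive bf16 target: draft cost is a small fraction of the cycle.
big_target = speedup(t_target=32.0, t_draft=0.5, n_commit=8)
# A ~3x cheaper quantized target with the same bf16 draft: the draft
# is now a much larger fraction of each cycle, so the multiple shrinks.
quant_target = speedup(t_target=11.0, t_draft=0.5, n_commit=8)
```

Under this model the cheaper target always yields the smaller speedup multiple for a fixed draft cost, which is exactly the structural limit the post describes.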
