LocalLLaMA Tests DFlash on Apple Silicon and Reports 2x-3x Faster Qwen Inference
Original: DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)
What LocalLLaMA tested
A LocalLLaMA post on April 11, 2026 described a native MLX implementation of DFlash, a speculative decoding method based on block diffusion, running on an M5 Max with 64GB of memory. The author says a small draft model generates 16 tokens in parallel, the target model verifies them in one forward pass, and the final output remains bit-for-bit identical to greedy baseline decoding. For local inference users, that exact-output claim matters because it positions the speedup as a systems optimization rather than a quality tradeoff.
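The bit-for-bit guarantee follows from how greedy speculative verification works: drafted tokens are accepted only while they match the target model's own argmax. A minimal sketch of that acceptance rule (illustrative only; the post's actual implementation is not public, and all names here are hypothetical):

```python
import numpy as np

def verify_block(target_logits, draft_tokens):
    """Greedy speculative verification.

    target_logits : (K+1, vocab) logits the target produced in ONE forward
                    pass over the K drafted positions plus one bonus slot.
    draft_tokens  : (K,) tokens proposed by the draft model.

    Accepts draft tokens only while they equal the target's own argmax,
    so the emitted sequence is bit-identical to plain greedy decoding.
    """
    target_argmax = target_logits.argmax(axis=-1)  # (K+1,)
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            break                                  # first mismatch: stop
        accepted.append(int(tok))
    # The target's own prediction at the first mismatch (or at the bonus
    # position if every draft token matched) is always emitted, so each
    # cycle advances by at least one token even with a useless draft.
    accepted.append(int(target_argmax[len(accepted)]))
    return accepted
```

With a block size of 16 as in the post, one target forward pass can thus commit anywhere from 1 to 17 tokens per cycle, and the speedup depends entirely on how often the draft agrees with the target.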
The posted numbers are strong enough to attract attention. On Qwen3.5-9B bf16, the report shows 85 tokens per second versus 26 tokens per second for baseline generation at 1024 tokens, and 80 versus 26 at 2048 tokens. On Qwen3.5-4B bf16, the author reports 109 versus 41 at 1024 and 133 versus 42 at 2048. Even on quantized Qwen3.5-27B, the post claims roughly 1.7x to 2.5x speedups depending on whether the target is 4-bit or 8-bit.
What changed under the hood
The thread is also useful because it explains what actually moved the numbers. The author says MLX needed a small head_dim=256 patch to unlock a faster attention path for Qwen3.5-9B, the runtime was restructured to cut GPU-to-CPU synchronization points per cycle from two to one, and separate QKV projections were packed into a single matmul plus split. Acceptance rates were reported around 80% to 87%.
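The QKV packing the author describes is a standard fusion: concatenate the three projection matrices once at load time, then replace three matmuls per step with one matmul and a split. A numpy sketch with made-up shapes (the post does not state the real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative shapes only -- not the actual Qwen3.5 dimensions.
d_model, d_proj, n_tokens = 256, 256, 16

x  = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
Wq = rng.standard_normal((d_model, d_proj)).astype(np.float32)
Wk = rng.standard_normal((d_model, d_proj)).astype(np.float32)
Wv = rng.standard_normal((d_model, d_proj)).astype(np.float32)

# Separate projections: three matmuls, three kernel dispatches per layer.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Packed variant: concatenate the weights once, then run ONE matmul per
# step and split the result -- same math, fewer dispatches.
Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)
q2, k2, v2 = np.split(x @ Wqkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The payoff on Apple Silicon is fewer dispatch and synchronization opportunities per layer, which lines up with the post's other fix of cutting GPU-to-CPU sync points per cycle from two to one.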
Just as interesting are the negative results. According to the post, custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA ended up slower than stock MLX kernels on unified-memory Apple hardware. The author also says verification cost stayed nearly flat when increasing from 4 to 16 tokens, which suggests weight loading dominates more than token count in this environment. On quantized targets, the draft model can become the bottleneck rather than the verifier, flipping the usual speculative decoding intuition.
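A back-of-envelope cost model (illustrative timings, not numbers from the post) shows why quantizing the target can flip the bottleneck onto the draft: quantization shrinks verification time but leaves draft time untouched, so the draft's share of each cycle grows.

```python
def draft_share(t_draft_ms, t_verify_ms):
    """Fraction of each speculate-verify cycle spent in the draft model."""
    return t_draft_ms / (t_draft_ms + t_verify_ms)

# Hypothetical timings. With an expensive bf16 target, verification
# dominates and the draft is a small slice of the cycle.
bf16_share = draft_share(5.0, 40.0)   # ~0.11

# Quantizing the target cuts t_verify but not t_draft, so the draft's
# share roughly triples here and can become the limiting factor.
quant_share = draft_share(5.0, 12.0)  # ~0.29
```

This is the inversion the post describes: on quantized Qwen3.5-27B the cheap verifier leaves the draft model as the slowest stage of the loop.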
Why this matters for local inference
The broader takeaway is that Apple Silicon optimization is starting to look like its own discipline rather than a smaller version of CUDA tuning. Techniques that make intuitive sense on discrete GPU stacks do not necessarily win once unified memory bandwidth, MLX kernels, and quantized verification paths shape the bottleneck. That makes firsthand community reports unusually valuable.
The post is still a work in progress and the author says the implementation is not open sourced yet. Even so, it offers a concrete picture of where the next wave of local LLM speedups may come from: exact speculative decoding, targeted runtime surgery, and a better understanding of when draft-versus-verify balance changes across model sizes and quantization levels.
Source links: Reddit thread, DFlash paper.
Related Articles
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.