LocalLLaMA Tests DFlash on Apple Silicon and Reports 2x-3x Faster Qwen Inference
Original: DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)
What LocalLLaMA tested
A LocalLLaMA post on April 11, 2026 described a native MLX implementation of DFlash, a speculative decoding method based on block diffusion, running on an M5 Max with 64GB of memory. The author says a small draft model generates 16 tokens in parallel, the target model verifies them in one forward pass, and the final output remains bit-for-bit identical to greedy baseline decoding. For local inference users, that exact-output claim matters because it positions the speedup as a systems optimization rather than a quality tradeoff.
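The bit-for-bit guarantee follows from how greedy speculative verification works: drafted tokens are accepted only while they match the target model's own argmax. A minimal sketch of that acceptance rule (illustrative only; the post's actual implementation is not public, and all names here are hypothetical):

```python
import numpy as np

def verify_block(target_logits, draft_tokens):
    """Greedy speculative verification.

    target_logits : (K+1, vocab) logits the target produced in ONE forward
                    pass over the K drafted positions plus one bonus slot.
    draft_tokens  : (K,) tokens proposed by the draft model.

    Accepts draft tokens only while they equal the target's own argmax,
    so the emitted sequence is bit-identical to plain greedy decoding.
    """
    target_argmax = target_logits.argmax(axis=-1)  # (K+1,)
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            break                                  # first mismatch: stop
        accepted.append(int(tok))
    # The target's own prediction at the first mismatch (or at the bonus
    # position if every draft token matched) is always emitted, so each
    # cycle advances by at least one token even with a useless draft.
    accepted.append(int(target_argmax[len(accepted)]))
    return accepted
```

With a block size of 16 as in the post, one target forward pass can thus commit anywhere from 1 to 17 tokens per cycle, and the speedup depends entirely on how often the draft agrees with the target.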
The posted numbers are strong enough to attract attention. On Qwen3.5-9B bf16, the report shows 85 tokens per second versus 26 tokens per second for baseline generation at 1024 tokens, and 80 versus 26 at 2048 tokens. On Qwen3.5-4B bf16, the author reports 109 versus 41 at 1024 and 133 versus 42 at 2048. Even on quantized Qwen3.5-27B, the post claims roughly 1.7x to 2.5x speedups depending on whether the target is 4-bit or 8-bit.
What changed under the hood
The thread is also useful because it explains what actually moved the numbers. The author says MLX needed a small head_dim=256 patch to unlock a faster attention path for Qwen3.5-9B, the runtime was restructured to cut GPU-to-CPU synchronization points per cycle from two to one, and separate QKV projections were packed into a single matmul plus split. Acceptance rates were reported around 80% to 87%.
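The QKV packing the author describes is a standard fusion: concatenate the three projection matrices once at load time, then replace three matmuls per step with one matmul and a split. A numpy sketch with made-up shapes (the post does not state the real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative shapes only -- not the actual Qwen3.5 dimensions.
d_model, d_proj, n_tokens = 256, 256, 16

x  = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
Wq = rng.standard_normal((d_model, d_proj)).astype(np.float32)
Wk = rng.standard_normal((d_model, d_proj)).astype(np.float32)
Wv = rng.standard_normal((d_model, d_proj)).astype(np.float32)

# Separate projections: three matmuls, three kernel dispatches per layer.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Packed variant: concatenate the weights once, then run ONE matmul per
# step and split the result -- same math, fewer dispatches.
Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)
q2, k2, v2 = np.split(x @ Wqkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The payoff on Apple Silicon is fewer dispatch and synchronization opportunities per layer, which lines up with the post's other fix of cutting GPU-to-CPU sync points per cycle from two to one.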
Just as interesting are the negative results. According to the post, custom Metal kernels for batched GEMV, fused gated SiLU, and SDPA ended up slower than stock MLX kernels on unified-memory Apple hardware. The author also says verification cost stayed nearly flat when increasing from 4 to 16 tokens, which suggests weight loading dominates more than token count in this environment. On quantized targets, the draft model can become the bottleneck rather than the verifier, flipping the usual speculative decoding intuition.
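A back-of-envelope cost model (illustrative timings, not numbers from the post) shows why quantizing the target can flip the bottleneck onto the draft: quantization shrinks verification time but leaves draft time untouched, so the draft's share of each cycle grows.

```python
def draft_share(t_draft_ms, t_verify_ms):
    """Fraction of each speculate-verify cycle spent in the draft model."""
    return t_draft_ms / (t_draft_ms + t_verify_ms)

# Hypothetical timings. With an expensive bf16 target, verification
# dominates and the draft is a small slice of the cycle.
bf16_share = draft_share(5.0, 40.0)   # ~0.11

# Quantizing the target cuts t_verify but not t_draft, so the draft's
# share roughly triples here and can become the limiting factor.
quant_share = draft_share(5.0, 12.0)  # ~0.29
```

This is the inversion the post describes: on quantized Qwen3.5-27B the cheap verifier leaves the draft model as the slowest stage of the loop.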
Why this matters for local inference
The broader takeaway is that Apple Silicon optimization is starting to look like its own discipline rather than a smaller version of CUDA tuning. Techniques that make intuitive sense on discrete GPU stacks do not necessarily win once unified memory bandwidth, MLX kernels, and quantized verification paths shape the bottleneck. That makes firsthand community reports unusually valuable.
The post is still a work in progress and the author says the implementation is not open sourced yet. Even so, it offers a concrete picture of where the next wave of local LLM speedups may come from: exact speculative decoding, targeted runtime surgery, and a better understanding of when draft-versus-verify balance changes across model sizes and quantization levels.
Source links: Reddit thread, DFlash paper.
Related Articles
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.