r/LocalLLaMA Spots Native MTP for Qwen3.5 in mlx-lm and Faster Single-Stream Inference
Original: Multi-Token Prediction (MTP) for qwen-3.5 is coming to mlx-lm
What surfaced in r/LocalLLaMA
A Reddit post in r/LocalLLaMA with 99 points and 17 comments at capture time highlighted an open change to mlx-lm: PR #990, titled feat: native MTP speculative decoding for Qwen3.5. The thread mattered because it translated a code-level change into numbers that local inference operators immediately understand. The headline benchmark was 15.3 -> 23.3 tok/s (~1.5x throughput boost) with ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. For people running Apple Silicon setups, that is not just an implementation curiosity. It suggests that a runtime feature, rather than a model replacement, may materially reduce latency for interactive generation.
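The headline numbers can be cross-checked with a little arithmetic. The sketch below is not taken from the PR or the thread: it assumes one draft token per verification step and treats the reported acceptance rate as the probability that a draft survives verification. The function names are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per verify pass.

    ASSUMPTION: drafts are accepted independently with the same probability,
    and drafting stops at the first rejection.
    """
    # The verify pass always yields one token; draft i only counts
    # if drafts 1..i were all accepted.
    return 1.0 + sum(acceptance_rate ** i for i in range(1, draft_tokens + 1))


def implied_step_cost(baseline_tps: float, mtp_tps: float,
                      acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Cost of one draft-and-verify step relative to one plain decode step,
    as implied by the measured throughput numbers."""
    speedup = mtp_tps / baseline_tps
    return expected_tokens_per_step(acceptance_rate, draft_tokens) / speedup


# Figures reported in the thread: 15.3 -> 23.3 tok/s at ~80.6% acceptance.
speedup = 23.3 / 15.3
tokens = expected_tokens_per_step(0.806)
cost = implied_step_cost(15.3, 23.3, 0.806)

print(f"measured speedup:         {speedup:.2f}x")
print(f"expected tokens per step: {tokens:.3f}")
print(f"implied per-step cost:    {cost:.2f}x a plain decode step")
```

Under these assumptions, each step emits about 1.8 tokens on average, so the measured ~1.5x speedup implies that drafting plus verification costs somewhat more than a plain decode step. That gap is the overhead to watch when comparing machines or quantization levels.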
What the upstream PR actually adds
The PR summary says Qwen3.5 checkpoints include a built-in Multi-Token Prediction head, exposed through mtp_num_hidden_layers: 1. That head predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. The practical implication is important: mlx-lm can use native speculative decoding without a separate draft model. Instead of serving and synchronizing two models, the runtime spends only one extra transformer layer of compute to draft additional tokens before verification.
- Model support is added in qwen3_5.py.
- generate.py adds an explicit --mtp flag.
- server.py exposes the same --mtp path for serving.
- The PR also adds cache rollback support and 8 unit tests.
The verification path is a key detail for practitioners. The generation loop proposes drafts, verifies them, and on rejection rolls back SSM state and trims KV cache entries. That sounds like an internal implementation detail, but it is exactly what determines whether speculative decoding remains correct under rejection and therefore whether it is safe to deploy beyond a benchmark script.
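The propose, verify, and roll back cycle can be illustrated with a toy loop. This is a minimal sketch, not the mlx-lm implementation: `target_next`, `draft_next`, and `ToyCache` are hypothetical stand-ins for the backbone, the MTP head, and the combined KV/SSM cache. It calls the toy backbone separately for each position, so it demonstrates correctness under rejection rather than speed.

```python
from dataclasses import dataclass, field

VOCAB = 50


def target_next(tokens):
    """Toy stand-in for the backbone's greedy next token."""
    return (sum(tokens) * 31 + len(tokens) * 7) % VOCAB


def draft_next(tokens, next_token):
    """Toy stand-in for the MTP head: guesses the token after `next_token`.
    It is deliberately wrong at some positions so that rejections occur."""
    guess = target_next(tokens + [next_token])
    return guess if len(tokens) % 5 else (guess + 1) % VOCAB


@dataclass
class ToyCache:
    kv: list = field(default_factory=list)  # one entry per token (KV-cache stand-in)
    state: int = 0                          # running recurrent state (SSM stand-in)

    def append(self, token):
        self.kv.append(token)
        self.state = (self.state * 33 + token) % 1_000_003

    def snapshot(self):
        return len(self.kv), self.state

    def rollback(self, snap):
        length, state = snap
        del self.kv[length:]  # KV entries can simply be trimmed
        self.state = state    # recurrent state must be restored from a snapshot


def generate_plain(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def generate_speculative(prompt, n):
    cache = ToyCache()
    for t in prompt:
        cache.append(t)
    out, accepted, proposed = [], 0, 0
    while len(out) < n:
        nxt = target_next(cache.kv)        # token t+1 from the backbone
        draft = draft_next(cache.kv, nxt)  # guess for token t+2
        cache.append(nxt)
        snap = cache.snapshot()            # point to return to on rejection
        cache.append(draft)                # speculatively process the draft
        verified = target_next(cache.kv[:-1])  # what the backbone wants at t+2
        out.append(nxt)
        proposed += 1
        if draft == verified:
            accepted += 1
            out.append(draft)
        else:
            cache.rollback(snap)           # discard the rejected draft's effects
    return out[:n], accepted / proposed


plain = generate_plain([1, 2, 3], 20)
spec, rate = generate_speculative([1, 2, 3], 20)
print("outputs match:", spec == plain)
print(f"acceptance rate: {rate:.2f}")
```

The property to check is that the speculative output is identical to plain greedy decoding regardless of how often drafts are rejected. The KV list can be trimmed by length, while the recurrent state has to be restored from a saved copy, which is why rollback support is more involved for models with SSM-style layers.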
Why the latency tradeoff matters
The appeal is straightforward. Native MTP is operationally simpler than two-model speculative decoding, it is available from both generation and server entry points, and the reported single-stream gain on a Mac is large enough to matter. But the limitations are just as important. The PR is still open, it requires checkpoints converted with MTP weights preserved, batching is disabled when MTP is active, and the PR summary notes that MoE variants had not yet been tested. That makes this feature most relevant for interactive local inference, developer workstations, and low-concurrency deployments where one active stream matters more than aggregate server throughput.
- mlx_lm.generate --model <path> --mtp and mlx_lm.server --model <path> --mtp are simple to try, but evaluation should include more than a quick benchmark.
- Measure acceptance rate alongside tok/s, because throughput gains depend on how often draft tokens survive verification.
- Compare lower single-request latency against the loss of batch serving.
- Validate converted checkpoints before assuming the feature exists in your local model build.
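The checkpoint validation step can be partly automated. The sketch below assumes a Hugging Face-style checkpoint directory containing config.json and, for sharded weights, model.safetensors.index.json. The `mtp_num_hidden_layers` key comes from the PR summary; the idea that MTP tensors carry "mtp" in their names is an assumption to verify against your own converted checkpoint, and the helper names are illustrative.

```python
import json
from pathlib import Path
from typing import Optional


def find_key(obj, key):
    """Recursively search nested dicts (for example a nested text config) for `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    return None


def mtp_report(config: dict, weight_names: Optional[list] = None) -> dict:
    """Summarize whether a checkpoint appears to carry an MTP head."""
    layers = find_key(config, "mtp_num_hidden_layers") or 0
    report = {"config_declares_mtp": layers > 0, "mtp_weight_count": None}
    if weight_names is not None:
        # ASSUMPTION: MTP tensors have "mtp" in their names; adjust as needed.
        report["mtp_weight_count"] = sum("mtp" in n.lower() for n in weight_names)
    return report


def check_checkpoint(model_dir: str) -> dict:
    """Read config.json and, if present, the sharded weight index from `model_dir`."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index_path = root / "model.safetensors.index.json"
    names = None  # stays None when there is no index (single-file checkpoint)
    if index_path.exists():
        names = list(json.loads(index_path.read_text())["weight_map"])
    return mtp_report(config, names)


# Example with in-memory data instead of a real checkpoint:
example = mtp_report(
    {"text_config": {"mtp_num_hidden_layers": 1}},
    ["model.layers.0.mlp.weight", "mtp.layers.0.mlp.weight"],
)
print(example)
```

A config that declares MTP layers alongside a weight count of zero is the case to look for: it would suggest the conversion dropped the MTP weights. A count of `None` means no weight index was found, so the tensor names need to be inspected another way.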
Why the Reddit thread mattered
The thread added value beyond linking GitHub. It framed the benchmark in operator language, surfaced community reaction to whether ~80.6% acceptance is high enough to leave enabled by default, and connected mlx-lm to broader runtime work. A top Reddit comment pointed to a similar llama.cpp PR #20700, which suggests that Multi-Token Prediction is becoming a cross-runtime concern for local LLM inference rather than a one-off experiment.
That community signal is useful when deciding where to spend optimization effort. A feature that improves one user session can still be the wrong choice for a shared service if it disables batching, so the real question is not just “is MTP faster?” but “faster for which workload shape?” Upstream implementation details are in https://github.com/ml-explore/mlx-lm/pull/990, and the Reddit discussion that made the change visible to practitioners is at https://www.reddit.com/r/LocalLLaMA/comments/1rzntv5/multitoken_prediction_mtp_for_qwen35_is_coming_to/.