r/LocalLLaMA Spots Native MTP for Qwen3.5 in mlx-lm and Faster Single-Stream Inference

Original post: Multi-Token Prediction (MTP) for qwen-3.5 is coming to mlx-lm

LLM · Mar 21, 2026 · By Insights AI (Reddit) · 3 min read

What surfaced in r/LocalLLaMA

A Reddit post in r/LocalLLaMA, at 99 points and 17 comments at capture time, highlighted an open change to mlx-lm: PR #990, titled "feat: native MTP speculative decoding for Qwen3.5". The thread mattered because it translated a code-level change into numbers that local inference operators immediately understand. The headline benchmark was 15.3 → 23.3 tok/s (roughly a 1.5x throughput boost) with a ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. For people running Apple Silicon setups, that is more than an implementation curiosity: it suggests that a runtime feature, rather than a model swap, may materially reduce latency for interactive generation.

What the upstream PR actually adds

The PR summary says Qwen3.5 checkpoints include a built-in Multi-Token Prediction head, exposed through mtp_num_hidden_layers: 1. That head predicts token t+2 from the backbone hidden state at t and the embedding of token t+1. The practical implication is important: mlx-lm can use native speculative decoding without a separate draft model. Instead of serving and synchronizing two models, the runtime spends only one extra transformer layer of compute to draft additional tokens before verification.
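The wiring described above can be sketched in a few lines: the MTP head fuses the backbone hidden state at position t with the embedding of the already-sampled token t+1, runs them through one extra layer, and produces logits for token t+2. This is a toy illustration in plain Python under assumed shapes and layer names; it is not the actual mlx-lm implementation.

```python
import random

random.seed(0)
d_model, vocab = 8, 50

# Hypothetical weights for the extra MTP layer and the output head
# (names and shapes are illustrative, not taken from the PR).
W_merge = [[random.gauss(0, 0.2) for _ in range(d_model)] for _ in range(2 * d_model)]
W_out = [[random.gauss(0, 0.2) for _ in range(vocab)] for _ in range(d_model)]

def matvec(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def mtp_draft(hidden_t, emb_t1):
    """Draft logits for token t+2 from the backbone hidden state at
    position t and the embedding of the sampled token t+1."""
    x = hidden_t + emb_t1                              # concatenate the two inputs
    h = [max(0.0, v) for v in matvec(x, W_merge)]      # stand-in for the one extra layer
    return matvec(h, W_out)                            # LM head -> vocab logits

hidden_t = [random.gauss(0, 1) for _ in range(d_model)]
emb_t1 = [random.gauss(0, 1) for _ in range(d_model)]
logits = mtp_draft(hidden_t, emb_t1)
draft_token = max(range(vocab), key=logits.__getitem__)  # greedy draft for t+2
```

The point of the sketch is the cost model: drafting reuses the backbone's hidden state, so the only extra compute per draft token is the one merge layer plus the output projection, rather than a full second model's forward pass.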

  • Model support is added in qwen3_5.py.
  • generate.py adds an explicit --mtp flag.
  • server.py exposes the same --mtp path for serving.
  • The PR also adds cache rollback support and 8 unit tests.

The verification path is a key detail for practitioners. The generation loop proposes drafts, verifies them, and on rejection rolls back SSM state and trims KV cache entries. That sounds like an internal implementation detail, but it is exactly what determines whether speculative decoding remains correct under rejection and therefore whether it is safe to deploy beyond a benchmark script.
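The propose/verify/rollback loop can be illustrated with a toy model. The sketch below assumes a deterministic target and a plain list standing in for the KV cache; the real PR additionally rolls back SSM state, which is omitted here.

```python
def speculative_step(target_next, draft_tokens, kv_cache):
    """Toy propose/verify loop: accept draft tokens while they match what
    the target model would emit; on the first mismatch, trim the cache
    back to the last verified position. `target_next(prefix)` is a
    stand-in for one verified decode step of the target model."""
    base = len(kv_cache)
    kv_cache.extend(draft_tokens)        # optimistically extend the cache
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if target_next(accepted) == tok:
            accepted.append(tok)         # draft survives verification
        else:
            del kv_cache[base + i:]      # trim the rejected entries
            break
    return accepted, kv_cache

# Target deterministically emits 1, 2, 3, ...; the draft guesses [1, 2, 9].
cache = []
acc, cache = speculative_step(lambda prefix: len(prefix) + 1, [1, 2, 9], cache)
print(acc, cache)   # -> [1, 2] [1, 2]: two drafts accepted, cache trimmed
```

The correctness property the PR's rollback machinery is protecting is visible even in the toy: after a rejection, the cache must contain exactly the verified prefix, or every subsequent decode step attends to tokens the model never actually emitted.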

Why the latency tradeoff matters

The appeal is straightforward. Native MTP is operationally simpler than two-model speculative decoding, it is available from both the generation and server entry points, and the reported single-stream gain on a Mac is large enough to matter. But the limitations are just as important. The PR is still open, it requires checkpoints converted with the MTP weights preserved, batching is disabled when MTP is active, and the PR summary notes that MoE variants have not yet been tested. That makes the feature most relevant for interactive local inference, developer workstations, and low-concurrency deployments where one active stream matters more than aggregate server throughput.

  • mlx_lm.generate --model <path> --mtp and mlx_lm.server --model <path> --mtp are simple to try, but evaluation should include more than a quick benchmark.
  • Measure acceptance rate alongside tok/s, because throughput gains depend on how often draft tokens survive verification.
  • Compare lower single-request latency against the loss of batch serving.
  • Validate converted checkpoints before assuming the feature exists in your local model build.
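As a back-of-envelope check on why acceptance rate belongs next to tok/s: with a single-token MTP draft, each verification pass emits one guaranteed token plus the draft token whenever it is accepted, so the speedup ceiling is roughly 1 + acceptance. The crude model below is my own arithmetic sketch and ignores draft-head and rollback overhead.

```python
def expected_tokens_per_step(acceptance, draft_len=1):
    """Expected tokens emitted per target forward pass when drafting
    `draft_len` tokens, treating each consecutive draft as accepted with
    probability `acceptance` (overhead ignored)."""
    # One token always survives verification; each additional draft token
    # lands only if it and all drafts before it were accepted.
    return 1 + sum(acceptance ** (k + 1) for k in range(draft_len))

# With ~80.6% acceptance and a single-token draft, the ceiling is ~1.8x.
print(round(expected_tokens_per_step(0.806), 3))   # -> 1.806
```

The reported 23.3 / 15.3 ≈ 1.52x sits below this ~1.8x ceiling, which is consistent with the drafting and verification overhead the runtime actually pays; if your measured acceptance rate drops much below ~80%, the gain shrinks quickly.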

Why the Reddit thread mattered

The thread added value beyond linking GitHub. It framed the benchmark in operator language, surfaced community reaction to whether ~80.6% acceptance is high enough to leave enabled by default, and connected mlx-lm to broader runtime work. A top Reddit comment pointed to a similar llama.cpp PR #20700, which suggests that Multi-Token Prediction is becoming a cross-runtime concern for local LLM inference rather than a one-off experiment.

That community signal is useful when deciding where to spend optimization effort. A feature that improves one user session can still be the wrong choice for a shared service if it disables batching, so the real question is not just “is MTP faster?” but “faster for which workload shape?” Upstream implementation details are in https://github.com/ml-explore/mlx-lm/pull/990, and the Reddit discussion that made the change visible to practitioners is at https://www.reddit.com/r/LocalLLaMA/comments/1rzntv5/multitoken_prediction_mtp_for_qwen35_is_coming_to/.


Related Articles

LLM Reddit 4d ago 2 min read

A high-engagement r/LocalLLaMA post highlighted Unsloth Studio, a beta open-source web UI that aims to train, run, and export open models from one local interface. The discussion framed it as a possible LM Studio challenger in the GGUF ecosystem, while top commenters noted that many advanced users still lean on vLLM or direct llama.cpp workflows.

LLM Reddit Mar 14, 2026 2 min read

A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.


© 2026 Insights. All rights reserved.