#mtp

LLM Reddit 4h ago 3 min read

A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (~1.5x throughput boost) with ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.

© 2026 Insights. All rights reserved.