Google has released open-weight MTP drafter models for Gemma 4 31B and 26B-A4B, enabling speculative decoding to significantly boost inference speed without affecting output quality.
#mtp
LLM Hacker News 2h ago 1 min read
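Why drafter-based speculative decoding leaves outputs unchanged: the target model verifies every drafted token and only keeps the ones it would have generated itself. Below is a minimal greedy-decoding sketch with toy stand-in models; it is purely illustrative and not the Gemma drafter or any real inference stack.

```python
# Toy, deterministic stand-ins for the two models -- assumptions for
# illustration only, not the Gemma 4 / MTP drafter implementation.

def target_next(tokens):
    """'Large' target model: deterministic next-token rule."""
    return (sum(tokens) * 31 + 7) % 101

def draft_next(tokens):
    """Cheap drafter: agrees with the target most of the time."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 101  # occasional miss

def greedy_decode(prompt, n_new):
    """Baseline: the target model decoding alone."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=3):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Drafter proposes k tokens cheaply.
        ctx, draft = list(tokens), []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2) Target verifies the draft: keep the longest matching prefix,
        #    and emit the target's own token at the first mismatch.
        for t in draft:
            expected = target_next(tokens)
            if t == expected:
                tokens.append(t)            # draft token accepted
            else:
                tokens.append(expected)     # corrected by the target
                break
        else:
            tokens.append(target_next(tokens))  # bonus token after a full accept
    return tokens[: len(prompt) + n_new]

prompt = [5, 12, 9]
assert speculative_decode(prompt, 30) == greedy_decode(prompt, 30)
print("speculative output == target-only greedy output")
```

The drafter only changes how many target forward passes are needed, not which tokens come out; production implementations extend the same idea to sampling via rejection sampling.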
LLM Reddit 1d ago 1 min read
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Together with maturing tensor-parallel support, this is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
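For context on the tensor-parallel half of that claim, here is a minimal NumPy sketch of column-parallel weight sharding, the basic building block of tensor parallelism; it is a conceptual illustration, not llama.cpp's or vLLM's implementation.

```python
import numpy as np

# Column-parallel sharding: each "device" holds a vertical slice of the
# weight matrix and computes its slice of the output independently.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations: (batch, hidden)
W = rng.standard_normal((512, 2048))     # full weight: (hidden, ffn)

n_devices = 4
shards = np.split(W, n_devices, axis=1)  # 2048 / 4 = 512 columns per device
partials = [x @ w for w in shards]       # runs in parallel on real hardware
y_tp = np.concatenate(partials, axis=1)  # all-gather of the output slices

assert np.allclose(y_tp, x @ W)          # matches the unsharded result
```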
LLM Reddit Mar 21, 2026 3 min read
A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (a ~1.5x throughput gain) at an ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.
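A quick back-of-the-envelope check on those numbers (assuming a single-token MTP draft per step, which the thread does not state explicitly):

```python
# Reported figures from the thread; the 1-token-draft assumption below is
# ours, not the PR's.
baseline_tps    = 15.3   # tok/s, plain decoding
speculative_tps = 23.3   # tok/s, MTP speculative decoding
acceptance      = 0.806  # reported acceptance rate

observed_speedup = speculative_tps / baseline_tps
print(f"observed speedup: {observed_speedup:.2f}x")        # ~1.52x

# With a 1-token draft, each target pass yields the drafted token (when
# accepted) plus one token of its own, so the ceiling -- ignoring the cost
# of the MTP head and assuming a 2-position verify pass costs about the
# same as a 1-token decode step -- is 1 + acceptance.
ceiling = 1 + acceptance
print(f"ceiling at 80.6% acceptance: {ceiling:.2f}x")      # ~1.81x

# The gap between ~1.81x and ~1.52x is drafting/verification overhead.
print(f"overhead eats ~{1 - observed_speedup / ceiling:.0%} of the ideal gain")
```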