Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM With llama.cpp MTP
The Achievement
A post on r/LocalLLaMA has attracted significant attention for pushing Qwen3.6 35B A3B past 80 tokens per second with a 128K context window on a consumer GPU with 12GB of VRAM. Running a 35-billion-parameter model at this speed on entry-level hardware would have been impractical just months ago.
The Key: llama.cpp MTP
The breakthrough comes from Multi-Token Prediction (MTP) support added in a recent llama.cpp pull request. MTP uses a draft model to predict several tokens ahead, then has the main model verify them in a single batched forward pass; the post reports a draft acceptance rate above 80%. Because rejected draft tokens are simply replaced by the main model's own choices, this dramatically improves effective throughput without degrading output quality.
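To make the draft-and-verify idea concrete, here is a minimal Python sketch of a greedy speculative step of this kind. It is not llama.cpp's actual implementation; `draft_next` and `main_predict` are hypothetical stand-ins for the draft predictor and the batched main-model pass.

```python
def generate_step(context, draft_next, main_predict, k=4):
    """One draft-and-verify step: draft k tokens cheaply, verify them in one batched pass.

    draft_next(tokens)          -> next token from the cheap draft predictor
    main_predict(ctx, drafted)  -> main model's greedy choice at each drafted position,
                                   computed in a single batched forward pass
    Both callables are hypothetical stand-ins, not real llama.cpp APIs.
    """
    # Draft phase: propose k tokens autoregressively with the cheap predictor.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verify phase: one batched main-model pass scores all k positions at once.
    main_choice = main_predict(context, drafted)

    # Accept the longest prefix on which the main model agrees with the draft;
    # at the first disagreement, keep the main model's token instead.
    accepted = []
    for d, m in zip(drafted, main_choice):
        accepted.append(m)
        if d != m:
            break
    return accepted


# Toy usage with stand-in predictors (integers as "tokens"):
next_plus_one = lambda toks: toks[-1] + 1                          # draft guesses "+1"
main = lambda ctx, dr: [ctx[-1] + i + 1 for i in range(len(dr))]   # main model agrees here
print(generate_step([1, 2, 3], next_plus_one, main))               # -> [4, 5, 6, 7]
```

The key design point is that the main model's single batched verification pass costs roughly as much as generating one token, so every extra accepted draft token is nearly free, and the accept-or-replace rule guarantees the final text matches what the main model alone would have produced under greedy decoding.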
Configuration Highlights
- Model: Qwen3.6 35B A3B (quantized)
- Context: 128K tokens
- Speed: 80+ tokens/sec
- Draft acceptance rate: 80%+ (see the rough speedup arithmetic after this list)
- Required VRAM: 12GB
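As a rough illustration of why an 80%+ acceptance rate translates into such a large speedup, the standard speculative-decoding estimate for tokens committed per main-model pass is (1 - a^(k+1)) / (1 - a), where k is the draft length and a the per-token acceptance probability. The post does not state the draft length, so k = 4 below is an assumption, and a = 0.8 stands in for the reported "80%+".

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per main-model forward pass when k tokens are
    drafted and each is accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Assumed draft length k = 4 and the reported ~0.8 acceptance rate:
print(round(expected_tokens_per_pass(0.8, 4), 2))  # -> 3.36 tokens per pass
```

Under those assumptions each main-model pass commits roughly 3.4 tokens instead of 1, so, drafting overhead aside, an MTP-enabled run can plausibly be several times faster than plain token-by-token decoding.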
Why It Matters
12GB VRAM corresponds to consumer GPUs in the RTX 3060-4070 class. Reaching practical speeds with a 35B model on this tier of hardware marks meaningful progress for local AI democratization.
Related Articles
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, it is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving a 2.5x inference speedup and a 262K context on 48GB of memory.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.