110 tok/s on a 35B Model with 12GB VRAM Using ik_llama.cpp
Original: 110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp
The Achievement
A LocalLLaMA user shared benchmarks showing 110 tokens/second running Qwen3.6 35B A3B on a single RTX 4070 Super 12GB with ik_llama.cpp, a fork by ikawrakow focused on CPU offload optimization. At that rate, a 35B model is fast enough for practical use on consumer hardware.
Why Switch from Upstream llama.cpp?
The user had solid Multi-Token Prediction (MTP) performance with llama.cpp until the MTP PR merged into main, at which point throughput dropped to barely above non-MTP speeds. Switching to ik_llama.cpp restored and then surpassed the earlier performance. In the user's comparison, on the same hardware and quantization (byteshape's Qwen3.6-35B-A3B IQ4_XS-4.19bpw), upstream llama.cpp reaches roughly 80-89 tok/s, while ik_llama.cpp reaches 110 tok/s.
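To put the two figures side by side, the relative gain can be worked out directly from the reported numbers. The short sketch below is only arithmetic on the throughput values quoted above; the function and variable names are illustrative and not part of either project.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Throughput figures as reported in the post (tokens per second).
UPSTREAM_LOW_TPS = 80.0
UPSTREAM_HIGH_TPS = 89.0
IK_TPS = 110.0

# Gain measured against the slowest and fastest upstream results.
gain_vs_low = speedup_percent(UPSTREAM_LOW_TPS, IK_TPS)
gain_vs_high = speedup_percent(UPSTREAM_HIGH_TPS, IK_TPS)

print(f"ik_llama.cpp vs upstream at 80 tok/s: +{gain_vs_low:.1f}%")
print(f"ik_llama.cpp vs upstream at 89 tok/s: +{gain_vs_high:.1f}%")
```

Depending on which end of the upstream range is used as the baseline, the reported improvement works out to roughly 24% to 38%.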
System Specs
- GPU: RTX 4070 Super 12GB (CUDA 13.1.1)
- CPU: AMD Ryzen 7 9700X
- RAM: 48GB DDR5-6000 EXPO I
- OS: CachyOS with Plasma (X11)
Significance for Local AI
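Running a 35B MoE model at 110 tok/s on a single 12GB consumer GPU shows how far local inference has come. The MoE design helps: the A3B suffix indicates that only about 3B parameters are active per token, so much less data has to be read for each token than the full parameter count suggests. ik_llama.cpp's strength lies in its CPU offload optimization, which makes hybrid configurations (GPU VRAM plus system RAM) significantly more efficient than the upstream implementation.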
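A rough size estimate shows why a hybrid GPU-plus-RAM configuration is needed here at all. The sketch below assumes a nominal 35 billion parameters and applies the 4.19 bits per weight from the quantization name uniformly; it ignores the KV cache, compute buffers, and file metadata, so real usage will be higher. All names are illustrative.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# Assumption: nominal parameter count taken from the "35B" model name.
N_PARAMS = 35e9
# Bits per weight from the quantization label (IQ4_XS-4.19bpw).
BITS_PER_WEIGHT = 4.19
# VRAM on the RTX 4070 Super used in the post.
VRAM_GIB = 12.0

weights_gib = weight_size_gib(N_PARAMS, BITS_PER_WEIGHT)
# Portion of the weights that cannot fit in VRAM even with nothing else loaded.
overflow_gib = max(0.0, weights_gib - VRAM_GIB)

print(f"Estimated weights: {weights_gib:.1f} GiB")
print(f"Minimum spill to system RAM: {overflow_gib:.1f} GiB")
```

Under these assumptions the weights alone come to about 17 GiB, so at least 5 GiB has to live in system RAM, which is where the efficiency of the CPU offload path matters.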