110 tok/s on a 35B Model with 12GB VRAM Using ik_llama.cpp

Original: 110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

LLM | May 22, 2026 | By Insights AI (Reddit)

The Achievement

A LocalLLaMA user shared benchmarks showing 110 tokens/second running Qwen3.6 35B A3B on a single RTX 4070 Super 12GB with ik_llama.cpp, a fork by ikawrakow focused on CPU offload optimization. For a 35B model on consumer hardware, that is fast enough for practical interactive use.

Why Switch from Upstream llama.cpp?

The user reported solid Multi-Token Prediction (MTP) performance with upstream llama.cpp until the MTP PR merged into main, after which throughput dropped to barely above non-MTP speeds. Switching to ik_llama.cpp restored and then surpassed the earlier performance. On the same hardware and quantization (byteshape's Qwen3.6-35B-A3B IQ4_XS-4.19bpw), upstream llama.cpp achieves roughly 80-89 tok/s, while ik_llama.cpp reaches 110 tok/s.
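To put the two figures side by side, the relative gain can be worked out directly from the reported numbers. The snippet below is only a small arithmetic sketch; the function name `speedup_percent` is made up for illustration, and the inputs are the tok/s values quoted above rather than new measurements.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Figures as reported in the post (tokens per second).
upstream_low, upstream_high = 80.0, 89.0
ik_tps = 110.0

# Compare against the best and the worst upstream result.
gain_vs_best = speedup_percent(upstream_high, ik_tps)
gain_vs_worst = speedup_percent(upstream_low, ik_tps)

print(f"ik_llama.cpp gain: {gain_vs_best:.1f}% to {gain_vs_worst:.1f}%")
```

By this arithmetic, the reported 110 tok/s is roughly 24% to 38% above the upstream range, depending on which end of 80-89 tok/s is used as the baseline.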

System Specs

  • GPU: RTX 4070 Super 12GB (CUDA 13.1.1)
  • CPU: AMD Ryzen 7 9700X
  • RAM: 48GB DDR5-6000 EXPO I
  • OS: CachyOS with Plasma (X11)

Significance for Local AI

Running a 35B MoE model at 110 tok/s on a single consumer GPU shows how quickly local inference is advancing. ik_llama.cpp's strength lies in its CPU offload optimization, which makes hybrid configurations (GPU VRAM plus system RAM) significantly more efficient than the upstream implementation.
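A rough size estimate suggests why a hybrid configuration is needed here at all. The sketch below assumes the nominal 35B total parameter count and treats the 4.19 bpw in the quantization name as an average over all weights; both are simplifications, the function name is hypothetical, and the KV cache and runtime buffers are ignored, so the real footprint would be larger.

```python
GIB = 2**30  # bytes per GiB


def quantized_weights_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return total_params * bits_per_weight / 8 / GIB


# Assumptions for illustration: nominal parameter count and average bpw
# taken from the model and quantization names in the article.
TOTAL_PARAMS = 35e9
BITS_PER_WEIGHT = 4.19
VRAM_GIB = 12.0  # RTX 4070 Super

weights_gib = quantized_weights_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)
overflow_gib = weights_gib - VRAM_GIB

print(f"weights: ~{weights_gib:.1f} GiB, beyond VRAM: ~{overflow_gib:.1f} GiB")
```

Under these assumptions the weights alone come to about 17 GiB, so at least 5 GiB has to live in system RAM, which is where the efficiency of the CPU offload path matters.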
