110 tok/s on a 35B Model with 12GB VRAM Using ik_llama.cpp

The Achievement

A LocalLLaMA user shared benchmarks showing 110 tokens/second running Qwen3.6 35B A3B on a single RTX 4070 Super 12GB using ik_llama.cpp, a fork by ikawrakow focused on CPU offload optimization. At that rate, a 35B model is fast enough for interactive use on consumer hardware.

Why Switch from Upstream llama.cpp?

The user reports solid Multi-Token Prediction (MTP) performance with llama.cpp until the MTP PR merged into main, at which point throughput dropped to barely above non-MTP speeds. Switching to ik_llama.cpp restored and then surpassed the earlier performance. On the same hardware and quantization (byteshape's Qwen3.6-35B-A3B IQ4_XS-4.19bpw), upstream llama.cpp achieves roughly 80-89 tok/s, while ik_llama.cpp reaches 110 tok/s, which is about 24-38% faster.
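To make the comparison concrete, the reported figures can be turned into a relative speedup and a per-token latency. This is a minimal sketch that uses only the throughput numbers quoted above; the function and variable names are illustrative and do not come from the original post.

```python
# Throughput figures as reported in the post (tokens per second).
UPSTREAM_LOW_TPS = 80.0
UPSTREAM_HIGH_TPS = 89.0
IK_TPS = 110.0


def speedup_pct(new_tps: float, old_tps: float) -> float:
    """Relative throughput gain of new_tps over old_tps, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


def ms_per_token(tps: float) -> float:
    """Average generation time per token, in milliseconds."""
    return 1000.0 / tps


# The smallest gain is measured against the best upstream run,
# the largest gain against the worst upstream run.
min_gain = speedup_pct(IK_TPS, UPSTREAM_HIGH_TPS)
max_gain = speedup_pct(IK_TPS, UPSTREAM_LOW_TPS)

print(f"speedup: {min_gain:.1f}% to {max_gain:.1f}%")
print(f"ik_llama.cpp: {ms_per_token(IK_TPS):.2f} ms/token")
print(f"upstream:     {ms_per_token(UPSTREAM_HIGH_TPS):.2f} to "
      f"{ms_per_token(UPSTREAM_LOW_TPS):.2f} ms/token")
```

The range exists because the upstream result was itself reported as a range. A single-number comparison would require picking one upstream run, which the post does not do.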

System Specs

GPU: RTX 4070 Super 12GB (CUDA 13.1.1)
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I
OS: CachyOS with Plasma (X11)

Significance for Local AI

Running a 35B MoE model at 110 tok/s on a single consumer GPU shows how quickly local inference is advancing. ik_llama.cpp's strength lies in its CPU offload optimization, which makes hybrid configurations (GPU VRAM plus system RAM) significantly more efficient than the upstream implementation.
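A rough size estimate shows why a hybrid configuration is needed on this card. The sketch below assumes that "35B" is the nominal total parameter count and that the 4.19 bits-per-weight figure in the quantization name is an average over all weights. It also treats the 12GB of VRAM as 12 GiB and ignores the KV cache and compute buffers. All of these are simplifying assumptions, not details from the post.

```python
GIB = 2**30  # bytes per GiB


def quantized_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# Assumed inputs (nominal values taken from the model and quant names).
TOTAL_PARAMS = 35e9      # "35B"
BITS_PER_WEIGHT = 4.19   # "IQ4_XS-4.19bpw"
VRAM_GIB = 12.0          # RTX 4070 Super 12GB, treated as 12 GiB

weights_gib = quantized_weights_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)

# Lower bound on what must live in system RAM: the real figure is higher,
# because the KV cache and compute buffers also need VRAM.
min_offload_gib = max(0.0, weights_gib - VRAM_GIB)

print(f"weights: ~{weights_gib:.1f} GiB")
print(f"must be held outside VRAM: at least ~{min_offload_gib:.1f} GiB")
```

Under these assumptions the weights come to roughly 17 GiB, so at least 5 GiB must sit in system RAM. That is why the efficiency of the CPU offload path matters for this setup.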
