r/LocalLLaMA Benchmarks ik_llama.cpp at 26x Faster Qwen 3.5 Prompt Ingestion
Original: ik_llama.cpp gives 26x faster prompt processing on Qwen 3.5 27B — real world numbers
A community benchmark focused on prompt ingestion, not only generation
On March 22, 2026, a post on r/LocalLLaMA shared real-world numbers from a Lenovo ThinkStation P520 with a Xeon W-2295, 128GB DDR4 ECC, and an NVIDIA RTX PRO 4000 Blackwell 24GB. The setup ran Qwen 3.5 27B Q4_K_M for agentic coding with a 131,072-token context and a q8_0/q4_0 quantized KV cache. The reported result was striking: switching from mainline llama.cpp b8457 to ik_llama.cpp b4370 lifted prompt evaluation from roughly 43 tok/sec to 1,122 tok/sec, while generation moved from about 7.5 tok/sec to 26 tok/sec.
The post stresses that the gain does not come from changing the model: the weights are identical, and only the server implementation changed. According to the benchmark, mainline llama.cpp was splitting Qwen 3.5's hybrid Gated Delta Network and Mamba-style SSM path across 34 graph nodes with substantial CPU participation. The ik_llama.cpp fork instead uses fused GDN CUDA kernels, cutting graph splits from 34 to 2, which leaves the CPU mostly idle and moves prompt processing fully onto the GPU.
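The scale of the difference is easiest to see as wall-clock time. A back-of-the-envelope sketch using only the throughput figures reported in the post:

```python
# How long does it take to ingest a full 131,072-token context at the
# two reported prompt-evaluation speeds?
CONTEXT_TOKENS = 131_072

mainline_pp = 43   # tok/sec, mainline llama.cpp b8457 (reported)
fork_pp = 1_122    # tok/sec, ik_llama.cpp b4370 (reported)

mainline_secs = CONTEXT_TOKENS / mainline_pp  # ~3,048 s, about 51 minutes
fork_secs = CONTEXT_TOKENS / fork_pp          # ~117 s, about 2 minutes

speedup = fork_pp / mainline_pp               # ~26.1x, matching the headline

print(f"mainline: {mainline_secs / 60:.1f} min, "
      f"fork: {fork_secs / 60:.1f} min, speedup: {speedup:.1f}x")
```

In other words, filling the full context once drops from the better part of an hour to about two minutes, which is where the 26x headline figure comes from.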
Why this matters for local agent workflows
The discussion is useful because prompt ingestion is often the hidden bottleneck in coding assistants and other agentic tools. Long-context local workflows repeatedly re-read large codebases, plans, and tool traces. In that setting, faster prompt processing can matter more than raw decode speed. The original poster said the improvement makes 131K-context local agent work feel practical instead of painfully slow, which is a more operational metric than synthetic short-prompt benchmarks.
The post also includes an important caveat. Qwen 3.5's recurrent architecture still appears to trigger full prompt re-processing whenever the prompt changes, tracked in llama.cpp issue #20225. In a follow-up comment, the author said prompt ingestion still held above 950 tok/sec around 46K tokens, but generation slowed to roughly 20 tok/sec at that size versus 26 tok/sec around 10K tokens. So the fork reduces one major bottleneck without removing the architectural cost of long-session re-ingestion.
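Those follow-up numbers can be turned into a rough per-turn cost model for a long agent session. A sketch, assuming the full 46K-token prompt is re-processed every turn (per issue #20225) and a hypothetical 1,000-token reply; the reply length is an assumption, not a figure from the post:

```python
# Rough per-turn latency at ~46K tokens of context, using the author's
# follow-up numbers: ~950 tok/sec prompt eval and ~20 tok/sec generation.
# Because the recurrent architecture triggers full prompt re-processing
# whenever the prompt changes, the whole 46K tokens is paid again each turn.
PROMPT_TOKENS = 46_000   # reported session size
OUTPUT_TOKENS = 1_000    # hypothetical reply length (assumption)

pp_speed = 950           # tok/sec prompt eval (reported)
tg_speed = 20            # tok/sec generation (reported)

ingest_secs = PROMPT_TOKENS / pp_speed  # ~48 s just to re-read the context
gen_secs = OUTPUT_TOKENS / tg_speed     # ~50 s to generate the reply

print(f"per turn: {ingest_secs:.0f}s ingest + {gen_secs:.0f}s generate "
      f"= {ingest_secs + gen_secs:.0f}s total")
```

Under these assumptions, re-ingestion and generation each cost on the order of a minute per turn, which illustrates why the architectural caveat still matters even after the 26x ingestion speedup.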
The practical takeaway
For people running Qwen 3.5 locally, the message from r/LocalLLaMA is straightforward: if you are judging the model through mainline llama.cpp alone, you may be benchmarking the runtime more than the model. The thread points to prebuilt Windows CUDA 12.8 binaries from the Thireus fork and describes the release as a drop-in replacement for llama-server with the same command-line arguments and the same OpenAI-compatible API surface.
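Since the release is described as a drop-in replacement with the same OpenAI-compatible API surface, existing clients should not need code changes. A minimal sketch of a chat-completions call against a locally running server; the port, model name, and prompt are assumptions, and the request only succeeds once the server binary is actually listening:

```python
import json
import urllib.request

# Assumed local endpoint: llama-server listens on port 8080 by default
# and exposes the standard OpenAI-compatible /v1/chat/completions route.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "qwen3.5-27b",  # placeholder name; the server runs whatever model it loaded
    "messages": [
        {"role": "user", "content": "Summarize this repo's build steps."}
    ],
    "max_tokens": 256,
}

def chat(url: str = URL) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a server running, the reply text would be at:
# chat()["choices"][0]["message"]["content"]
```

Because the fork keeps the same arguments and API surface, the same client code should work unchanged whether it is pointed at mainline llama-server or the ik_llama.cpp build.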
Source: r/LocalLLaMA discussion. Related release: Thireus/ik_llama.cpp.
Related Articles
A few weeks after release, r/LocalLLaMA is converging on task-specific sampler and reasoning-budget presets for Qwen3.5 rather than one default setup.
A Hacker News post surfaced Unsloth's Qwen3.5 local guide, which lays out memory targets, reasoning-mode controls, and llama.cpp commands for running 27B and 35B-A3B models on local hardware.
A popular r/LocalLLaMA post highlighted a community merge of uncensored and reasoning-distilled Qwen 3.5 9B checkpoints, underscoring the appetite for behavior-tuned small local models.