r/LocalLLaMA Benchmarks ik_llama.cpp at 26x Faster Qwen 3.5 Prompt Ingestion
Original: ik_llama.cpp gives 26x faster prompt processing on Qwen 3.5 27B — real world numbers
A community benchmark focused on prompt ingestion, not only generation
On March 22, 2026, a post on r/LocalLLaMA shared real-world numbers from a Lenovo ThinkStation P520 with a Xeon W-2295, 128GB DDR4 ECC, and an NVIDIA RTX PRO 4000 Blackwell 24GB. The setup ran Qwen 3.5 27B Q4_K_M for agentic coding with a 131,072-token context and a q8_0/q4_0 quantized KV cache. The reported result was striking: switching from mainline llama.cpp b8457 to ik_llama.cpp b4370 lifted prompt evaluation from roughly 43 tok/sec to 1,122 tok/sec, while generation moved from about 7.5 tok/sec to 26 tok/sec.
The post stresses that the gain does not come from changing the model: the weights are identical, and only the server implementation changed. According to the benchmark, mainline llama.cpp was splitting Qwen 3.5's hybrid Gated Delta Network and Mamba-style SSM path across 34 graph nodes with substantial CPU participation. The ik_llama.cpp fork instead uses fused GDN CUDA kernels, cutting graph splits from 34 to 2, which leaves the CPU mostly idle and moves prompt processing fully onto the GPU.
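The scale of the difference is easiest to see as wall-clock time. A back-of-the-envelope sketch using only the throughput figures reported in the post:

```python
# How long does it take to ingest a full 131,072-token context at the
# two reported prompt-evaluation speeds?
CONTEXT_TOKENS = 131_072

mainline_pp = 43   # tok/sec, mainline llama.cpp b8457 (reported)
fork_pp = 1_122    # tok/sec, ik_llama.cpp b4370 (reported)

mainline_secs = CONTEXT_TOKENS / mainline_pp  # ~3,048 s, about 51 minutes
fork_secs = CONTEXT_TOKENS / fork_pp          # ~117 s, about 2 minutes

speedup = fork_pp / mainline_pp               # ~26.1x, matching the headline

print(f"mainline: {mainline_secs / 60:.1f} min, "
      f"fork: {fork_secs / 60:.1f} min, speedup: {speedup:.1f}x")
```

In other words, filling the full context once drops from the better part of an hour to about two minutes, which is where the 26x headline figure comes from.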
Why this matters for local agent workflows
The discussion is useful because prompt ingestion is often the hidden bottleneck in coding assistants and other agentic tools. Long-context local workflows repeatedly re-read large codebases, plans, and tool traces. In that setting, faster prompt processing can matter more than raw decode speed. The original poster said the improvement makes 131K-context local agent work feel practical instead of painfully slow, which is a more operational metric than synthetic short-prompt benchmarks.
The post also includes an important caveat. Qwen 3.5's recurrent architecture still appears to trigger full prompt re-processing whenever the prompt changes, tracked in llama.cpp issue #20225. In a follow-up comment, the author said prompt ingestion still held above 950 tok/sec around 46K tokens, but generation slowed to roughly 20 tok/sec at that size versus 26 tok/sec around 10K tokens. So the fork reduces one major bottleneck without removing the architectural cost of long-session re-ingestion.
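Those follow-up numbers can be turned into a rough per-turn cost model for a long agent session. A sketch, assuming the full 46K-token prompt is re-processed every turn (per issue #20225) and a hypothetical 1,000-token reply; the reply length is an assumption, not a figure from the post:

```python
# Rough per-turn latency at ~46K tokens of context, using the author's
# follow-up numbers: ~950 tok/sec prompt eval and ~20 tok/sec generation.
# Because the recurrent architecture triggers full prompt re-processing
# whenever the prompt changes, the whole 46K tokens is paid again each turn.
PROMPT_TOKENS = 46_000   # reported session size
OUTPUT_TOKENS = 1_000    # hypothetical reply length (assumption)

pp_speed = 950           # tok/sec prompt eval (reported)
tg_speed = 20            # tok/sec generation (reported)

ingest_secs = PROMPT_TOKENS / pp_speed  # ~48 s just to re-read the context
gen_secs = OUTPUT_TOKENS / tg_speed     # ~50 s to generate the reply

print(f"per turn: {ingest_secs:.0f}s ingest + {gen_secs:.0f}s generate "
      f"= {ingest_secs + gen_secs:.0f}s total")
```

Under these assumptions, re-ingestion and generation each cost on the order of a minute per turn, which illustrates why the architectural caveat still matters even after the 26x ingestion speedup.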
The practical takeaway
For people running Qwen 3.5 locally, the message from r/LocalLLaMA is straightforward: if you are judging the model through mainline llama.cpp alone, you may be benchmarking the runtime more than the model. The thread points to prebuilt Windows CUDA 12.8 binaries from the Thireus fork and describes the release as a drop-in replacement for llama-server with the same command-line arguments and the same OpenAI-compatible API surface.
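Since the release is described as a drop-in replacement with the same OpenAI-compatible API surface, existing clients should not need code changes. A minimal sketch of a chat-completions call against a locally running server; the port, model name, and prompt are assumptions, and the request only succeeds once the server binary is actually listening:

```python
import json
import urllib.request

# Assumed local endpoint: llama-server listens on port 8080 by default
# and exposes the standard OpenAI-compatible /v1/chat/completions route.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "qwen3.5-27b",  # placeholder name; the server runs whatever model it loaded
    "messages": [
        {"role": "user", "content": "Summarize this repo's build steps."}
    ],
    "max_tokens": 256,
}

def chat(url: str = URL) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a server running, the reply text would be at:
# chat()["choices"][0]["message"]["content"]
```

Because the fork keeps the same arguments and API surface, the same client code should work unchanged whether it is pointed at mainline llama-server or the ik_llama.cpp build.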
Source: r/LocalLLaMA discussion. Related release: Thireus/ik_llama.cpp.
Related Articles
A few weeks after release, r/LocalLLaMA is converging on task-specific sampler and reasoning-budget presets for Qwen3.5 rather than one default setup.
A Hacker News post surfaced Unsloth's Qwen3.5 local guide, which lays out memory targets, reasoning-mode controls, and llama.cpp commands for running 27B and 35B-A3B models on local hardware.
A popular r/LocalLLaMA post highlighted a community merge of uncensored and reasoning-distilled Qwen 3.5 9B checkpoints, underscoring the appetite for behavior-tuned small local models.