LocalLLaMA Gets Excited About an LLM That Tunes Its Own llama.cpp Flags
Original: The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
A LocalLLaMA post about an LLM tuning its own llama.cpp flags landed because it had both a funny premise and real numbers. The author’s llm-server v2 adds --ai-tune, a loop where the model reads the available llama-server options, tries configurations, and caches the fastest result it finds.
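The try-measure-cache loop described above can be sketched roughly as follows. This is not the author's llm-server code: the function name `tune`, the cache file layout, and the cache key format are all hypothetical, and the benchmark is injected as a plain callable so the sketch makes no assumption about how a server is actually launched or timed.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Optional


def tune(
    candidates: Iterable[list[str]],
    benchmark: Callable[[list[str]], float],
    cache_path: Path,
    key: str,
) -> Optional[list[str]]:
    """Return the fastest argument list for `key`, reusing a cached result if present.

    `benchmark` is a hypothetical callable that launches the server with the
    given arguments and returns measured tokens per second. It may raise if
    the configuration fails to start (for example, out of memory).
    """
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key in cache:
        # A previous run already found a winner for this model and hardware.
        return cache[key]["args"]

    best_args, best_tps = None, 0.0
    for args in candidates:
        try:
            tps = benchmark(args)
        except Exception:
            continue  # treat a failed launch as "not a candidate"
        if tps > best_tps:  # configurations reporting 0 tok/s or less are ignored
            best_args, best_tps = list(args), tps

    if best_args is not None:
        cache[key] = {"args": best_args, "tok_per_s": best_tps}
        cache_path.write_text(json.dumps(cache, indent=2))
    return best_args
```

A caller would pass something like a model name plus a GPU list as `key`, so that a change of model or hardware triggers a fresh search instead of reusing stale settings.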
The reported hardware is an unusual but very LocalLLaMA setup: a 3090 Ti, a 4070, a 3060, and 128GB of RAM. On that machine, the author says Qwen3.5-122B moved from 4.1 tok/s on plain llama-server to 11.2 tok/s with v1 tuning and 17.47 tok/s with v2 AI tuning. Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 25.94 tok/s, then to 40.05 tok/s; the +54% in the post title matches that last step. A gemma-4-31B UD-Q4_K_XL run moved from 14.2 tok/s to 24.77 tok/s.
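For readers who want to compare the steps directly, the reported figures convert to speedups as in the short sketch below. The numbers are the ones quoted above; treating the three Qwen3.5-27B figures as plain, v1, and v2 (the same order as the 122B run) is an assumption, and the `speedup` helper is purely illustrative.

```python
def speedup(before: float, after: float) -> tuple[float, float]:
    """Return (ratio, percent gain) going from `before` to `after` tok/s."""
    ratio = after / before
    return ratio, (ratio - 1.0) * 100.0


# Reported tok/s as (plain llama-server, v1 tuning, v2 AI tuning).
# The ordering for the 27B run is assumed to mirror the 122B run.
three_stage_runs = {
    "Qwen3.5-122B": (4.1, 11.2, 17.47),
    "Qwen3.5-27B Q4_K_M": (18.5, 25.94, 40.05),
}

for name, (plain, v1, v2) in three_stage_runs.items():
    overall_ratio, overall_pct = speedup(plain, v2)
    step_ratio, step_pct = speedup(v1, v2)
    print(f"{name}: plain -> v2 {overall_ratio:.2f}x (+{overall_pct:.0f}%), "
          f"v1 -> v2 {step_ratio:.2f}x (+{step_pct:.0f}%)")

# The gemma run is reported with only two figures.
gemma_ratio, gemma_pct = speedup(14.2, 24.77)
print(f"gemma-4-31B UD-Q4_K_XL: {gemma_ratio:.2f}x (+{gemma_pct:.0f}%)")
```

Read this way, the 27B model gains about 54% between the two tuned versions and a little more than 2x against the untuned baseline.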
The practical idea is to reduce the amount of tuning knowledge a user has to carry around. llama.cpp and ik_llama.cpp keep adding flags for offload, tensor splits, context handling, and MoE behavior. On multi-GPU systems, the right layer split or tensor placement can be hard to guess and easy to break when the runtime changes. By feeding llama-server --help into the loop, the author argues the tuner can notice new flags as they appear and add them to the search.
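One way to picture the "notice new flags" step is a small parser that pulls long option names out of the help text and diffs them against a known set. The sketch below is an illustration only: the function names and the regular expression are hypothetical, the post does not say how llm-server parses help output, and the real tool hands the text to an LLM, not a regex.

```python
import re
import subprocess

# Matches long options such as "--threads" or "--no-mmap" that are not
# preceded by a word character or another hyphen.
LONG_FLAG = re.compile(r"(?<![\w-])--[a-z0-9][a-z0-9-]*")


def read_help(binary: str = "llama-server") -> str:
    """Run `<binary> --help` and return its combined stdout and stderr."""
    result = subprocess.run([binary, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr


def extract_long_flags(help_text: str) -> set[str]:
    """Collect every long option name that appears in the help text."""
    return set(LONG_FLAG.findall(help_text))


def new_flags(help_text: str, known: set[str]) -> list[str]:
    """Return flags present in the help text but absent from `known`, sorted."""
    return sorted(extract_long_flags(help_text) - set(known))
```

Storing the flag set seen on the previous run and calling `new_flags(read_help(), previous)` after an upgrade would list the options worth adding to the search space. A name alone says nothing about valid values, which is part of why the author leans on a model to read the descriptions.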
The comments were exactly the mix one would expect from LocalLLaMA. Some users asked for the before-and-after parameter sets rather than just throughput numbers. Others wanted ROCm or Vulkan support. Skeptics asked whether an LLM is needed at all, since a simpler search script might burn fewer tokens and be more deterministic. People who have manually tuned multi-GPU rigs were more sympathetic, especially around tensor split values that can take hours to dial in.
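The skeptics' alternative, a deterministic search with no LLM in the loop, could start from something as simple as enumerating tensor split ratios. The sketch below is a hypothetical illustration, not anything from the post: it lists every way to divide a fixed number of integer shares across the GPUs, formatted as comma-separated proportions of the kind llama.cpp's tensor split option takes (check your build's --help for the exact syntax).

```python
from itertools import product


def tensor_split_candidates(n_gpus: int, shares: int) -> list[str]:
    """List every split of `shares` into `n_gpus` positive integer parts.

    Each candidate is a comma-separated string such as "5,3,2". The order is
    stable across runs, so repeated searches visit configurations identically.
    """
    candidates = []
    for parts in product(range(1, shares + 1), repeat=n_gpus):
        if sum(parts) == shares:
            candidates.append(",".join(str(p) for p in parts))
    return candidates


# Three GPUs and ten shares give 36 candidate ratios to benchmark.
print(len(tensor_split_candidates(3, 10)))
```

Even this coarse grid means 36 model loads per model on a three-GPU rig, before any other flag is varied, which helps explain why people who have tuned such machines by hand were sympathetic to automating the search in some form.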
The result should not be read as a universal benchmark. It is one machine, one set of models, and a workflow that still needs constraints, reproducibility, and portability across backends. But the community signal is useful: local LLM performance is no longer just about downloading a better quant. The runtime configuration, hardware topology, cached launch settings, and the speed of finding a good combination are becoming part of the performance stack.