LocalLLaMA Gets Excited About an LLM That Tunes Its Own llama.cpp Flags
Original: The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
A LocalLLaMA post about an LLM tuning its own llama.cpp flags landed because it had both a funny premise and real numbers. The author’s llm-server v2 adds --ai-tune, a loop where the model reads the available llama-server options, tries configurations, and caches the fastest result it finds.
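The post does not include source, but the described loop is easy to picture. The sketch below is illustrative only: `propose` stands in for the LLM call that reads the available options and past results, and `benchmark` stands in for a timed llama-server run; neither is the author's actual code, and the cache format is an assumption.

```python
# Illustrative sketch of an --ai-tune style loop (not the author's code).
# `propose` stands in for the LLM that reads the option list and history;
# `benchmark` stands in for a timed llama-server run.
import json
from pathlib import Path

def ai_tune(help_text, propose, benchmark, rounds=10, cache=None):
    """Ask the model for flag sets, benchmark each, keep the fastest."""
    history = []                        # (flags, tok/s) results so far
    best = {"flags": "", "tok_s": 0.0}
    for _ in range(rounds):
        flags = propose(help_text, history)  # model suggests a config
        tok_s = benchmark(flags)             # measure tokens/second
        history.append({"flags": flags, "tok_s": tok_s})
        if tok_s > best["tok_s"]:
            best = {"flags": flags, "tok_s": tok_s}
    if cache is not None:
        # Persist the winner so later launches can skip the search.
        Path(cache).write_text(json.dumps(best))
    return best
```

Feeding the accumulated history back into each `propose` call is what distinguishes this from a blind search: the model can, in principle, reason about which flags moved the needle.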
The reported hardware is an unusual but characteristically LocalLLaMA setup: a 3090 Ti, a 4070, a 3060, and 128GB of system RAM. On that machine, the author reports Qwen3.5-122B moving from 4.1 tok/s on plain llama-server to 11.2 tok/s with v1 tuning and 17.47 tok/s with v2 AI tuning; Qwen3.5-27B Q4_K_M from 18.5 tok/s to 25.94 tok/s, then to 40.05 tok/s; and a gemma-4-31B UD-Q4_K_XL run from 14.2 tok/s to 24.77 tok/s.
The practical idea is to reduce the amount of tuning knowledge a user has to carry around. llama.cpp and ik_llama.cpp keep adding flags for offload, tensor splits, context handling, and MoE behavior. On multi-GPU systems, the right layer split or tensor placement can be hard to guess and easy to break when the runtime changes. By feeding llama-server --help into the loop, the author argues the tuner can notice new flags as they appear and add them to the search.
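One minimal way to do that discovery step, shown here as an assumption about how such a tuner could work rather than the author's implementation, is to scrape long-form option names out of the help text:

```python
# Hypothetical flag-discovery step (an assumption, not the author's code):
# pull long-form option names out of --help so new flags enter the search.
import re
import subprocess

def discover_flags(help_text: str) -> list[str]:
    """Extract long-form option names (--ctx-size, --tensor-split, ...)
    from --help output."""
    return sorted(set(re.findall(r"--[a-z][a-z0-9-]+", help_text)))

def llama_server_flags() -> list[str]:
    # Requires llama-server on PATH; a tuner would refresh this each run
    # so flags added in a new build show up automatically.
    out = subprocess.run(["llama-server", "--help"],
                         capture_output=True, text=True)
    return discover_flags(out.stdout)
```

The payoff is exactly the one the author argues for: when a llama.cpp update adds a new offload or MoE flag, it lands in the candidate set without anyone editing the tuner.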
The comments were exactly the mix one would expect from LocalLLaMA. Some users asked for the before-and-after parameter sets rather than just throughput numbers. Others wanted ROCm or Vulkan support. Skeptics asked whether an LLM is needed at all, since a simpler search script would burn fewer tokens and be more deterministic. People who have manually tuned multi-GPU rigs were more sympathetic, especially about tensor-split values that can take hours to dial in.
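The skeptics' alternative is easy to sketch: a plain grid search over a hand-written search space, deterministic and token-free, at the cost of someone maintaining that search space by hand. The flag values below are illustrative, not recommendations:

```python
# Deterministic grid search, the commenters' proposed alternative to an
# LLM-driven tuner. `benchmark` takes a flag string, returns tokens/sec.
import itertools

def grid_search(benchmark, space):
    """Benchmark every combination in a hand-written search space."""
    best_flags, best_tok_s = "", 0.0
    keys = list(space)
    for combo in itertools.product(*(space[k] for k in keys)):
        flags = " ".join(f"{k} {v}" for k, v in zip(keys, combo))
        tok_s = benchmark(flags)
        if tok_s > best_tok_s:
            best_flags, best_tok_s = flags, tok_s
    return best_flags, best_tok_s

# Example space using real llama-server flags; values are illustrative.
SPACE = {
    "--n-gpu-layers": [24, 32, 40],
    "--ctx-size": [4096, 8192],
}
```

The trade-off mirrors the comment thread: this never invents a flag combination no one wrote down, but it also never notices a new flag on its own.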
The result should not be read as a universal benchmark. It is one machine, one set of models, and a workflow that still needs constraints, reproducibility, and portability across backends. But the community signal is useful: local LLM performance is no longer just about downloading a better quant. The runtime configuration, hardware topology, cached launch settings, and the speed of finding a good combination are becoming part of the performance stack.