
LocalLLaMA Gets Excited About an LLM That Tunes Its Own llama.cpp Flags

Original: The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)

Apr 16, 2026 · By Insights AI (Reddit)

A LocalLLaMA post about an LLM tuning its own llama.cpp flags landed because it had both a funny premise and real numbers. The author’s llm-server v2 adds --ai-tune, a loop where the model reads the available llama-server options, tries configurations, and caches the fastest result it finds.
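
The loop described above (try configurations, keep the fastest, cache it) can be sketched roughly as follows. This is not the author's code: `tune`, `propose_next`, `fake_measure`, the cache layout, and the throughput numbers are all hypothetical, and the `propose` callback only stands in for the model choosing what to try next. `--n-gpu-layers` is an existing llama-server flag, but the values are arbitrary.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

Config = dict[str, str]       # flag name -> value
Trial = tuple[Config, float]  # (config, measured tok/s)


def tune(
    propose: Callable[[list[Trial]], Optional[Config]],
    measure: Callable[[Config], float],
    cache_file: Path,
    cache_key: str,
    max_trials: int = 8,
) -> Config:
    """Return the fastest config seen, reusing a cached winner when one exists."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if cache_key in cache:
        return cache[cache_key]["config"]  # cache hit: skip all benchmarking

    history: list[Trial] = []
    for _ in range(max_trials):
        config = propose(history)  # stand-in for the model picking the next trial
        if config is None:
            break
        history.append((config, measure(config)))
    if not history:
        raise ValueError("no configurations were tried")

    best_config, best_tps = max(history, key=lambda trial: trial[1])
    cache[cache_key] = {"config": best_config, "tok_per_s": best_tps}
    cache_file.write_text(json.dumps(cache, indent=2))
    return best_config


# Demo with made-up numbers; a real run would launch llama-server and time it.
FAKE_TPS = {"16": 10.0, "32": 14.0, "48": 12.5}


def propose_next(history: list[Trial]) -> Optional[Config]:
    """Hypothetical proposer: walk through the untried layer counts in order."""
    tried = {config["--n-gpu-layers"] for config, _ in history}
    remaining = [layers for layers in FAKE_TPS if layers not in tried]
    return {"--n-gpu-layers": remaining[0]} if remaining else None


def fake_measure(config: Config) -> float:
    """Hypothetical benchmark: look up a canned tok/s value."""
    return FAKE_TPS[config["--n-gpu-layers"]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "tune_cache.json"
    best = tune(propose_next, fake_measure, cache_path, "demo-model@demo-rig")
    print(best)  # {'--n-gpu-layers': '32'}
```

Keying the cache on something like model plus hardware is an assumption here. The point is only that a second launch with the same key can skip the search entirely.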

The reported hardware is an unusual but very LocalLLaMA setup: 3090 Ti, 4070, 3060, and 128GB RAM. On that machine, the author says Qwen3.5-122B moved from 4.1 tok/s on plain llama-server to 11.2 tok/s with v1 tuning and 17.47 tok/s with v2 AI tuning. Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 25.94 tok/s, then to 40.05 tok/s. A gemma-4-31B UD-Q4_K_XL run moved from 14.2 tok/s to 24.77 tok/s.
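
To put the reported figures on a common scale, the short script below converts each before/after pair into a ratio and a percentage gain. The tok/s values are the ones quoted above. The labels and the `gain` helper are added for illustration. By this arithmetic, the +54% in the original title appears to correspond to the v1-to-v2 step for Qwen3.5-27B.

```python
# tok/s figures as reported in the post: (before, after)
REPORTED = {
    "Qwen3.5-122B, plain -> v2": (4.1, 17.47),
    "Qwen3.5-27B Q4_K_M, plain -> v1": (18.5, 25.94),
    "Qwen3.5-27B Q4_K_M, v1 -> v2": (25.94, 40.05),
    "gemma-4-31B UD-Q4_K_XL": (14.2, 24.77),
}


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (speedup ratio, percentage increase) for a before/after pair."""
    return after / before, (after - before) / before * 100


for label, (before, after) in REPORTED.items():
    ratio, pct = gain(before, after)
    print(f"{label}: {ratio:.2f}x ({pct:+.1f}%)")
```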

The practical idea is to reduce the amount of tuning knowledge a user has to carry around. llama.cpp and ik_llama.cpp keep adding flags for offload, tensor splits, context handling, and MoE behavior. On multi-GPU systems, the right layer split or tensor placement can be hard to guess and easy to break when the runtime changes. By feeding llama-server --help into the loop, the author argues the tuner can notice new flags as they appear and add them to the search.
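
One way to picture the "notice new flags" step is a diff between the long options found in the help text and a set the tuner already knows. The sketch below is an assumption about how that could work, not the author's implementation. `SAMPLE_HELP` is a hand-written excerpt rather than verbatim llama-server output, and `--example-new-flag` is a made-up option.

```python
import re
import subprocess

# Long options such as --ctx-size; the lookbehind avoids matching inside other tokens.
FLAG_PATTERN = re.compile(r"(?<![\w-])--[a-z0-9][a-z0-9-]*")


def read_help(binary: str = "llama-server") -> str:
    """Capture the help text of the given binary (not called in this demo)."""
    result = subprocess.run([binary, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr


def extract_flags(help_text: str) -> set[str]:
    """Collect every long option name that appears in the help text."""
    return set(FLAG_PATTERN.findall(help_text))


def new_flags(help_text: str, known: set[str]) -> set[str]:
    """Return options present in the help text but absent from the known set."""
    return extract_flags(help_text) - known


# Hand-written excerpt for illustration; the last option is hypothetical.
SAMPLE_HELP = """\
-c,    --ctx-size N           size of the prompt context
-ngl,  --n-gpu-layers N       number of layers to store in VRAM
-ts,   --tensor-split N0,N1   fraction of the model to offload to each GPU
       --example-new-flag N   hypothetical option added by a later release
"""

known = {"--ctx-size", "--n-gpu-layers"}
print(sorted(new_flags(SAMPLE_HELP, known)))
```

A regex only finds option names. Deciding what a newly found flag means and which values are worth trying is the part the post hands to the model.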

The comments were exactly the mix one would expect from LocalLLaMA. Some users asked for the before-and-after parameter sets rather than just throughput numbers. Others wanted ROCm or Vulkan support. Skeptics asked whether an LLM is needed at all, since a simpler search script might burn fewer tokens and be more deterministic. People who have manually tuned multi-GPU rigs were more sympathetic, especially around tensor split values that can take hours to dial in.
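
For comparison, the "simpler search script" the skeptics have in mind might look like a plain grid enumeration. This is a sketch of that alternative, not anything posted in the thread. `grid`, `to_argv`, and the candidate values are illustrative, though `--n-gpu-layers` and `--tensor-split` are existing llama-server options.

```python
from itertools import product
from typing import Iterator

# Illustrative search space; the values are arbitrary examples.
SEARCH_SPACE = {
    "--n-gpu-layers": ["24", "32"],
    "--tensor-split": ["2,1,1", "3,2,1"],
}


def grid(search_space: dict[str, list[str]]) -> Iterator[dict[str, str]]:
    """Yield every combination of flag values in a fixed, repeatable order."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def to_argv(config: dict[str, str]) -> list[str]:
    """Flatten a config into command-line arguments."""
    argv: list[str] = []
    for flag, value in config.items():
        argv += [flag, value]
    return argv


for config in grid(SEARCH_SPACE):
    print(" ".join(to_argv(config)))
```

The enumeration is deterministic and costs no tokens, but the number of runs is the product of the option counts. That growth is why hand-tuning tensor splits across three GPUs can take hours.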

The result should not be read as a universal benchmark. It is one machine, one set of models, and a workflow that still needs constraints, reproducibility, and portability across backends. But the community signal is useful: local LLM performance is no longer just about downloading a better quant. The runtime configuration, hardware topology, cached launch settings, and the speed of finding a good combination are becoming part of the performance stack.
