LocalLLaMA warns against judging Gemma 4 too early while llama.cpp fixes are still landing
Original: Gemma 4 fixes in llama.cpp
A fresh LocalLLaMA thread argues that some early criticism of Gemma 4 has more to do with the inference stack than the model itself. The post says many users are effectively judging Gemma through buggy runtime behavior rather than stable model support. That distinction matters because local-model launches are now tightly coupled to the speed at which community runtimes, quantization pipelines, and loaders catch up.
The author links several llama.cpp pull requests tied to Gemma 4 support, including PR #21418, PR #21390, and PR #21406, and says early issues included looping behavior in chat. After updating, the same user reports zero problems in OpenCode for non-coding tasks and suggests that prompt changes may also reduce overthinking loops. The broader message is not that every Gemma 4 issue is resolved, but that the first wave of bad impressions can be contaminated by parser, tokenizer, or attention-handling bugs in the surrounding tooling.
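The thread does not describe how anyone measured the looping. As an illustration only, here is a minimal sketch of one way to flag looping output when comparing runs before and after a runtime update. The function names, the 4-word window, and the 0.5 threshold are assumptions made for this example; none of them come from the thread or from llama.cpp.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence of an n-gram beyond its first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_like_loop(text: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Heuristic flag: True when most n-grams in the output are repeats.

    The default threshold is an arbitrary illustrative value, not a tuned one.
    """
    return repeated_ngram_ratio(text, n) >= threshold
```

Running the same prompts through an old and a new runtime build and counting how many outputs trip the flag gives a rough signal of whether an update changed looping behavior, though a heuristic like this cannot say which layer caused it.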
The comments reinforce that operational view. One user says the fix is simply to update llama.cpp and reports around 60 tokens per second on a 4B model on an RTX 3070. Another says this pattern repeats almost every release: the model looks bad, inference bugs get fixed, then the model suddenly looks much better. That is a useful reminder for anyone benchmarking local models as they launch, especially when community builds are moving faster than packaged releases.
What makes this thread valuable is that it shifts the evaluation lens from leaderboard headlines to system integrity. Local LLM quality is increasingly a stack question involving weights, quantization, runtime kernels, parser correctness, memory behavior, and prompt format. A model launch can look weak if any one of those layers is broken. The LocalLLaMA discussion is essentially a call to separate model quality from tooling lag before drawing hard conclusions.
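One way to act on that separation is to record the full stack alongside every evaluation result, so a score change can be traced to the layer that changed. The sketch below is illustrative only: the `StackConfig` fields and the example values are hypothetical and cover just a few of the layers named above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical record of the layers behind one evaluation run."""
    weights: str
    quantization: str
    runtime_build: str
    prompt_format: str


def differing_layers(a: StackConfig, b: StackConfig) -> list[str]:
    """Return the names of the stack layers that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Example values are placeholders, not real model or build identifiers.
before = StackConfig("model-x", "q4", "build-old", "chat-v1")
after = StackConfig("model-x", "q4", "build-new", "chat-v1")
print(differing_layers(before, after))
```

If two runs differ only in `runtime_build`, a change in quality between them points at the tooling rather than the weights. If several layers changed at once, the comparison cannot isolate the cause.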