A Gemma 4 26B User Pushes Local Context to 245K Tokens
Original: Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !
What the post claims
This r/LocalLLaMA post, which had a score of 161 and 71 comments as of April 12, 2026, documents an aggressive long-context stress test of Gemma 4 26B A4B. The author says they packed the context with Reddit posts, documentation, and raw llama.cpp files to push VRAM usage and retrieval behavior, then checked whether the model could still recover specific details accurately. Their headline result is that the model remained usable at 245,283 of 262,144 context tokens, or roughly 94% of the configured window.
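The headline percentage is easy to check by hand. The short sketch below only redoes the arithmetic from the two token counts quoted in the post; the variable names are illustrative and nothing here comes from the author's own tooling.

```python
# Token counts as quoted in the post title.
used_tokens = 245_283
window_tokens = 262_144  # configured context size (2**18)

# Fraction of the configured window that was filled.
utilization = used_tokens / window_tokens

# Tokens left before the window is exhausted.
headroom = window_tokens - used_tokens

print(f"utilization: {utilization:.1%}")
print(f"headroom: {headroom} tokens")
```

The exact ratio is a little under 94%, so the post's figure is a rounded one, and fewer than 17K tokens of the window were left unused at that point.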
What makes the post more useful than a generic boast is that it also describes where the model broke. According to the author, once the session grew beyond 100K tokens of context, Gemma sometimes fell into self-questioning loops and kept extending its own reasoning instead of delivering a clean answer. Lowering the temperature and raising the repeat penalty to 1.17 or 1.18 reportedly improved stability, and the author says the model could then retrieve a specific user statement from the oversized context within about two to five seconds.
Practical settings shared in the thread
- The setup used a 262144 context size and 99 GPU layers.
- Sampling settings included top_p 0.95, top_k 40, min_p 0.05, and repeat_penalty 1.17.
- Batch and microbatch were both set to 512, with 2048 MB of cache RAM.
- The author says they were using the latest llama.cpp build and the newest Unsloth GGUF release available at the time.
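As a rough illustration, the settings listed above could be collected into a llama.cpp launch command along the following lines. This is a sketch under several assumptions: the post does not say whether the author ran llama-server or another llama.cpp binary, the model path is a hypothetical placeholder, and mapping "cache RAM" to the `--cache-ram` option is a guess at what the author meant. The temperature is left out because the post only says it was lowered, without giving a value.

```python
import shlex

# Values reported in the thread, keyed by the llama.cpp option
# they most plausibly correspond to (an assumption, not confirmed).
settings = {
    "--ctx-size": 262144,
    "--n-gpu-layers": 99,
    "--top-p": 0.95,
    "--top-k": 40,
    "--min-p": 0.05,
    "--repeat-penalty": 1.17,
    "--batch-size": 512,
    "--ubatch-size": 512,
    "--cache-ram": 2048,  # MiB; assumed meaning of "cache RAM"
}

# Hypothetical placeholder: the post does not give the GGUF filename.
MODEL_PATH = "path/to/gemma-4-26b-a4b.gguf"


def build_command(model_path, options):
    """Return the argument list for a llama-server invocation."""
    args = ["llama-server", "--model", model_path]
    for flag, value in options.items():
        args += [flag, str(value)]
    return args


command = build_command(MODEL_PATH, settings)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```

Keeping the values in a dictionary makes it simple to try the 1.18 repeat penalty the author also mentions, or to add a temperature once a concrete value is known.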
Why the report matters
This is still an anecdotal community report, not a formal benchmark with reproducibility guarantees. Even so, it captures the kind of operational detail local-model users care about most: where long-context behavior starts to degrade, which tuning knobs reduced looping, and how much of the advertised context window remains practically useful. In a market full of headline context numbers, those implementation notes are often more valuable than the headline itself.
Original source: r/LocalLLaMA post.