A Gemma 4 26B User Pushes Local Context to 245K Tokens

Original post title: "Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !"

LLM · Apr 12, 2026 · By Insights AI (Reddit) · 2 min read

What the post claims

This r/LocalLLaMA post, which had a score of 161 and 71 comments as of April 12, 2026, documents an aggressive long-context stress test of Gemma 4 26B A4B. The author says they packed the context with Reddit posts, documentation, and raw llama.cpp files to push VRAM usage and retrieval behavior, then checked whether the model could still recover specific details accurately. Their headline result is that the model remained usable at 245,283 of 262,144 context tokens, roughly 94% of the configured window.
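The headline percentage is easy to verify from the two numbers in the post title; a quick sanity calculation (plain Python, nothing assumed beyond those figures) confirms the rounding:

```python
# Sanity-check the post's arithmetic: 245,283 tokens used of a 262,144-token window.
used, window = 245_283, 262_144
fill = used / window * 100
print(f"{fill:.2f}%")  # prints 93.57%, which rounds to the quoted 94%
```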

What makes the post more useful than a generic boast is that it also describes where the model broke. According to the author, once the session moved beyond 100K context, Gemma sometimes fell into self-questioning loops and kept extending its own reasoning instead of delivering a clean answer. Lowering temperature and raising repeat penalty to 1.17 or 1.18 reportedly improved stability, and the author says the model could then retrieve a specific user statement from the oversized context within about two to five seconds.

Practical settings shared in the thread

  • The setup used a 262144 context size and 99 GPU layers.
  • Sampling settings included top_p 0.95, top_k 40, min_p 0.05, and repeat_penalty 1.17.
  • Batch and microbatch were both set to 512, with 2048 MB of cache RAM.
  • The author says they were using the latest llama.cpp build and the newest Unsloth GGUF release available at the time.
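The bullet points above map fairly directly onto llama.cpp's server flags. The following is a hypothetical reconstruction, not the author's actual command: the model filename is a placeholder (the post only says "newest Unsloth GGUF"), and the "2048 MB cache RAM" item has no obvious llama-server equivalent.

```shell
# Hypothetical llama-server invocation approximating the thread's settings.
# The GGUF filename is a placeholder; the post does not name the exact file.
llama-server \
  -m gemma-4-26b-a4b-Q4_K_M.gguf \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --batch-size 512 \
  --ubatch-size 512 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.05 \
  --repeat-penalty 1.17
# Note: the "2048 MB cache RAM" setting has no direct llama-server flag here;
# it may refer to a front-end option rather than a llama.cpp parameter.
```

The 1.17 repeat penalty matches the value the author says reduced the post-100K self-questioning loops; they reportedly also tried 1.18.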

Why the report matters

This is still an anecdotal community report, not a formal benchmark with reproducibility guarantees. Even so, it captures the kind of operational detail local-model users care about most: where long-context behavior starts to degrade, which tuning knobs reduced looping, and how much of the advertised context window remains practically useful. In a market full of headline context numbers, those implementation notes are often more valuable than the headline itself.

Original source: r/LocalLLaMA post.




© 2026 Insights. All rights reserved.