LocalLLaMA Benchmarks Gemma 4 31B at 256K Context on One RTX 5090
Original: Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark
A r/LocalLLaMA benchmark post drew attention because it tackled a question local-model users keep running into: how far can Gemma 4 context length be pushed on a single consumer GPU if KV cache compression is aggressive enough? The author reports running gemma-4-31B-it-UD-Q4_K_XL at a full 256K context on one RTX 5090 by using TurboQuant KV cache compression in a custom llama.cpp fork.
The setup is specific and unusually transparent. The post lists an RTX 5090 with 32GB VRAM, a Ryzen 9 9950X3D, 64GB DDR5, and a Windows 11 build based on TheTom/llama-cpp-turboquant merged with recent Gemma 4 support. The KV cache uses the turbo3 mode, described as roughly 4.5x compression versus f16. The author reports that VRAM usage at 262K context (262,144 tokens, the same window the title rounds to 256K) reached 27.7GB, leaving about 4.3GB of headroom on the card.
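The memory arithmetic behind these figures can be sketched roughly as follows. This is a minimal illustration, not the fork's actual accounting: only the 4.5x ratio and the 32GB and 27.7GB figures come from the post, while the layer count, KV head count, and head dimension are hypothetical placeholders rather than Gemma 4 31B's real configuration. The formula also assumes every layer caches the full context, which sliding-window layers would not.

```python
# Rough KV cache sizing. Only the 4.5x ratio and the 32 GB / 27.7 GB figures
# come from the post; the model dimensions are hypothetical placeholders.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to store one key and one value vector per token per layer."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = key + value
    return values_per_token * n_tokens * bits_per_value / 8

F16_BITS = 16
COMPRESSION_RATIO = 4.5                      # reported for turbo3 versus f16
TURBO3_BITS = F16_BITS / COMPRESSION_RATIO   # about 3.56 effective bits per value

# Hypothetical dimensions, NOT Gemma 4 31B's real configuration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
N_TOKENS = 256 * 1024                        # 262,144 tokens

f16_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_TOKENS, F16_BITS) / 2**30
turbo3_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_TOKENS, TURBO3_BITS) / 2**30
headroom_gb = 32.0 - 27.7                    # reported card size minus reported usage

print(f"f16 cache:    {f16_gib:.1f} GiB")
print(f"turbo3 cache: {turbo3_gib:.1f} GiB")
print(f"headroom:     {headroom_gb:.1f} GB")
```

With these placeholder dimensions, an f16 cache would need 48 GiB, more than the card holds, while a 4.5x compressed cache would need about 10.7 GiB. The real numbers depend on the model's actual architecture.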
- Prompt processing reportedly measured 3,362.71 tokens/s at 4K context and 899.55 tokens/s at 262K context.
- Token generation was reported at 61.51 tokens/s.
- The author says 256K context would be impractical on 32GB VRAM without the compressed KV cache.
- The post also documents Windows/MSVC build fixes needed to get the fork building correctly for Gemma 4.
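To put the reported throughput in wall-clock terms, here is a back-of-the-envelope sketch. It assumes the 262K prompt-processing figure can be read as an average rate over the whole prompt, which the post may define differently, and the 500-token reply length is a hypothetical example rather than anything the author measured.

```python
# Converts the reported tokens/s figures into rough wall-clock times.
# Assumption: the 262K prompt-processing rate is an average over the whole prompt.

CONTEXT_TOKENS = 256 * 1024   # 262,144 tokens
PP_AT_4K = 3362.71            # reported prompt-processing tokens/s at 4K context
PP_AT_262K = 899.55           # reported prompt-processing tokens/s at 262K context
TG = 61.51                    # reported generation tokens/s

def seconds_to_process(n_tokens, tokens_per_second):
    """Time to push n_tokens through at a constant rate."""
    return n_tokens / tokens_per_second

slowdown = PP_AT_4K / PP_AT_262K                           # long vs short context
prefill_s = seconds_to_process(CONTEXT_TOKENS, PP_AT_262K)
reply_s = seconds_to_process(500, TG)                      # hypothetical 500-token reply

print(f"prompt processing slowdown: {slowdown:.2f}x")
print(f"full-context prefill:       {prefill_s / 60:.1f} min")
print(f"500-token reply:            {reply_s:.1f} s")
```

Under that reading, filling the whole window takes a little under five minutes before the first generated token appears, which is the cost long-context users would actually feel.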
The post is valuable because it mixes benchmark data with engineering caveats. The author explicitly notes thermal throttling at 575W, ties the performance curve to quadratic attention costs during prompt processing, and distinguishes that from generation speed, which they describe as memory-bandwidth bound. There is also a low-level debugging note about a std::transform issue with GGUF bool arrays in Release builds that affected Gemma 4's sliding-window attention pattern.
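The two regimes the author contrasts can be illustrated with a toy model. This is a sketch under stated assumptions, not the author's analysis: it counts query-key pairs for full causal attention only, treats 4K as 4,096 tokens, and uses NVIDIA's published 1,792 GB/s memory bandwidth for the RTX 5090, a figure that does not appear in the post.

```python
# Toy model of the two regimes: quadratic prefill work, bandwidth-bound generation.

def causal_attention_pairs(n_tokens):
    """Query-key pairs scored by full causal attention over a prompt of n_tokens."""
    # Token i attends to itself and the i - 1 tokens before it: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2

def max_bytes_per_token(bandwidth_bytes_per_s, tokens_per_s):
    """Upper bound on bytes streamed from VRAM per generated token."""
    return bandwidth_bytes_per_s / tokens_per_s

SHORT, LONG = 4 * 1024, 256 * 1024
pair_growth = causal_attention_pairs(LONG) / causal_attention_pairs(SHORT)

ASSUMED_BANDWIDTH = 1792e9   # bytes/s; published RTX 5090 spec, not from the post
REPORTED_TG = 61.51          # generation tokens/s reported in the post
ceiling_gb = max_bytes_per_token(ASSUMED_BANDWIDTH, REPORTED_TG) / 1e9

print(f"attention pairs grow {pair_growth:.0f}x for 64x more tokens")
print(f"at most {ceiling_gb:.1f} GB can be read per generated token")
```

Going from 4K to 256K multiplies the token count by 64 but the attention pair count by roughly 4,095. Full attention is only one part of the per-token cost, and the post mentions sliding-window attention, so the measured slowdown need not track that count. The generation figure is a ceiling: at the reported speed, no more than about 29 GB of weights and cache can be streamed per token even at peak bandwidth.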
Commenters were interested but skeptical in the right way. The top replies asked how badly quality might degrade under heavy KV quantization and whether the model can still reliably retrieve long-context information with 256K tokens in the window, which is the real test for this class of optimization. That makes the thread useful beyond bragging rights: it is an example of the local LLM community pushing from "it fits" toward "it still works," while sharing enough config and failure details for others to reproduce or challenge the result.