LocalLLaMA: Gemma 4 31B at 256K Context Benchmarked on a Single RTX 5090

A benchmark post on r/LocalLLaMA drew attention because it tackled head-on a question that local-model users keep running into: with aggressive KV cache compression, how far can Gemma 4's context length be pushed on a single consumer GPU? The author reported running gemma-4-31B-it-UD-Q4_K_XL at its full 256K context on one RTX 5090, using a custom llama.cpp fork and the TurboQuant KV cache.

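The post was unusually transparent about the setup. The GPU is an RTX 5090 with 32GB of VRAM, the CPU is a Ryzen 9 9950X3D, system memory is 64GB of DDR5, and the OS is Windows 11. The build is described as the TheTom/llama-cpp-turboquant branch with the latest Gemma 4 support merged in. The KV cache ran in turbo3 mode, which the author described as roughly 4.5x compression relative to f16. At 262K context (the same 256K window, counted in units of 1,000 tokens rather than 1,024), VRAM usage was 27.7GB, leaving about 4.3GB of headroom on the card.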
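To make the 4.5x figure concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache estimate. The layer count, KV head count, and head dimension below are hypothetical placeholders chosen for illustration, not Gemma 4's actual configuration, and the formula assumes every layer caches the full context, which overstates the cost for sliding-window layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8


# HYPOTHETICAL model shape, for illustration only (not Gemma 4's real config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
N_CTX = 262_144  # 256 * 1024 tokens

F16_BITS = 16
COMPRESSION = 4.5  # ratio the post quotes for turbo3 relative to f16
turbo_bits = F16_BITS / COMPRESSION  # about 3.56 bits per cached value

f16_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, F16_BITS) / 2**30
turbo_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, turbo_bits) / 2**30

print(f"f16 KV cache:        {f16_gib:.1f} GiB")
print(f"compressed KV cache: {turbo_gib:.1f} GiB")
```

With these placeholder dimensions the f16 cache alone would exceed a 32GB card, while the compressed cache would not. The real numbers depend on the model's actual shape and on how many layers use sliding-window attention.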

Prompt processing was reported at 3,362.71 tokens/s at 4K context and 899.55 tokens/s at 262K context.
Token generation speed was 61.51 tokens/s.
The author considered 256K context effectively impossible in 32GB of VRAM without a compressed KV cache.
The post also documented the Windows/MSVC build fixes needed for Gemma 4.
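As a rough illustration of what these throughput figures mean in wall-clock terms, the sketch below converts them into prefill time. It assumes that "4K" and "262K" mean 4,096 and 262,144 tokens and that each reported rate is an average over the whole prompt. The post as summarized here confirms neither assumption.

```python
def prefill_seconds(n_tokens, tokens_per_second):
    """Time to process a prompt at a given average throughput."""
    return n_tokens / tokens_per_second


# Throughput figures from the post; token counts are ASSUMED to be powers of two.
short_s = prefill_seconds(4_096, 3362.71)
full_s = prefill_seconds(262_144, 899.55)
slowdown = 3362.71 / 899.55  # per-token throughput drop from 4K to 262K

print(f"4K prompt:   {short_s:.2f} s")
print(f"262K prompt: {full_s / 60:.1f} min")
print(f"throughput drop: {slowdown:.2f}x")
```

Under these assumptions, filling the full window takes on the order of minutes rather than seconds, which is worth keeping in mind when reading the headline context length.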

The value of the post lies not only in the benchmark numbers but in the engineering caveats that accompany them. The author acknowledged thermal throttling at the 575W power level and attributed the drop in prompt processing speed to the quadratic cost of attention. Generation speed, by contrast, was characterized as memory-bandwidth bound. The post also described a Release-build issue involving std::transform when reading a GGUF bool array, showing where Gemma 4's sliding-window attention pattern broke.
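The memory-bandwidth-bound claim can be sanity-checked with a simple ceiling: if every generated token requires streaming a fixed number of bytes from VRAM, throughput cannot exceed bandwidth divided by that byte count. The sketch below is only an illustration. The bandwidth value is an assumption taken from the RTX 5090's public spec rather than from the post, and the total VRAM in use (27.7GB) is a crude stand-in for bytes read per token, not a measurement.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on generation speed when decoding is limited by memory reads."""
    return bandwidth_gb_s / gb_read_per_token


ASSUMED_BANDWIDTH_GB_S = 1792.0  # ASSUMPTION: public spec figure, not stated in the post
ASSUMED_GB_PER_TOKEN = 27.7      # ASSUMPTION: total VRAM in use, a rough proxy only

ceiling = decode_ceiling_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, ASSUMED_GB_PER_TOKEN)
print(f"bandwidth ceiling: {ceiling:.1f} tokens/s (reported: 61.51 tokens/s)")
```

The reported generation speed falls in the same range as this crude ceiling, which is consistent with the author's explanation but does not prove it. The post as summarized does not say at which context length the 61.51 tokens/s figure was measured.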

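The comments were appropriately skeptical. The top responses asked how much quality degrades under KV quantization this aggressive, and whether the long-context recall that actually matters still holds once all 256K tokens are filled. That is what makes the thread better than a simple brag: it shows the local LLM community moving past "it fits" toward verifying "it still works", and it leaves enough configuration and failure detail for other users to reproduce or refute the result.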
