#kv-cache

LLM Reddit Apr 26, 2026 1 min read

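KV cache quantization: why Gemma 4 faltered first

LocalLLaMA reacted to this post because it broke the conventional wisdom that a q8_0 KV cache is always a safe choice. Gemma 4 lost quality far sooner than Qwen 3.6, and the thread quickly moved on to the SWA cache and the effect of long context.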

#kv-cache #quantization #gemma-4

LLM Reddit Apr 25, 2026 2 min read

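LocalLLaMA's KV cache data breaks the assumption that q8_0 is nearly free

LocalLLaMA's reaction was not about a simple numeric comparison. The post directly challenged a rule that many local-inference users had accepted as near common sense, and it showed that model-to-model differences are large, especially on the Gemma side. As of the crawl on April 25, 2026, the thread had 324 points and 58 comments.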

#kv-cache #gemma #qwen

LLM Reddit Apr 4, 2026 1 min read

LocalLLaMA post publishes a Gemma 4 31B 256K-context benchmark on a single RTX 5090

A benchmark post on `r/LocalLLaMA` claims that TurboQuant KV cache compression pushed Gemma 4 31B to a 256K context on a single RTX 5090. Notably, it presents VRAM usage, a Windows/MSVC build fix, and concerns about KV quant quality alongside the speed numbers.

#gemma4 #llama.cpp #kv-cache

LLM Reddit Apr 2, 2026 1 min read

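llama.cpp's attn-rot draws Reddit's attention: a low-cost quantization improvement

r/LocalLLaMA quickly upvoted the news that llama.cpp PR #21038 had been merged, seeing its Hadamard-based rotation of Q, K, and V as a way to bring TurboQuant-style gains with less friction. The key point is that it attaches to the existing stack without a new quantization format.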

#llama.cpp #turboquant #kv-cache

LLM Hacker News Apr 2, 2026 1 min read

Hacker News revisits the KV cache cost of long-context LLMs

Hacker News resurfaced a Future Shock article that explains the KV cache not as an abstract architecture term but as a GPU memory cost problem. The explanation traces, in a single thread, how memory design changed from GPT-2 through Llama 3, DeepSeek V3, and Gemma 3 to the Mamba family.
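To make the memory-cost framing concrete, here is a minimal sketch of the usual back-of-the-envelope estimate for a standard attention cache. The function name and the model shape below are illustrative assumptions, not figures from the article. The estimate ignores sliding-window layers, latent-attention compression, and runtime overhead. The q8_0 size assumes ggml's block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2.0):
    """Rough size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical model shape, chosen only for round numbers.
LAYERS, KV_HEADS, HEAD_DIM, TOKENS = 32, 8, 128, 8192

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, bytes_per_elem=2.0)
# Assumed q8_0 cost: 34 bytes per block of 32 values.
q8_0_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, bytes_per_elem=34 / 32)

print(f"fp16: {fp16_bytes / 2**20:.0f} MiB")
print(f"q8_0: {q8_0_bytes / 2**20:.0f} MiB")
```

Because the size grows linearly with `n_tokens`, the same shape at a 32x longer context needs 32x the memory, which is why cache quantization matters most at long context.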

#kv-cache #inference #transformers

LLM Reddit Apr 1, 2026 1 min read

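llama.cpp's attn-rot draws Reddit's attention: can it cheaply raise KV cache quantization quality?

attn-rot, which drew attention on LocalLLaMA, is a llama.cpp PR that aims to raise KV cache quantization quality by rotating Q, K, and V with a Hadamard rotation. The key point is that it can substantially reduce perplexity without creating a new format.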
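The intuition behind rotating before quantizing can be sketched on a toy vector. This is not the PR's code. It assumes a plain orthonormal Walsh-Hadamard transform and a simple symmetric 4-bit absmax quantizer, both chosen for illustration, and all function names are hypothetical. The rotation spreads a single outlier across all coordinates, so the quantization scale no longer wipes out the small values. Because the same orthonormal rotation applied to Q and K leaves their dot product unchanged, attention scores are preserved up to rounding.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    a = list(vec)
    n = len(a)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a[i], a[i + h] = a[i] + a[i + h], a[i] - a[i + h]
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in a]

def quantize_roundtrip(vec, bits=4):
    """Symmetric absmax quantization to signed integers, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax
    if scale == 0:
        return list(vec)
    return [round(v / scale) * scale for v in vec]

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy key vector with one large outlier (illustrative values).
k = [8.0, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1]
q = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.2]

plain = quantize_roundtrip(k)
# The orthonormal Hadamard matrix is its own inverse, so fwht undoes fwht.
rotated = fwht(quantize_roundtrip(fwht(k)))

err_plain = sq_error(k, plain)
err_rotated = sq_error(k, rotated)
print(f"squared error without rotation: {err_plain:.3f}")
print(f"squared error with rotation:    {err_rotated:.3f}")
print(f"q.k before and after rotation: {dot(q, k):.6f} {dot(fwht(q), fwht(k)):.6f}")
```

On this vector the rotated path has the lower reconstruction error. Real caches quantize in blocks with their own formats, so actual gains depend on the model and the quantizer.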

#llama.cpp #quantization #kv-cache

LLM Reddit Mar 29, 2026 2 min read

TurboQuant draws Reddit's attention: Google's approach to 3-bit KV cache compression without accuracy loss

A Google Research post on TurboQuant, shared on r/singularity in March 2026, drew 114 points and 18 comments. Google says the method cuts KV cache memory by at least 6x on needle-style tasks, and that it delivers training-free 3-bit cache compression and up to 8x faster attention-logit computation on H100.

#quantization #kv-cache #vector-search

LLM Reddit Mar 29, 2026 1 min read

r/LocalLLaMA distills TurboQuant's core idea: rotate, then quantize

A high-scoring r/LocalLLaMA post explained TurboQuant through the intuition of quantizing after a random rotation, rather than through polar coordinates. The linked arXiv paper claims a near-optimal distortion rate, residual QJL, and quality neutrality at 3.5 bits per channel for the KV cache.

#turboquant #quantization #kv-cache

LLM Reddit Mar 28, 2026 1 min read

TurboQuant on MLX draws r/LocalLLaMA's attention: KV cache compression approaches FP16 speed

A March 28, 2026 post that drew attention on r/LocalLLaMA documents a port of TurboQuant KV cache compression to MLX with a custom Metal kernel. The author claimed 4.6x compression and 0.98x FP16 speed for Qwen2.5-32B on an M4 Pro 48GB. The 7B numbers in the repo README are more conservative, which suggests that real gains depend heavily on the model and the integration approach.

#mlx #kv-cache #metal

LLM Reddit Mar 28, 2026 1 min read

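TurboQuant draws r/LocalLLaMA's attention: can KV cache compression lower the barrier for local LLMs?

The attention TurboQuant drew on r/LocalLLaMA resurfaced the Google Research result that compressing the KV cache to 3 bits can cut memory usage by at least 6x. The open question is how much the technique will change long-context performance and operating cost once it is integrated into real local inference stacks.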

#compression #kv-cache #quantization

LLM Reddit Mar 27, 2026 2 min read

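TurboQuant implementation draws LocalLLaMA's attention: sparse V dequant improves 32K decode by 22.8%

A LocalLLaMA self-post introduced sparse V dequant, a technique that skips V dequantization at positions where the attention weight is negligible, and claimed it improved 32K-context decode by 22.8% in a llama.cpp-based TurboQuant implementation. According to the post, on Qwen3.5-35B-A3B and an Apple M5 Max, perplexity was preserved and NIAH improved from 7/9 to 9/9.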
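The idea of skipping V dequantization can be sketched in a few lines. This is only an illustration of the general technique, not the post's kernel. The int8-plus-scale row format, the function name, and the `threshold` value are all assumptions made for the example.

```python
def sparse_v_attention(weights, v_quant, v_scales, threshold=1e-4):
    """Weighted sum of V rows that dequantizes only rows with non-negligible weight.

    weights:  post-softmax attention weights, one per cached position
    v_quant:  quantized V rows (lists of ints), one per cached position
    v_scales: per-row dequantization scales (assumed format)
    """
    dim = len(v_quant[0])
    out = [0.0] * dim
    skipped = 0
    for w, row, scale in zip(weights, v_quant, v_scales):
        if w < threshold:
            skipped += 1
            continue  # this row is never dequantized
        for j in range(dim):
            out[j] += w * (row[j] * scale)
    return out, skipped

# Toy data: the third position receives almost no attention.
weights = [0.7, 0.29995, 0.00005]
v_quant = [[10, -20], [30, 40], [127, -127]]
v_scales = [0.1, 0.1, 0.1]

sparse_out, skipped = sparse_v_attention(weights, v_quant, v_scales)
dense_out, _ = sparse_v_attention(weights, v_quant, v_scales, threshold=0.0)
max_diff = max(abs(a - b) for a, b in zip(sparse_out, dense_out))
print(f"skipped rows: {skipped}, max output difference: {max_diff:.6f}")
```

The output error from each skipped row is bounded by its attention weight times the largest magnitude in that row. At long context, where most positions receive tiny weights, skipping can save much of the dequantization work for a small error budget.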

#llm-inference #kv-cache #llama-cpp

LLM Reddit Mar 27, 2026 1 min read

RotorQuant draws LocalLLaMA's attention: rewriting KV cache compression with Clifford rotors

The Reddit thread reacted to the claim that replacing TurboQuant's dense rotation with more structured rotor math can lower kernel cost without a large loss in attention fidelity.

#rotorquant #quantization #kv-cache