LocalLLaMA paid attention because this post challenges a default assumption: a q8_0 KV cache is not “practically lossless” for every model. Gemma 4 degrades much earlier than Qwen 3.6, and the thread quickly moved on to sliding-window attention (SWA) cache behavior and long-context implications.
#kv-cache
LocalLLaMA reacted because the post did not just tweak a benchmark table. It went after a widely repeated local-inference assumption and showed that the answer changes sharply by model family, especially for Gemma. By crawl time on April 25, 2026, the thread had 324 points and 58 comments.
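For readers who have only seen q8_0 as a flag, here is a minimal NumPy sketch of a q8_0-style round trip (int8 codes plus one fp16 scale per block of 32 values, as in ggml). The outlier-heavy tensor is a made-up stand-in for why error can differ between model families; it is not a measurement of Gemma or Qwen.

```python
import numpy as np

def q8_0_roundtrip(x: np.ndarray, block: int = 32) -> np.ndarray:
    """q8_0-style quantize/dequantize: int8 codes with one fp16 scale per block of 32 values."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale = scale.astype(np.float16).astype(np.float32)   # the block scale itself is stored as fp16
    scale[scale == 0] = 1.0                               # avoid division by zero on all-zero blocks
    q = np.clip(np.round(xb / scale), -127, 127)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
smooth = rng.normal(0.0, 1.0, 4096)          # well-behaved activations: tiny rounding error
spiky = smooth.copy()
spiky[::128] *= 20.0                         # occasional outliers inflate their block's shared scale
for name, t in (("smooth", smooth), ("spiky", spiky)):
    err = np.abs(q8_0_roundtrip(t) - t).mean()
    print(f"{name}: mean abs round-trip error {err:.4f}")
```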
An r/LocalLLaMA benchmark claims Gemma 4 31B can run at 256K context on a single RTX 5090 using TurboQuant KV cache compression. The post is notable because it pairs performance numbers with detailed build notes, VRAM measurements, and community skepticism about long-context quality under heavy KV quantization.
r/LocalLLaMA is highlighting the merge of llama.cpp PR #21038, which applies a simple Hadamard-based rotation to Q, K, and V in attention as a lightweight path toward TurboQuant-like gains. The appeal is that it improves low-bit cache behavior without introducing a brand-new quantization format.
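As a rough illustration of why a rotation helps (this is not the PR's code), the sketch below applies an orthonormal Walsh-Hadamard transform along the head dimension; because the transform is orthogonal and its own inverse, rotating before quantization and un-rotating after leaves the attention math unchanged apart from quantization error, while spreading per-channel outliers so low-bit scales behave better. A power-of-two head dimension is assumed.

```python
import numpy as np

def hadamard_rotate(x: np.ndarray) -> np.ndarray:
    """Orthonormal Walsh-Hadamard transform along the last axis (length must be a power of two)."""
    d = x.shape[-1]
    y = x.astype(np.float32)
    h = 1
    while h < d:
        y = y.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.concatenate([a + b, a - b], axis=-1).reshape(*x.shape[:-1], d)
        h *= 2
    return y / np.sqrt(d)

def q4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Naive symmetric 4-bit round trip with one scale per row, for illustration only."""
    s = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(x / s), -7, 7) * s

rng = np.random.default_rng(0)
k = rng.normal(size=(1024, 128))
k[:, 5] *= 25.0                                  # one outlier channel dominates every row's scale
plain_err = np.abs(q4_roundtrip(k) - k).mean()
# The orthonormal Hadamard is symmetric, so applying it twice undoes the rotation exactly.
rot_err = np.abs(hadamard_rotate(q4_roundtrip(hadamard_rotate(k))) - k).mean()
print(f"4-bit error without rotation: {plain_err:.3f}, with rotation: {rot_err:.3f}")
```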
A Hacker News discussion is resurfacing a Future Shock explainer that makes LLM memory costs concrete in GPU bytes instead of abstract architecture jargon. The piece traces how GPT-2, Llama 3, DeepSeek V3, Gemma 3, and Mamba-style models handle context retention differently.
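The arithmetic the explainer leans on fits in a few lines: a dense-attention KV cache stores K and V for every layer, KV head, position, and channel, so bytes = 2 × layers × kv_heads × head_dim × context × bytes_per_element. The sketch below uses Llama 3 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head_dim 128); the q8_0 and ~3.5-bit effective sizes are rough figures, not exact on-device numbers. Mamba-style models sidestep this growth entirely because their state does not scale with context, which is the contrast the piece draws.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: float) -> float:
    """Per-sequence KV cache size for dense attention: K and V for every layer/head/position/channel."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

llama3_8b = dict(layers=32, kv_heads=8, head_dim=128)   # GQA: 8 KV heads, not 32 query heads
for context in (8_192, 131_072):
    for label, b in (("fp16", 2.0), ("q8_0", 1.0625), ("~3.5-bit", 0.44)):
        gib = kv_cache_bytes(**llama3_8b, context=context, bytes_per_elem=b) / 2**30
        print(f"ctx {context:>7,}  {label:>9}: {gib:6.2f} GiB")
```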
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
A March 2026 r/singularity post shared Google Research’s TurboQuant work and drew 114 points with 18 comments. Google says the method can shrink KV cache memory by at least 6x on needle tasks, quantize caches to 3 bits without training, and deliver up to 8x attention-logit speedups on H100 GPUs.
A high-scoring r/LocalLLaMA post explains TurboQuant not as a polar-coordinates trick but as random rotation before quantization. The linked arXiv paper claims near-optimal distortion rates, a residual QJL stage for inner products, and quality-neutral KV cache quantization at 3.5 bits per channel.
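A toy version of that framing (the paper's actual pipeline adds the residual QJL stage and a finer bit allocation) is sketched below: draw one random orthogonal rotation, quantize rotated keys to 3 bits, and check how well query-key inner products, i.e. attention logits, survive. The outlier channel is an assumption used to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))   # QR of a Gaussian matrix gives a random orthogonal R

def quant3(x: np.ndarray) -> np.ndarray:
    """Symmetric 3-bit round trip (levels -3..3), one scale per row; illustration only."""
    s = np.abs(x).max(axis=-1, keepdims=True) / 3.0
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(x / s), -3, 3) * s

keys = rng.normal(size=(2048, d))
keys[:, 7] *= 20.0                                     # one heavy channel, common in K/V activations
queries = rng.normal(size=(2048, d))

exact = (queries * keys).sum(-1)                                    # true attention logits q·k
naive = (queries * quant3(keys)).sum(-1)                            # quantize keys directly
rotated = ((queries @ rotation) * quant3(keys @ rotation)).sum(-1)  # rotate both sides: (qR)·(kR) = q·k

for name, approx in (("no rotation", naive), ("with rotation", rotated)):
    rel = np.abs(approx - exact).mean() / np.abs(exact).mean()
    print(f"3-bit keys, {name}: {rel:.3f} mean relative logit error")
```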
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository README's numbers for the 7B model are more conservative, underscoring how model choice and integration details shape the real payoff.
A popular r/LocalLLaMA post revived attention around Google Research’s TurboQuant by tying it directly to local inference constraints. The method’s reported 3-bit KV cache compression and 6x memory reduction make it relevant well beyond research headlines, but its practical value will depend on whether it reaches real deployment stacks.
A LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context with Qwen3.5-35B-A3B on an Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.
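The idea is easy to sketch outside llama.cpp: after softmax, most positions in a long context carry near-zero weight, so their quantized V rows never need to be dequantized or accumulated. Below is a minimal NumPy version of that thresholding; the 1e-4 cutoff and the toy attention distribution are assumptions for illustration, not the post's actual kernel logic.

```python
import numpy as np

def attend_skipping_small_weights(weights: np.ndarray, v_quant: np.ndarray,
                                  v_scale: np.ndarray, threshold: float = 1e-4) -> np.ndarray:
    """Weighted sum over values, dequantizing only rows whose attention weight clears the threshold."""
    keep = weights > threshold
    v_kept = v_quant[keep].astype(np.float32) * v_scale[keep]   # dequantize only what contributes
    return weights[keep] @ v_kept

rng = np.random.default_rng(0)
seq, head_dim = 32_768, 128
logits = rng.normal(size=seq)
logits[[5, 900, 31_000]] += 12.0                  # attention mass concentrates on a few positions
weights = np.exp(logits - logits.max())
weights /= weights.sum()

v = rng.normal(size=(seq, head_dim)).astype(np.float32)
v_scale = np.abs(v).max(axis=1, keepdims=True) / 127.0
v_quant = np.round(v / v_scale).astype(np.int8)   # int8 codes as a stand-in for a quantized V cache

full = weights @ (v_quant.astype(np.float32) * v_scale)
fast = attend_skipping_small_weights(weights, v_quant, v_scale)
print("rows dequantized:", int((weights > 1e-4).sum()), "of", seq)
print("max abs deviation from full dequant:", float(np.abs(fast - full).max()))
```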
The Reddit thread focused on a practical claim with real systems implications: replace TurboQuant's dense rotation with structured rotor math, keep attention fidelity close to the dense version, and make the kernel much cheaper on NVIDIA and Apple hardware.
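The cost argument is back-of-the-envelope arithmetic. The sketch below assumes head_dim 128 with 8 KV heads and takes "structured" to mean a butterfly/fast-Hadamard-style factorization, which is one reading of the rotor framing rather than a detail from the post.

```python
import math

head_dim = 128    # assumed per-head width
kv_heads = 8      # assumed GQA head count

dense_macs = kv_heads * head_dim * head_dim                        # full d x d rotation matrix per token
structured_ops = kv_heads * head_dim * int(math.log2(head_dim))    # log2(d) butterfly stages of adds/subs

print(f"dense rotation:      {dense_macs:,} multiply-adds per token")
print(f"structured rotation: {structured_ops:,} adds/subs per token (~{dense_macs / structured_ops:.0f}x fewer)")
```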