Google Research introduced TurboQuant on March 24, 2026, a compression approach aimed at KV-cache and vector-search bottlenecks. The post reached 491 points and 129 comments on Hacker News, reflecting how central memory efficiency has become for long-context inference.
#kv-cache
Hacker News picked up Google Research's TurboQuant because it promises 3-bit KV-cache compression without fine-tuning while targeting both vector search and long-context inference.
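To give a feel for what 3-bit quantization of key/value tensors entails, here is a minimal round-to-nearest absmax sketch in NumPy. It is not TurboQuant's algorithm; the tensor shape, per-channel scaling, and memory accounting are illustrative assumptions only.

```python
# Minimal sketch of 3-bit per-channel quantization of a KV-cache-shaped tensor.
# NOT TurboQuant's method: this only illustrates round-to-nearest 3-bit
# quantization and the memory footprint it implies.
import numpy as np

def quantize_3bit(x: np.ndarray, axis: int = -1):
    """Absmax round-to-nearest quantization to 3-bit signed levels (-4..3)."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 4.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy KV tensor: (num_heads, seq_len, head_dim), assumed shape for illustration.
kv = np.random.randn(8, 4096, 128).astype(np.float16)
q, scale = quantize_3bit(kv.astype(np.float32), axis=-1)
err = np.abs(dequantize(q, scale) - kv.astype(np.float32)).mean()

fp16_bytes = kv.size * 2
packed_bytes = kv.size * 3 / 8 + scale.size * 2  # 3 bits/value + fp16 scales
print(f"mean abs error: {err:.4f}")
print(f"fp16: {fp16_bytes/2**20:.1f} MiB -> 3-bit packed: {packed_bytes/2**20:.1f} MiB")
```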
A trending r/LocalLLaMA thread highlighted the DualPath paper on KV-cache bottlenecks in disaggregated inference systems. The arXiv abstract reports up to 1.87x offline throughput and 1.96x average online throughput gains while still meeting SLO targets.
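For context on where the bottleneck sits, the sketch below shows the generic prefill/decode disaggregation pattern: a prefill worker materializes the full-prompt KV cache and hands it off to a decode worker, and that handoff is what KV-cache-focused work in this setting tries to shrink or overlap. The queue-based transfer, function names, and model shape are illustrative assumptions, not DualPath's design.

```python
# Minimal sketch of the prefill/decode disaggregation pattern such papers target.
# The in-process queue stands in for a cross-node KV transfer; sizes are assumed.
import queue
import numpy as np

kv_transfer: "queue.Queue[tuple[int, np.ndarray]]" = queue.Queue()

def prefill_worker(request_id: int, prompt_len: int,
                   num_layers: int = 32, heads: int = 8, head_dim: int = 128) -> int:
    # Prefill computes K/V for the whole prompt once; faked here with random data.
    kv = np.random.randn(num_layers, 2, heads, prompt_len, head_dim).astype(np.float16)
    kv_transfer.put((request_id, kv))   # this handoff is the bandwidth-sensitive step
    return kv.nbytes

def decode_worker():
    request_id, kv = kv_transfer.get()  # decode reuses the transferred cache
    return request_id, kv.shape[3]      # ...and extends it token by token

sent = prefill_worker(request_id=0, prompt_len=2048)
rid, cached_tokens = decode_worker()
print(f"request {rid}: {cached_tokens} cached tokens, {sent/2**20:.0f} MiB KV transferred")
```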
A high-scoring Hacker News discussion surfaced Together AI's CDLM post, which claims up to 14.5x latency improvements for diffusion language models by combining trajectory-consistent step reduction with exact block-wise KV caching.
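As a rough illustration of the block-wise caching idea (compute K/V once for finalized blocks so only the block currently being denoised is recomputed each step), here is a toy sketch. The cache layout, projection stand-ins, and shapes are assumptions for the example, not CDLM's implementation.

```python
# Toy sketch of block-wise KV caching: finalized blocks hit the cache across
# denoising steps; only the active block is reprojected. Illustrative only.
import numpy as np

HEAD_DIM = 64
block_kv_cache: dict[int, tuple[np.ndarray, np.ndarray]] = {}

def project_kv(tokens: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Stand-in for the model's key/value projections.
    rng = np.random.default_rng(0)
    w_k = rng.standard_normal((tokens.shape[-1], HEAD_DIM))
    w_v = rng.standard_normal((tokens.shape[-1], HEAD_DIM))
    return tokens @ w_k, tokens @ w_v

def kv_for_block(block_id: int, block_tokens: np.ndarray):
    # Finalized blocks are computed once and then reused on every later step.
    if block_id not in block_kv_cache:
        block_kv_cache[block_id] = project_kv(block_tokens)
    return block_kv_cache[block_id]

# Two finalized blocks plus one active block across several denoising steps:
finalized = {0: np.random.randn(16, 256), 1: np.random.randn(16, 256)}
for step in range(4):
    past_kv = [kv_for_block(bid, toks) for bid, toks in finalized.items()]  # cached after step 0
    active_kv = project_kv(np.random.randn(16, 256))                        # recomputed each step
print(f"cached blocks: {sorted(block_kv_cache)}")
```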
A February 13, 2026 post in r/LocalLLaMA highlighted NVIDIA Dynamic Memory Sparsification (DMS), claiming up to 8x KV cache memory savings without accuracy loss. Community discussion centered on inference cost, throughput, and which claims still need verification against primary technical sources.
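To make the 8x figure concrete, the back-of-the-envelope sketch below computes fp16 KV-cache size for an assumed model shape and divides by eight. The layer/head/dimension numbers are illustrative and say nothing about how DMS itself achieves the reduction.

```python
# Back-of-the-envelope arithmetic for what an 8x KV-cache reduction means at a
# given context length, using an assumed model shape (not any specific model).
def kv_cache_bytes(seq_len: int, num_layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers keys and values; fp16 elements by default.
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

for seq_len in (8_192, 131_072):
    full = kv_cache_bytes(seq_len)
    print(f"{seq_len:>7} tokens: {full/2**30:.2f} GiB fp16 -> {full/8/2**30:.2f} GiB at 8x savings")
```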