#llm-architecture

LLM · Hacker News · Apr 2, 2026 · 2 min read

Hacker News revisits the KV cache trade-offs behind long-context LLMs

A Hacker News discussion is resurfacing a Future Shock explainer that makes LLM memory costs concrete, counting them in GPU bytes rather than abstract architecture jargon. The piece traces how GPT-2, Llama 3, DeepSeek V3, Gemma 3, and Mamba-style models retain context differently, ranging from GPT-2's full per-head key/value cache to the fixed-size recurrent state of Mamba-style models.

#kv-cache #inference #transformers
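The byte-level accounting behind this kind of explainer can be sketched with the standard KV cache size formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below is not taken from the Future Shock piece. The names `CacheShape` and `kv_cache_bytes` are hypothetical, it assumes bf16 storage (2 bytes per element), and it ignores framework overhead. It uses Llama 3 8B's published shape (32 layers, 8 KV heads of dimension 128 under grouped-query attention) and compares it with a hypothetical variant that caches all 32 heads. Models that compress the cache (DeepSeek V3's latent attention), restrict some layers to a sliding window (Gemma 3), or keep a fixed-size state (Mamba-style) need different accounting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheShape:
    """Attention dimensions that determine KV cache size (hypothetical helper)."""
    n_layers: int
    n_kv_heads: int          # heads that store their own K/V (fewer than query heads under GQA)
    head_dim: int
    bytes_per_elem: int = 2  # assumption: bf16/fp16 storage


def kv_cache_bytes(shape: CacheShape, seq_len: int, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for `seq_len` tokens across all layers."""
    # Factor 2: one key vector and one value vector per KV head, per layer, per token.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_elem
    return per_token * seq_len * batch


# Llama 3 8B: 32 layers, grouped-query attention with 8 KV heads of dim 128.
llama3_8b = CacheShape(n_layers=32, n_kv_heads=8, head_dim=128)
# Hypothetical variant: same model with plain multi-head attention (all 32 heads cached).
llama3_8b_mha = CacheShape(n_layers=32, n_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    for name, shape in [("GQA", llama3_8b), ("MHA", llama3_8b_mha)]:
        per_token_kib = kv_cache_bytes(shape, seq_len=1) / 1024
        full_gib = kv_cache_bytes(shape, seq_len=131_072) / 2**30
        print(f"{name}: {per_token_kib:.0f} KiB/token, {full_gib:.1f} GiB for 131,072 tokens")
```

With these assumptions the GQA configuration needs 128 KiB per token (16 GiB for a 131,072-token context at batch size 1), while the hypothetical MHA variant needs four times as much. That gap is the main memory saving grouped-query attention buys.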

LLM · Reddit · Mar 18, 2026 · 2 min read

r/MachineLearning highlights Attention Residuals as Kimi targets fixed-sum PreNorm bottlenecks

A Reddit thread surfaced Kimi's AttnRes paper, which argues that the fixed residual sum in PreNorm LLMs dilutes the contribution of deeper layers. It proposes an attention-based residual path, plus a block variant meant to retain the gains without a large increase in memory cost.

#kimi #llm-architecture #attention
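As a rough illustration of the contrast the thread describes, the sketch below compares a standard fixed-sum residual stream with a softmax-weighted mix of earlier layer outputs. It is a generic toy, not Kimi's AttnRes implementation. The function names, the use of a single learned query per layer (here a plain `query` argument), and RMS-normalized outputs as keys are all assumptions, and the block variant is not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rms_norm(v: Vector, eps: float = 1e-6) -> Vector:
    """Scale a vector to unit root-mean-square, similar to RMSNorm in many PreNorm LLMs."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_sum_residual(outputs: List[Vector]) -> Vector:
    """Standard residual stream: every earlier output is added with weight 1."""
    return [sum(column) for column in zip(*outputs)]


def attention_residual(outputs: List[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Toy attention-style residual (hypothetical): softmax-weighted mix of earlier outputs.

    `query` stands in for a learned per-layer vector; keys are the RMS-normalized outputs.
    """
    weights = softmax([dot(query, rms_norm(v)) for v in outputs])
    mixed = [sum(w * v[d] for w, v in zip(weights, outputs)) for d in range(len(query))]
    return mixed, weights


if __name__ == "__main__":
    # Toy stack: embedding plus 7 layer updates of equal size, alternating direction.
    outputs = [[1.0, 0.0] if i % 2 == 0 else [0.0, 1.0] for i in range(8)]
    print("fixed sum:", fixed_sum_residual(outputs))  # -> [4.0, 4.0]
    mixed, weights = attention_residual(outputs, query=[2.0, 0.0])
    print("attention weights:", [round(w, 3) for w in weights], "sum:", round(sum(weights), 6))
    print("attention mix:", [round(x, 3) for x in mixed])
```

In the fixed sum, the hidden state keeps growing with depth while each new update stays the same size, so any single layer's share shrinks. This is one way to read the dilution the paper targets. The softmax weights always sum to 1, so a layer can emphasize particular earlier outputs instead of inheriting all of them equally.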