r/MachineLearning highlights Attention Residuals as Kimi targets fixed-sum PreNorm bottlenecks

Original: [R] Attention Residuals by Kimi Team

LLM · Mar 18, 2026 · By Insights AI (Reddit) · 2 min read

A Reddit thread in r/MachineLearning pushed Kimi Team’s Attention Residuals paper into wider view, drawing 67 upvotes and 10 comments around a deceptively simple question: what if one of the weaknesses in modern LLMs is not attention itself, but the way residual paths blindly accumulate prior layer outputs? The linked arXiv paper, 2603.15031, argues that standard PreNorm residual connections sum all earlier layer outputs with fixed unit weight. Over depth, that can cause hidden-state growth and dilute the contribution of individual layers.

A different way to handle residual paths

The proposed fix is Attention Residuals (AttnRes). Instead of adding previous layer outputs with uniform weight, AttnRes lets a layer apply softmax attention over preceding states and selectively aggregate them using learned, input-dependent weights. In other words, the model does not just inherit every earlier representation equally; it can choose which prior layers matter more for the current input. The authors present this as a way to reduce the fixed-sum behavior that makes deep PreNorm stacks harder to manage.
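The contrast between the two residual styles can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dot-product scoring against an input-dependent query vector and the per-layer granularity are assumptions made for clarity, and details such as normalization and how the query is produced are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def prenorm_residual(prev_states):
    # Standard PreNorm behavior: every earlier layer output is
    # accumulated with fixed unit weight, regardless of the input.
    return np.sum(prev_states, axis=0)

def attn_residual(prev_states, query_vec, tau=1.0):
    # AttnRes-style aggregation (simplified sketch): score each earlier
    # layer's output against an input-dependent query, then take the
    # softmax-weighted sum, so the model can emphasize some layers
    # over others instead of inheriting all of them equally.
    prev = np.stack(prev_states)        # (num_layers, d)
    scores = prev @ query_vec / tau     # (num_layers,)
    weights = softmax(scores)           # learned, input-dependent in the paper
    return weights @ prev               # (d,)
```

With a zero query the softmax weights are uniform and the result collapses to the mean of the earlier states; a nonzero, input-dependent query is what lets the aggregation become selective.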

The paper also addresses the obvious systems objection: attending over every previous layer can be expensive. To keep the mechanism practical at scale, the authors introduce Block AttnRes, which works with block-level representations to reduce memory and communication overhead while preserving most of the gains of the full method. That matters because architecture ideas often look elegant on paper and then fall apart when training cost enters the room. Here, the implementation story is part of the pitch, not an afterthought.
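The block-level idea can be sketched the same way. This is an assumption-laden illustration of the general strategy, not the paper's Block AttnRes: the block size and the mean-pooling used to build each block summary are placeholders, chosen only to show how attending over blocks shrinks the number of scored entries.

```python
import numpy as np

def block_summaries(prev_states, block_size):
    # Pool each block of consecutive layer outputs into one summary
    # vector (mean-pooling is an illustrative choice, not the paper's).
    prev = np.stack(prev_states)                        # (num_layers, d)
    n_layers = prev.shape[0]
    n_blocks = (n_layers + block_size - 1) // block_size
    return np.stack([
        prev[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])                                                  # (n_blocks, d)

def block_attn_residual(prev_states, query_vec, block_size=4):
    # Attend over block summaries instead of every individual layer,
    # cutting the number of scored entries from num_layers to
    # num_layers / block_size, which is the memory/communication win.
    blocks = block_summaries(prev_states, block_size)
    scores = blocks @ query_vec
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    return weights @ blocks
```

The trade-off is visible directly in the shapes: with 48 layers and a block size of 4, the softmax runs over 12 entries rather than 48, at the cost of losing per-layer resolution inside each block.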

Why the Reddit thread mattered

The post notes that the method was integrated into Kimi Linear, using a 48B-parameter model with 3B activated parameters, and pre-trained on 1.4T tokens. According to the paper, the change improved downstream performance, made output magnitudes more uniform across depth, and produced a better gradient distribution. In the comments, one reader said the idea felt intuitive and was surprising mainly because it had not been explored more aggressively before. Another asked the classic architecture question: how much of the gain comes from the specific residual redesign, and how much might simply come from extra parameters or a different compute budget?

That exchange captures why the thread is useful. It points to a live research theme in LLM design: some of the next gains may come from revisiting the supposedly settled plumbing of transformer stacks rather than only scaling context length or post-training recipes. Residual connections are so foundational that they often disappear into the background. AttnRes brings them back into focus as an active source of model behavior, which is exactly why r/MachineLearning treated the paper as more than another incremental transformer tweak.




© 2026 Insights. All rights reserved.