r/MachineLearning highlights Attention Residuals as Kimi targets fixed-sum PreNorm bottlenecks
Original: [R] Attention Residuals by Kimi Team
A Reddit thread in r/MachineLearning pushed Kimi Team’s Attention Residuals paper into wider view, drawing 67 upvotes and 10 comments around a deceptively simple question: what if one of the weaknesses in modern LLMs is not attention itself, but the way residual paths blindly accumulate prior layer outputs? The linked arXiv paper, 2603.15031, argues that standard PreNorm residual connections sum all earlier layer outputs with fixed unit weight. Over depth, that can cause hidden-state growth and dilute the contribution of individual layers.
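For reference, here is a minimal PyTorch-style sketch of the standard PreNorm residual stream the paper critiques: every layer's output is added back with a fixed weight of 1, so the hidden state is just the running sum of all earlier contributions. The module and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Standard PreNorm residual block: h <- h + f(norm(h))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in for the attention + MLP sublayers of a real transformer block.
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The layer output is added with fixed unit weight, so after L layers the
        # hidden state is h_0 plus the plain sum of every earlier layer's output.
        return h + self.f(self.norm(h))
```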
A different way to handle residual paths
The proposed fix is Attention Residuals (AttnRes). Instead of adding previous layer outputs with uniform weight, AttnRes lets a layer apply softmax attention over preceding states and selectively aggregate them using learned, input-dependent weights. In other words, the model does not just inherit every earlier representation equally; it can choose which prior layers matter more for the current input. The authors present this as a way to reduce the fixed-sum behavior that makes deep PreNorm stacks harder to manage.
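The following sketch illustrates the general idea of input-dependent aggregation over earlier layers. It is not the Kimi Team's implementation: the query/key projections, the per-token aggregation, and all names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerAttnResidual(nn.Module):
    """Illustrative residual that attends over the outputs of preceding layers
    instead of summing them with fixed unit weight (assumption-laden sketch)."""
    def __init__(self, d_model: int, d_attn: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_attn)  # query from the current state
        self.k_proj = nn.Linear(d_model, d_attn)  # keys from each earlier layer's output
        self.scale = d_attn ** -0.5

    def forward(self, current: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history: outputs of all preceding layers, each of shape [batch, seq, d_model]
        hist = torch.stack(history, dim=2)                 # [batch, seq, n_layers, d_model]
        q = self.q_proj(current).unsqueeze(2)              # [batch, seq, 1, d_attn]
        k = self.k_proj(hist)                              # [batch, seq, n_layers, d_attn]
        scores = (q * k).sum(-1) * self.scale              # [batch, seq, n_layers]
        weights = F.softmax(scores, dim=-1)                # input-dependent layer weights
        mixed = (weights.unsqueeze(-1) * hist).sum(dim=2)  # weighted sum over layers
        return current + mixed                             # residual with learned weights
```

In a full stack, each layer would append its output to `history` before the next layer runs, so later layers can re-weight everything that came before them instead of inheriting a fixed sum.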
The paper also addresses the obvious systems objection: attending over every previous layer can be expensive. To keep the mechanism practical at scale, the authors introduce Block AttnRes, which works with block-level representations to reduce memory and communication overhead while preserving most of the gains of the full method. That matters because architecture ideas often look elegant on paper and then fall apart when training cost enters the room. Here, the implementation story is part of the pitch, not an afterthought.
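The post does not spell out how the block-level variant is constructed, so the sketch below is only one plausible reading: consecutive layers are grouped into blocks, each block contributes a single pooled summary, and the attention above runs over those summaries instead of every individual layer. The mean pooling and the block size are assumptions, not details from the paper.

```python
import torch

def block_summaries(history: list[torch.Tensor], block_size: int = 4) -> list[torch.Tensor]:
    """Pool consecutive layer outputs into block-level summaries (illustrative only).

    Attending over these summaries rather than every layer keeps the number of
    keys small as depth grows, which is the memory/communication motivation the
    post attributes to Block AttnRes."""
    summaries = []
    for start in range(0, len(history), block_size):
        block = history[start:start + block_size]
        summaries.append(torch.stack(block, dim=0).mean(dim=0))  # mean-pool the block
    return summaries
```

Under this reading, the `LayerAttnResidual` sketch above would simply be called with `block_summaries(history)` in place of the full layer history.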
Why the Reddit thread mattered
The post notes that the method was integrated into Kimi Linear, using a 48B-parameter model with 3B activated parameters pre-trained on 1.4T tokens. According to the paper, the change improved downstream performance, made output magnitudes more uniform across depth, and produced a better gradient distribution. In the comments, one reader said the idea felt intuitive and was surprising mainly because it had not been explored more aggressively before. Another asked the classic architecture question: how much of the gain comes from the specific residual redesign, and how much might simply come from extra parameters or a different compute budget?
That exchange captures why the thread is useful. It points to a live research theme in LLM design: some of the next gains may come from revisiting the supposedly settled plumbing of transformer stacks rather than only scaling context length or post-training recipes. Residual connections are so foundational that they often disappear into the background. AttnRes brings them back into focus as an active source of model behavior, which is exactly why r/MachineLearning treated the paper as more than another incremental transformer tweak.
Related Articles
A reviewer in r/MachineLearning says an ICML paper in a no-LLM track reads as if it was fully generated by AI, opening a blunt discussion about enforcement, review burden, and whether writing quality itself has become a policy signal.
A post in r/MachineLearning argues that duplicating a specific seven-layer block inside Qwen2-72B improved benchmark performance without changing any weights.
OpenAI Developers published a March 11, 2026 engineering write-up explaining how the Responses API uses a hosted computer environment for long-running agent workflows. The post centers on shell execution, hosted containers, controlled network access, reusable skills, and native compaction for context management.