r/MachineLearning highlights Attention Residuals as Kimi targets fixed-sum PreNorm bottlenecks

Original: [R] Attention Residuals by Kimi Team

LLM · Mar 18, 2026 · By Insights AI (Reddit) · 2 min read

A Reddit thread in r/MachineLearning pushed Kimi Team’s Attention Residuals paper into wider view, drawing 67 upvotes and 10 comments around a deceptively simple question: what if one of the weaknesses in modern LLMs is not attention itself, but the way residual paths blindly accumulate prior layer outputs? The linked arXiv paper, 2603.15031, argues that standard PreNorm residual connections sum all earlier layer outputs with fixed unit weight. Over depth, that can cause hidden-state growth and dilute the contribution of individual layers.

A different way to handle residual paths

The proposed fix is Attention Residuals (AttnRes). Instead of adding previous layer outputs with uniform weight, AttnRes lets a layer apply softmax attention over preceding states and selectively aggregate them using learned, input-dependent weights. In other words, the model does not just inherit every earlier representation equally; it can choose which prior layers matter more for the current input. The authors present this as a way to reduce the fixed-sum behavior that makes deep PreNorm stacks harder to manage.
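The contrast between the two residual styles can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dot-product scoring against an input-dependent query vector and the per-layer granularity are assumptions made for clarity, and details such as normalization and how the query is produced are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def prenorm_residual(prev_states):
    # Standard PreNorm behavior: every earlier layer output is
    # accumulated with fixed unit weight, regardless of the input.
    return np.sum(prev_states, axis=0)

def attn_residual(prev_states, query_vec, tau=1.0):
    # AttnRes-style aggregation (simplified sketch): score each earlier
    # layer's output against an input-dependent query, then take the
    # softmax-weighted sum, so the model can emphasize some layers
    # over others instead of inheriting all of them equally.
    prev = np.stack(prev_states)        # (num_layers, d)
    scores = prev @ query_vec / tau     # (num_layers,)
    weights = softmax(scores)           # learned, input-dependent in the paper
    return weights @ prev               # (d,)
```

With a zero query the softmax weights are uniform and the result collapses to the mean of the earlier states; a nonzero, input-dependent query is what lets the aggregation become selective.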

The paper also addresses the obvious systems objection: attending over every previous layer can be expensive. To keep the mechanism practical at scale, the authors introduce Block AttnRes, which works with block-level representations to reduce memory and communication overhead while preserving most of the gains of the full method. That matters because architecture ideas often look elegant on paper and then fall apart when training cost enters the room. Here, the implementation story is part of the pitch, not an afterthought.
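The block-level idea can be sketched the same way. This is an assumption-laden illustration of the general strategy, not the paper's Block AttnRes: the block size and the mean-pooling used to build each block summary are placeholders, chosen only to show how attending over blocks shrinks the number of scored entries.

```python
import numpy as np

def block_summaries(prev_states, block_size):
    # Pool each block of consecutive layer outputs into one summary
    # vector (mean-pooling is an illustrative choice, not the paper's).
    prev = np.stack(prev_states)                        # (num_layers, d)
    n_layers = prev.shape[0]
    n_blocks = (n_layers + block_size - 1) // block_size
    return np.stack([
        prev[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])                                                  # (n_blocks, d)

def block_attn_residual(prev_states, query_vec, block_size=4):
    # Attend over block summaries instead of every individual layer,
    # cutting the number of scored entries from num_layers to
    # num_layers / block_size, which is the memory/communication win.
    blocks = block_summaries(prev_states, block_size)
    scores = blocks @ query_vec
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    return weights @ blocks
```

The trade-off is visible directly in the shapes: with 48 layers and a block size of 4, the softmax runs over 12 entries rather than 48, at the cost of losing per-layer resolution inside each block.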

Why the Reddit thread mattered

The post notes that the method was integrated into Kimi Linear, using a 48B-parameter model with 3B activated parameters, and pre-trained on 1.4T tokens. According to the paper, the change improved downstream performance, made output magnitudes more uniform across depth, and produced a better gradient distribution. In the comments, one reader said the idea felt intuitive and was surprising mainly because it had not been explored more aggressively before. Another asked the classic architecture question: how much of the gain comes from the specific residual redesign, and how much might simply come from extra parameters or a different compute budget?

That exchange captures why the thread is useful. It points to a live research theme in LLM design: some of the next gains may come from revisiting the supposedly settled plumbing of transformer stacks rather than only scaling context length or post-training recipes. Residual connections are so foundational that they often disappear into the background. AttnRes brings them back into focus as an active source of model behavior, which is exactly why r/MachineLearning treated the paper as more than another incremental transformer tweak.




© 2026 Insights. All rights reserved.