Hacker News Tracks Moonshot AI’s Attention Residuals as a Drop-In Upgrade for Transformer Depth
Hacker News pushed the March 20, 2026 submission for Attention Residuals to 114 points. The thread is smaller than those for mainstream launches, but the topic hits a recurrent HN nerve: an architectural change that looks incremental on paper and then turns out to have system-level consequences for large language models.
The paper and official repo start from a familiar PreNorm complaint: standard residual connections accumulate previous-layer outputs with fixed, unit weights. As models get deeper, hidden-state magnitudes grow and each individual layer's contribution is diluted. Attention Residuals, or AttnRes, swaps that fixed accumulation for softmax attention over earlier layer outputs, letting each layer choose what to reuse based on the current input.
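The core idea can be sketched in a few lines. This is a minimal, hypothetical parameterization, not Moonshot AI's actual implementation: the query/key projections `Wq` and `Wk` and the per-layer attention layout are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(h_current, past_states, Wq, Wk):
    """Sketch of an attention-based residual path.

    Instead of summing earlier layer outputs with fixed unit
    weights, form a query from the current layer's output and
    attend over all earlier layers' outputs.

    h_current:   (d,)    output of the current layer
    past_states: (L, d)  outputs of the L earlier layers
    Wq, Wk:      (d, d)  hypothetical query/key projections
    """
    q = Wq @ h_current                      # query from current layer
    keys = past_states @ Wk.T               # one key per earlier layer
    scores = keys @ q / np.sqrt(len(q))     # scaled dot-product scores
    weights = softmax(scores)               # (L,) learned mixing weights
    residual = weights @ past_states        # input-dependent reuse of depth
    return h_current + residual
```

A plain residual stream corresponds to replacing `weights` with all-ones; here each layer's contribution is re-weighted per input, which is what the paper argues prevents deeper layers from being diluted.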
Why the community paid attention
The proposal is not just “more attention everywhere.” The authors also describe Block AttnRes, which groups layers into blocks and applies attention over block-level representations instead of over every prior layer. That reduces the memory burden from O(Ld) to O(Nd), where L is the number of layers, N the much smaller number of blocks, and d the hidden dimension, and it makes the method practical enough to consider as a drop-in replacement rather than a research curiosity.
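The block variant can be sketched the same way. The mean-pooled block summary below is an assumption for illustration; the paper may use a different block representation, and the projection-free scoring is likewise simplified.

```python
import numpy as np

def block_attn_residual(h_current, past_states, block_size):
    """Sketch of Block AttnRes: pool the L earlier layer outputs
    into N = ceil(L / block_size) block summaries, then attend over
    those N summaries instead of all L layers. Only the N block
    vectors need to be cached, shrinking memory from O(L*d) to O(N*d).

    h_current:   (d,)    output of the current layer
    past_states: (L, d)  outputs of the L earlier layers
    """
    L, d = past_states.shape
    n_blocks = (L + block_size - 1) // block_size
    # One summary vector per block (mean pooling is an assumption here).
    blocks = np.stack([
        past_states[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    scores = blocks @ h_current / np.sqrt(d)   # (N,) scaled scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over N blocks
    return h_current + weights @ blocks
```

With, say, L = 48 layers and block_size = 8, attention runs over N = 6 cached vectors instead of 48, which is where the O(Ld) to O(Nd) saving comes from.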
- Scaling-law runs reported consistent gains across model sizes and compute budgets.
- The repo says Block AttnRes can match the loss of a baseline trained with 1.25x more compute.
- In a Kimi Linear model with 48B total parameters and 3B activated parameters trained on 1.4T tokens, reported scores improved from 73.5 to 74.6 on MMLU, 36.9 to 44.4 on GPQA-Diamond, and 59.1 to 62.2 on HumanEval.
That is the sort of result HN readers tend to take seriously: not a vague claim about “better reasoning,” but a concrete change to how depth is aggregated, paired with efficiency work and benchmark deltas that can be challenged or reproduced. If AttnRes holds up beyond Moonshot AI’s stack, it could reopen discussion about residual design in future Transformer and linear-attention models.
Sources: Hacker News thread, official repo, arXiv paper.