Hacker News Tracks Moonshot AI’s Attention Residuals as a Drop-In Upgrade for Transformer Depth
Hacker News pushed the March 20, 2026 submission for Attention Residuals to 114 points. The thread is smaller than a mainstream launch, but the topic hits a recurrent HN nerve: a simple architectural change that looks incremental on paper and then turns out to have system-level consequences for large language models.
The paper and official repo start from a familiar PreNorm complaint. Standard residual connections accumulate previous layers' outputs with fixed unit weights. As models get deeper, hidden-state magnitudes grow and each individual layer's contribution gets diluted. Attention Residuals, or AttnRes, swaps that fixed accumulation for softmax attention over earlier layer outputs, letting each layer choose what to reuse based on the current input.
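A minimal sketch of the idea helps make the contrast concrete. The exact query/key parameterization below is an assumption for illustration, not the paper's: the point is that the residual update becomes a softmax-weighted combination of earlier layer outputs rather than a fixed unit-weight sum.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fixed_residual(layer_outputs):
    # Standard residual stream: every prior output added with weight 1.
    d = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(d)]

def attn_residual(query, layer_outputs):
    # AttnRes-style aggregation (sketch): score each earlier layer
    # output against the current hidden state, then mix with softmax
    # weights, so the layer chooses what to reuse.
    d = len(query)
    scale = math.sqrt(d)
    scores = [sum(q * h[i] for i, q in enumerate(query)) / scale
              for h in layer_outputs]
    weights = softmax(scores)
    return [sum(w * h[i] for w, h in zip(weights, layer_outputs))
            for i in range(d)]
```

With a query aligned to one earlier output, that output dominates the mix; the fixed-weight residual has no such input-dependent selectivity, which is the dilution complaint above.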
Why the community paid attention
The proposal is not just “more attention everywhere.” The authors also describe Block AttnRes, which groups layers into blocks and applies attention over block-level representations instead of over every prior layer. That reduces the memory burden from O(Ld) to O(Nd), where L is the layer count, N the much smaller block count, and d the hidden dimension, and makes the method practical enough to consider as a drop-in replacement rather than a research curiosity.
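The block-level variant can be sketched the same way. The mean-pooling used here to build block representations is an assumption (the repo may construct them differently); what matters is that attention runs over N block summaries rather than all L layer outputs, so the cached state shrinks from O(Ld) to O(Nd).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def block_attn_residual(query, layer_outputs, block_size):
    # Group consecutive layer outputs into blocks and mean-pool each
    # group into one representation (pooling choice is an assumption).
    d = len(query)
    blocks = []
    for i in range(0, len(layer_outputs), block_size):
        group = layer_outputs[i:i + block_size]
        blocks.append([sum(h[j] for h in group) / len(group)
                       for j in range(d)])
    # Softmax attention over N block summaries instead of L layer
    # outputs: the state kept per token drops from L*d to N*d values.
    scale = math.sqrt(d)
    scores = [sum(q * b[j] for j, q in enumerate(query)) / scale
              for b in blocks]
    weights = softmax(scores)
    return [sum(w * b[j] for w, b in zip(weights, blocks))
            for j in range(d)]
```

For a 48-layer model grouped into blocks of 8, attention runs over 6 summaries instead of 48 outputs, an 8x reduction in cached state per token under this sketch.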
- Scaling-law runs reported consistent gains across model sizes and compute budgets.
- The repo says Block AttnRes can match the loss of a baseline trained with 1.25x more compute.
- In a Kimi Linear model with 48B total parameters and 3B activated parameters trained on 1.4T tokens, reported scores improved from 73.5 to 74.6 on MMLU, 36.9 to 44.4 on GPQA-Diamond, and 59.1 to 62.2 on HumanEval.
That is the sort of result HN readers tend to take seriously: not a vague claim about “better reasoning,” but a concrete change to how depth is aggregated, paired with efficiency work and benchmark deltas that can be challenged or reproduced. If AttnRes holds up beyond Moonshot AI’s stack, it could reopen discussion about residual design in future Transformer and linear-attention models.
Sources: Hacker News thread, official repo, arXiv paper.