Hacker News Tracks Moonshot AI’s Attention Residuals as a Drop-In Upgrade for Transformer Depth
Hacker News pushed the March 20, 2026 submission for Attention Residuals to 114 points. The thread is smaller than a mainstream launch, but the topic hits a recurrent HN nerve: a simple architectural change that looks incremental on paper and then turns out to have system-level consequences for large language models.
The paper and official repo start from a familiar PreNorm complaint. Standard residual connections accumulate previous layers' outputs with fixed unit weights. As models get deeper, hidden-state magnitudes grow and each individual layer's contribution gets diluted. Attention Residuals, or AttnRes, swaps that fixed accumulation for softmax attention over earlier layer outputs, letting each layer choose what to reuse based on the current input.
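A minimal sketch of the idea helps make the contrast concrete. The exact query/key parameterization below is an assumption for illustration, not the paper's: the point is that the residual update becomes a softmax-weighted combination of earlier layer outputs rather than a fixed unit-weight sum.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fixed_residual(layer_outputs):
    # Standard residual stream: every prior output added with weight 1.
    d = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(d)]

def attn_residual(query, layer_outputs):
    # AttnRes-style aggregation (sketch): score each earlier layer
    # output against the current hidden state, then mix with softmax
    # weights, so the layer chooses what to reuse.
    d = len(query)
    scale = math.sqrt(d)
    scores = [sum(q * h[i] for i, q in enumerate(query)) / scale
              for h in layer_outputs]
    weights = softmax(scores)
    return [sum(w * h[i] for w, h in zip(weights, layer_outputs))
            for i in range(d)]
```

With a query aligned to one earlier output, that output dominates the mix; the fixed-weight residual has no such input-dependent selectivity, which is the dilution complaint above.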
Why the community paid attention
The proposal is not just “more attention everywhere.” The authors also describe Block AttnRes, which groups layers into blocks and applies attention over block-level representations instead of over every prior layer. That reduces the memory burden from O(Ld) to O(Nd), where L is the layer count, N the much smaller block count, and d the hidden dimension, and makes the method practical enough to consider as a drop-in replacement rather than a research curiosity.
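The block-level variant can be sketched the same way. The mean-pooling used here to build block representations is an assumption (the repo may construct them differently); what matters is that attention runs over N block summaries rather than all L layer outputs, so the cached state shrinks from O(Ld) to O(Nd).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def block_attn_residual(query, layer_outputs, block_size):
    # Group consecutive layer outputs into blocks and mean-pool each
    # group into one representation (pooling choice is an assumption).
    d = len(query)
    blocks = []
    for i in range(0, len(layer_outputs), block_size):
        group = layer_outputs[i:i + block_size]
        blocks.append([sum(h[j] for h in group) / len(group)
                       for j in range(d)])
    # Softmax attention over N block summaries instead of L layer
    # outputs: the state kept per token drops from L*d to N*d values.
    scale = math.sqrt(d)
    scores = [sum(q * b[j] for j, q in enumerate(query)) / scale
              for b in blocks]
    weights = softmax(scores)
    return [sum(w * b[j] for w, b in zip(weights, blocks))
            for j in range(d)]
```

For a 48-layer model grouped into blocks of 8, attention runs over 6 summaries instead of 48 outputs, an 8x reduction in cached state per token under this sketch.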
- Scaling-law runs reported consistent gains across model sizes and compute budgets.
- The repo says Block AttnRes can match the loss of a baseline trained with 1.25x more compute.
- In a Kimi Linear model with 48B total parameters and 3B activated parameters trained on 1.4T tokens, reported scores improved from 73.5 to 74.6 on MMLU, 36.9 to 44.4 on GPQA-Diamond, and 59.1 to 62.2 on HumanEval.
That is the sort of result HN readers tend to take seriously: not a vague claim about “better reasoning,” but a concrete change to how depth is aggregated, paired with efficiency work and benchmark deltas that can be challenged or reproduced. If AttnRes holds up beyond Moonshot AI’s stack, it could reopen discussion about residual design in future Transformer and linear-attention models.
Sources: Hacker News thread, official repo, arXiv paper.