Hacker News Tracks Moonshot AI’s Attention Residuals as a Drop-In Upgrade for Transformer Depth
Hacker News pushed the March 20, 2026 submission for Attention Residuals to 114 points. The thread is smaller than those for mainstream launches, but the topic hits a recurrent HN nerve: an architectural change that looks incremental on paper and then turns out to have system-level consequences for large language models.
The paper and official repo start from a familiar PreNorm complaint: standard residual connections accumulate previous-layer outputs with fixed, unit weights. As models get deeper, hidden-state magnitudes grow and each individual layer's contribution is diluted. Attention Residuals, or AttnRes, swaps that fixed accumulation for softmax attention over earlier layer outputs, letting each layer choose what to reuse based on the current input.
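The core idea can be sketched in a few lines. This is a minimal, hypothetical parameterization, not Moonshot AI's actual implementation: the query/key projections `Wq` and `Wk` and the per-layer attention layout are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(h_current, past_states, Wq, Wk):
    """Sketch of an attention-based residual path.

    Instead of summing earlier layer outputs with fixed unit
    weights, form a query from the current layer's output and
    attend over all earlier layers' outputs.

    h_current:   (d,)    output of the current layer
    past_states: (L, d)  outputs of the L earlier layers
    Wq, Wk:      (d, d)  hypothetical query/key projections
    """
    q = Wq @ h_current                      # query from current layer
    keys = past_states @ Wk.T               # one key per earlier layer
    scores = keys @ q / np.sqrt(len(q))     # scaled dot-product scores
    weights = softmax(scores)               # (L,) learned mixing weights
    residual = weights @ past_states        # input-dependent reuse of depth
    return h_current + residual
```

A plain residual stream corresponds to replacing `weights` with all-ones; here each layer's contribution is re-weighted per input, which is what the paper argues prevents deeper layers from being diluted.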
Why the community paid attention
The proposal is not just “more attention everywhere.” The authors also describe Block AttnRes, which groups layers into blocks and applies attention over block-level representations instead of over every prior layer. That reduces the memory burden from O(Ld) to O(Nd), where L is the number of layers, N the much smaller number of blocks, and d the hidden dimension, and it makes the method practical enough to consider as a drop-in replacement rather than a research curiosity.
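The block variant can be sketched the same way. The mean-pooled block summary below is an assumption for illustration; the paper may use a different block representation, and the projection-free scoring is likewise simplified.

```python
import numpy as np

def block_attn_residual(h_current, past_states, block_size):
    """Sketch of Block AttnRes: pool the L earlier layer outputs
    into N = ceil(L / block_size) block summaries, then attend over
    those N summaries instead of all L layers. Only the N block
    vectors need to be cached, shrinking memory from O(L*d) to O(N*d).

    h_current:   (d,)    output of the current layer
    past_states: (L, d)  outputs of the L earlier layers
    """
    L, d = past_states.shape
    n_blocks = (L + block_size - 1) // block_size
    # One summary vector per block (mean pooling is an assumption here).
    blocks = np.stack([
        past_states[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    scores = blocks @ h_current / np.sqrt(d)   # (N,) scaled scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over N blocks
    return h_current + weights @ blocks
```

With, say, L = 48 layers and block_size = 8, attention runs over N = 6 cached vectors instead of 48, which is where the O(Ld) to O(Nd) saving comes from.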
- Scaling-law runs reported consistent gains across model sizes and compute budgets.
- The repo says Block AttnRes can match the loss of a baseline trained with 1.25x more compute.
- In a Kimi Linear model with 48B total parameters and 3B activated parameters trained on 1.4T tokens, reported scores improved from 73.5 to 74.6 on MMLU, 36.9 to 44.4 on GPQA-Diamond, and 59.1 to 62.2 on HumanEval.
That is the sort of result HN readers tend to take seriously: not a vague claim about “better reasoning,” but a concrete change to how depth is aggregated, paired with efficiency work and benchmark deltas that can be challenged or reproduced. If AttnRes holds up beyond Moonshot AI’s stack, it could reopen discussion about residual design in future Transformer and linear-attention models.
Sources: Hacker News thread, official repo, arXiv paper.