r/singularity highlights a paper arguing the LM head wastes most of the training signal
Original: Lost in Backpropagation: The LM Head is a Gradient Bottleneck | Researchers may have found a fundamental inefficiency baked into every major LLM
A Reddit thread in r/singularity surfaced an unusually technical paper for a general AI community: arXiv:2603.10145, Lost in Backpropagation: The LM Head is a Gradient Bottleneck. The paper argues that the output layer of neural language models is not just the familiar softmax expressivity bottleneck. It may also be an optimization bottleneck that quietly wastes a large share of the training signal before it ever reaches the rest of the model.
The core setup is simple. A language model's output layer projects hidden states of dimension D onto logits over a vocabulary of size V, where D is much smaller than V. The authors argue that when gradients are backpropagated through that rank-D layer, the V-dimensional error signal is unavoidably compressed into D dimensions. In the abstract, they report that 95 to 99 percent of the gradient norm is suppressed by the output layer, pushing training updates away from the directions that would be most informative. That turns a long-discussed architectural quirk into a much more serious claim about optimization efficiency.
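The dimensional argument can be made concrete with a back-of-envelope sketch. The following toy snippet (not from the paper; the sizes and the random stand-in gradient are assumptions) measures how much of a V-dimensional logit gradient can even reach the hidden state through a rank-D head:

```python
import torch

# Toy illustration, not the paper's setup: D and V are hypothetical,
# and a random vector stands in for the cross-entropy logit gradient.
D, V = 256, 32_000
torch.manual_seed(0)

W = torch.randn(V, D) / D**0.5       # LM head weight: logits = W @ h

# The gradient of cross-entropy w.r.t. the logits (softmax probabilities
# minus the one-hot target) is a V-dimensional vector.
g_logits = torch.randn(V)

# Backprop hands the hidden state only W.T @ g_logits, so any component
# of g_logits orthogonal to col(W) never influences the rest of the model.
Q, _ = torch.linalg.qr(W)            # orthonormal basis for col(W)
g_visible = Q @ (Q.T @ g_logits)     # the part of the signal the head can pass on

retained = g_visible.norm() / g_logits.norm()
print(f"gradient norm retained: {retained:.1%}")  # roughly sqrt(D/V), ~8.9% here
```

For a random direction in R^V, the expected retained norm is about sqrt(D/V), under 10 percent at these toy sizes, which is at least consistent in spirit with the 95 to 99 percent suppression figure quoted from the abstract.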
The paper goes further than theory. According to the abstract, the authors run controlled experiments showing that as vocabulary size grows, the bottleneck can make even trivial patterns hard to learn. They also report materially slower convergence in realistic pretraining runs at the 2B-parameter scale. Their conclusion is that current language models may be training less efficiently than they could, independently of the wider architecture, simply because the last layer is throwing away too much useful supervision signal.
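A rough home-grown probe of that scaling claim might look like the sketch below (a minimal setup of my own under stated assumptions, not the authors' controlled experiment): train embeddings and a rank-D head on the trivial task "predict the token you just saw" with a fixed optimization budget, and watch how accuracy changes as the vocabulary grows.

```python
import torch
import torch.nn as nn

# Toy probe, not the paper's experiment: a trivial identity (copy) task
# trained through a rank-D output head with a fixed step budget. All
# sizes and hyperparameters here are hypothetical stand-ins.
def copy_task_accuracy(V, D=64, steps=200, lr=1e-2):
    torch.manual_seed(0)
    emb = nn.Embedding(V, D)
    head = nn.Linear(D, V, bias=False)   # rank-D bottleneck into V logits
    opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    tokens = torch.arange(V)             # full batch: every token, every step
    for _ in range(steps):
        loss = loss_fn(head(emb(tokens)), tokens)  # target = the input token
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (head(emb(tokens)).argmax(-1) == tokens).float().mean().item()

for V in (256, 1024, 4096):
    print(f"V={V:>5}: copy accuracy after fixed budget = {copy_task_accuracy(V):.1%}")
```

If the paper's claim holds, accuracy after the same budget should fall as V grows even though the target mapping is the identity; treat the exact numbers as illustrative only.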
Reddit readers focused on exactly that implication. The top comment highlighted the paper's conclusion that the softmax bottleneck is not only about expressivity but about losing most of the supervision signal during backpropagation. Others jumped quickly to alternatives, pointing at latent-space generation ideas or other nonstandard output schemes as possible ways around the problem. Even in a short thread, the reaction was notable: people treated the LM head as a potentially under-discussed systems bottleneck rather than just a mathematical footnote.
If the result holds up, it matters beyond academic debate. A lot of current LLM progress still comes from scaling data, compute, and model size. This paper suggests another lever may be hiding in plain sight: changing how models project hidden states into vocabulary logits, and how gradients flow back through that interface. Source: arXiv:2603.10145. Community discussion: r/singularity.
Related Articles
MegaTrain proposes training 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host memory and streaming layers through the device. The recent Hacker News interest is notable because the paper reframes the problem as one of memory-system design rather than simple GPU count.
A Hacker News discussion focused on SkyPilot's argument that coding agents work better when they read papers and competing implementations before editing code. In the reported llama.cpp experiments, that research-first loop produced 5 viable optimizations and improved TinyLlama text generation by 15% on x86 and 5% on ARM for about $29.
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.