r/singularity highlights a paper arguing the LM head wastes most of the training signal
Original: Lost in Backpropagation: The LM Head is a Gradient Bottleneck | Researchers may have found a fundamental inefficiency baked into every major LLM
A Reddit thread in r/singularity surfaced an unusually technical paper for a general AI community: arXiv:2603.10145, Lost in Backpropagation: The LM Head is a Gradient Bottleneck. The paper argues that the output layer of neural language models is not just a familiar softmax expressivity bottleneck. It may also be an optimization bottleneck that quietly wastes a large share of the training signal before it reaches the rest of the model.
The core setup is simple. Language models map hidden features of size D into a vocabulary of size V, where D is much smaller than V. The authors argue that backpropagating through that rank-D output layer unavoidably compresses the gradient: the error signal over V logits has to pass back through a D-dimensional hidden state. In the abstract, they say 95 to 99 percent of the gradient norm is suppressed by the output layer, which pushes training updates away from the directions that would be most informative. That turns a long-discussed architectural quirk into a much more serious claim about optimization efficiency.
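To make the compression concrete, here is a minimal NumPy sketch of the linear-algebra fact behind the argument. It is a toy with random weights, not the paper's experiment or its metric: the sizes V and D, the random head W, and the helper name split_logit_gradient are all assumptions chosen for illustration. The gradient with respect to the hidden state is W^T g, so any part of the logit gradient g that lies outside the column space of W contributes nothing to it.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def split_logit_gradient(W, g):
    """Split g (length V) into the part inside the column space of W
    (shape V x D) and the orthogonal remainder. Hypothetical helper."""
    Q, _ = np.linalg.qr(W)      # reduced QR: Q has shape (V, D)
    g_kept = Q @ (Q.T @ g)      # component that W.T can "see"
    g_lost = g - g_kept         # component annihilated by W.T
    return g_kept, g_lost

rng = np.random.default_rng(0)
V, D = 2000, 64                                # toy sizes (assumed), with D << V
W = rng.standard_normal((V, D)) / np.sqrt(D)   # stand-in for an LM head
h = rng.standard_normal(D)                     # stand-in hidden state

# Cross-entropy gradient with respect to the logits: softmax(logits) - one_hot(target)
target = 3
g = softmax(W @ h)
g[target] -= 1.0

grad_h = W.T @ g                               # what actually flows back, length D
g_kept, g_lost = split_logit_gradient(W, g)
frac = np.linalg.norm(g_kept) / np.linalg.norm(g)

print(f"share of logit-gradient norm that survives the head: {frac:.3f}")
```

For a logit gradient pointing in a direction unrelated to W, the surviving share of the norm is about sqrt(D/V) on average, which is small whenever the vocabulary is much larger than the hidden size. Trained heads are not random, so this toy only shows why a rank-D layer must discard something; it does not reproduce the 95 to 99 percent figure from the abstract.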
The paper goes further than theory. According to the abstract, the authors run controlled experiments showing that as vocabulary size grows, the bottleneck can make even trivial patterns hard to learn. They also report materially slower convergence in realistic pretraining runs at the 2B-parameter scale. Their conclusion is that current language models may be training less efficiently than they could, independently of the wider architecture, simply because the last layer is throwing away too much useful supervision signal.
Reddit readers focused on exactly that implication. The top comment highlighted the paper's conclusion that the softmax bottleneck is not only about expressivity but about losing most of the supervision signal during backpropagation. Others jumped quickly to alternatives, pointing at latent-space generation ideas or other nonstandard output schemes as possible ways around the problem. Even in a short thread, the reaction was notable: people treated the LM head as a potentially under-discussed systems bottleneck rather than just a mathematical footnote.
If the result holds up, it matters for more than paper debates. A lot of current LLM progress still comes from scaling data, compute, and model size. This paper suggests that another lever may be hiding in plain sight: changing how models project hidden states into vocabulary logits and how gradients flow back through that interface.
Source: arXiv:2603.10145. Community discussion: r/singularity.