r/singularity highlights a paper arguing the LM head wastes most of the training signal
Original: Lost in Backpropagation: The LM Head is a Gradient Bottleneck | Researchers may have found a fundamental inefficiency baked into every major LLM
A Reddit thread in r/singularity surfaced an unusually technical paper for a general AI community: arXiv:2603.10145, Lost in Backpropagation: The LM Head is a Gradient Bottleneck. The paper argues that the output layer of neural language models is not just the familiar softmax expressivity bottleneck. It may also be an optimization bottleneck that quietly discards a large share of the training signal before it reaches the rest of the model.
The core setup is simple. Language models project hidden states of dimension D onto logits over a vocabulary of size V, where D is much smaller than V. The authors argue that when gradients are backpropagated through that rank-D output layer, compression is unavoidable: the head can only transmit the component of the logit gradient that lies in its D-dimensional subspace. In the abstract, they say 95 to 99 percent of the gradient norm is suppressed by the output layer, which pushes training updates away from the directions that would be most informative. That turns a long-discussed architectural quirk into a much more serious claim about optimization efficiency.
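The geometric intuition is easy to check with a back-of-the-envelope sketch (not the paper's method or numbers). Assuming an LM head whose logits are W @ h for a random D-by-V weight matrix, the gradient that reaches the hidden state is W.T @ g, so only the component of the logit gradient g inside the head's D-dimensional column space survives. With illustrative sizes D=64 and V=32000, most of the gradient norm falls outside that subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 32000  # hidden size << vocab size; illustrative values, not from the paper

# Hypothetical LM head: logits = W @ h, with W of shape (V, D).
W = rng.standard_normal((V, D)) / np.sqrt(D)
h = rng.standard_normal(D)

# Cross-entropy gradient w.r.t. the logits: softmax(logits) - one_hot(target).
logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()
target = rng.integers(V)
g_logits = p.copy()
g_logits[target] -= 1.0

# Backprop through the head computes dL/dh = W.T @ g_logits, so the head can
# only transmit the part of g_logits lying in the column space of W — a
# D-dimensional subspace of R^V. Measure how much norm falls outside it.
Q, _ = np.linalg.qr(W)              # orthonormal basis (V x D) of that subspace
g_passed = Q @ (Q.T @ g_logits)     # component the head can actually transmit
suppressed = 1.0 - np.linalg.norm(g_passed) / np.linalg.norm(g_logits)
print(f"fraction of logit-gradient norm outside the rank-{D} head: {suppressed:.1%}")
```

For a random subspace the retained norm scales roughly like sqrt(D/V), so growing the vocabulary while holding D fixed drives the suppressed fraction toward 1, which is the direction of the paper's vocabulary-scaling claim. This toy ignores everything the trained W actually learns, so it illustrates the geometry only.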
The paper goes further than theory. According to the abstract, the authors run controlled experiments showing that as vocabulary size grows, the bottleneck can make even trivial patterns hard to learn. They also report materially slower convergence in realistic pretraining runs at the 2B-parameter scale. Their conclusion is that current language models may be training less efficiently than they could, independent of the rest of the architecture, simply because the last layer throws away too much useful supervision signal.
Reddit readers focused on exactly that implication. The top comment highlighted the paper's conclusion that the softmax bottleneck is not only about expressivity but about losing most of the supervision signal during backpropagation. Others jumped quickly to alternatives, pointing at latent-space generation ideas or other nonstandard output schemes as possible ways around the problem. Even in a short thread, the reaction was notable: people treated the LM head as a potentially under-discussed systems bottleneck rather than just a mathematical footnote.
If the result holds up, it matters for more than paper debates. A lot of current LLM progress still comes from scaling data, compute, and model size. This paper suggests that another lever may be hiding in plain sight: changing how models project hidden states into vocabulary logits and how gradients flow back through that interface. Source: arXiv:2603.10145. Community discussion: r/singularity.
Related Articles
The arXiv paper Ares, submitted on March 9, 2026, proposes dynamic per-step reasoning selection for multi-step LLM agents. The authors report up to 52.7% lower reasoning token usage versus fixed high-effort settings with only minimal drops in task success.
A February 13, 2026 post in r/LocalLLaMA highlighted NVIDIA Dynamic Memory Sparsification (DMS), claiming up to 8x KV cache memory savings without accuracy loss. Community discussion centered on inference cost, throughput, and what needs verification from primary technical sources.
A LocalLLaMA post pointed to a new Hugging Face dataset of human-written code reviews, pairing before-and-after code changes with inline reviewer comments and negative examples across 37 languages.