r/singularity highlights a paper arguing the LM head wastes most of the training signal
Original: Lost in Backpropagation: The LM Head is a Gradient Bottleneck | Researchers may have found a fundamental inefficiency baked into every major LLM
A Reddit thread in r/singularity surfaced an unusually technical paper for a general AI community: arXiv:2603.10145, Lost in Backpropagation: The LM Head is a Gradient Bottleneck. The paper argues that the output layer of neural language models is not just the familiar softmax expressivity bottleneck. It may also be an optimization bottleneck that quietly wastes a large share of the training signal before it ever reaches the rest of the model.
The core setup is simple. A language model's output layer projects hidden states of dimension D onto logits over a vocabulary of size V, where D is much smaller than V. The authors argue that when gradients are backpropagated through that rank-D layer, the V-dimensional error signal is unavoidably compressed into D dimensions. In the abstract, they report that 95 to 99 percent of the gradient norm is suppressed by the output layer, pushing training updates away from the directions that would be most informative. That turns a long-discussed architectural quirk into a much more serious claim about optimization efficiency.
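The dimensional argument can be made concrete with a back-of-envelope sketch. The following toy snippet (not from the paper; the sizes and the random stand-in gradient are assumptions) measures how much of a V-dimensional logit gradient can even reach the hidden state through a rank-D head:

```python
import torch

# Toy illustration, not the paper's setup: D and V are hypothetical,
# and a random vector stands in for the cross-entropy logit gradient.
D, V = 256, 32_000
torch.manual_seed(0)

W = torch.randn(V, D) / D**0.5       # LM head weight: logits = W @ h

# The gradient of cross-entropy w.r.t. the logits (softmax probabilities
# minus the one-hot target) is a V-dimensional vector.
g_logits = torch.randn(V)

# Backprop hands the hidden state only W.T @ g_logits, so any component
# of g_logits orthogonal to col(W) never influences the rest of the model.
Q, _ = torch.linalg.qr(W)            # orthonormal basis for col(W)
g_visible = Q @ (Q.T @ g_logits)     # the part of the signal the head can pass on

retained = g_visible.norm() / g_logits.norm()
print(f"gradient norm retained: {retained:.1%}")  # roughly sqrt(D/V), ~8.9% here
```

For a random direction in R^V, the expected retained norm is about sqrt(D/V), under 10 percent at these toy sizes, which is at least consistent in spirit with the 95 to 99 percent suppression figure quoted from the abstract.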
The paper goes further than theory. According to the abstract, the authors run controlled experiments showing that as vocabulary size grows, the bottleneck can make even trivial patterns hard to learn. They also report materially slower convergence in realistic pretraining runs at the 2B-parameter scale. Their conclusion is that current language models may be training less efficiently than they could, independently of the wider architecture, simply because the last layer is throwing away too much useful supervision signal.
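A rough home-grown probe of that scaling claim might look like the sketch below (a minimal setup of my own under stated assumptions, not the authors' controlled experiment): train embeddings and a rank-D head on the trivial task "predict the token you just saw" with a fixed optimization budget, and watch how accuracy changes as the vocabulary grows.

```python
import torch
import torch.nn as nn

# Toy probe, not the paper's experiment: a trivial identity (copy) task
# trained through a rank-D output head with a fixed step budget. All
# sizes and hyperparameters here are hypothetical stand-ins.
def copy_task_accuracy(V, D=64, steps=200, lr=1e-2):
    torch.manual_seed(0)
    emb = nn.Embedding(V, D)
    head = nn.Linear(D, V, bias=False)   # rank-D bottleneck into V logits
    opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    tokens = torch.arange(V)             # full batch: every token, every step
    for _ in range(steps):
        loss = loss_fn(head(emb(tokens)), tokens)  # target = the input token
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (head(emb(tokens)).argmax(-1) == tokens).float().mean().item()

for V in (256, 1024, 4096):
    print(f"V={V:>5}: copy accuracy after fixed budget = {copy_task_accuracy(V):.1%}")
```

If the paper's claim holds, accuracy after the same budget should fall as V grows even though the target mapping is the identity; treat the exact numbers as illustrative only.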
Reddit readers focused on exactly that implication. The top comment highlighted the paper's conclusion that the softmax bottleneck is not only about expressivity but about losing most of the supervision signal during backpropagation. Others jumped quickly to alternatives, pointing at latent-space generation ideas or other nonstandard output schemes as possible ways around the problem. Even in a short thread, the reaction was notable: people treated the LM head as a potentially under-discussed systems bottleneck rather than just a mathematical footnote.
If the result holds up, it matters beyond academic debate. A lot of current LLM progress still comes from scaling data, compute, and model size. This paper suggests another lever may be hiding in plain sight: changing how models project hidden states into vocabulary logits, and how gradients flow back through that interface. Source: arXiv:2603.10145. Community discussion: r/singularity.
Related Articles
MegaTrain proposes training 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host memory and streaming layers through the device. The recent Hacker News interest is notable because the paper reframes the problem as one of memory-system design rather than simple GPU count.
A Hacker News discussion focused on SkyPilot's argument that coding agents work better when they read papers and competing implementations before editing code. In the reported llama.cpp experiments, that research-first loop produced 5 viable optimizations and improved TinyLlama text generation by 15% on x86 and 5% on ARM for about $29.
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.