A Reddit thread surfaced arXiv paper 2603.10145, which argues that the output layer of language models is not merely a limit on softmax expressivity but an optimization bottleneck, suppressing 95-99% of the gradient norm. The discussion centered on whether better output-head designs could unlock more efficient LLM training.
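One way to see the kind of effect the paper describes: the cross-entropy gradient at the logits is softmax(z) - one_hot(y), which shrinks toward zero as the softmax saturates on confident predictions, so very little gradient norm flows back into the backbone. The sketch below is a minimal, hedged illustration of that mechanism only, not the paper's actual method; the vocabulary size, token count, and the `scale` knob for "confidence" are all illustrative assumptions.

```python
# Minimal sketch (assumption: the bottleneck manifests as the shrinking
# softmax(z) - one_hot(y) gradient of cross-entropy). Not the paper's setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n_tokens = 32_000, 256  # illustrative sizes, not from the paper
targets = torch.randint(0, vocab, (n_tokens,))

def grad_norm_at_logits(scale: float) -> float:
    """Gradient norm at the logits when target logits are boosted by `scale`."""
    z = torch.randn(n_tokens, vocab)
    z[torch.arange(n_tokens), targets] += scale  # sharper = more confident
    z.requires_grad_(True)
    F.cross_entropy(z, targets).backward()
    return z.grad.norm().item()

for scale in (0.0, 5.0, 10.0, 20.0):
    print(f"target-logit boost {scale:5.1f} -> grad norm at logits "
          f"{grad_norm_at_logits(scale):.6f}")
# As the softmax saturates, the gradient norm at the output layer collapses
# by orders of magnitude, leaving almost no signal to propagate upstream.
```

Running this shows the gradient norm dropping sharply as confidence rises, which gives a rough intuition for why the thread treated the output head as a place where most of the training signal can be lost.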