Reddit Debate: Is Attention fundamentally a d^2 problem rather than n^2?
Original: [D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)
What was posted
A high-engagement post on r/MachineLearning shared an anonymously authored PDF from a Korean community and framed it as a mathematical argument about Attention complexity. The post argues that when forward and backward dynamics are considered together, the optimization landscape explored by parameters is fundamentally d^2-dimensional rather than n^2. It also suggests this perspective could motivate alternatives to standard softmax attention.
The claim circulated via an r/MachineLearning discussion thread. At crawl time, the post had strong visibility and a substantial comment count, making it relevant as a community signal even though the claim itself remains unverified.
Main claims discussed in the thread
- Attention optimization geometry should be interpreted through a d^2 lens.
- Softmax may preserve matching behavior but contributes to an expensive scaling pattern.
- A polynomial-style alternative might keep useful structure while changing complexity tradeoffs.
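The first claim can be made concrete with a toy sketch. In a single attention head, the trainable projections are d-by-d matrices, so the space the optimizer explores scales with d^2, while the n-by-n softmax score matrix is transient activations, not parameters. This NumPy illustration uses hypothetical shapes chosen for exposition; it is not taken from the PDF:

```python
import numpy as np

n, d = 128, 64          # sequence length, model width (toy values)
rng = np.random.default_rng(0)

X  = rng.standard_normal((n, d))   # token embeddings: activations
Wq = rng.standard_normal((d, d))   # learned query projection
Wk = rng.standard_normal((d, d))   # learned key projection

# n x n score matrix: large, but recomputed each forward pass
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)

print(scores.shape)        # (128, 128) -- transient activations
print(Wq.size + Wk.size)   # 8192 = 2 * d^2 -- trainable parameters
```

This distinction (parameter dimensionality versus activation size) is the geometric framing; as commenters note below, it does not by itself settle which kernel performs better in practice.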
Commenters quickly moved from hype to critique. Several top responses said the derivation may be interesting as theory framing, but that equal optimization dimensionality does not prove functional equivalence between kernels. Others noted that complexity comparisons such as O(nd^3) versus O(n^2d) depend heavily on practical ranges of d, sequence lengths, and hardware behavior.
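The commenters' point about practical ranges can be checked with back-of-envelope arithmetic. The sketch below assumes the textbook ~2n^2·d multiply-add cost for softmax attention (QK^T plus attention-times-V) and ~2n·d^2 for a kernelized reordering that computes K^T V first; constants, memory traffic, and the thread's specific O(nd^3) term are not modeled:

```python
def softmax_attn_flops(n: int, d: int) -> int:
    # Q @ K.T costs n*n*d, attention @ V costs another n*n*d
    return 2 * n * n * d

def linear_attn_flops(n: int, d: int) -> int:
    # phi(K).T @ V costs n*d*d, phi(Q) @ (K.T V) costs another n*d*d
    return 2 * n * d * d

d = 128
for n in (64, 512, 4096):
    s, l = softmax_attn_flops(n, d), linear_attn_flops(n, d)
    print(f"n={n:5d}  softmax={s:>12,}  linear={l:>12,}  "
          f"{'linear cheaper' if l < s else 'softmax cheaper'}")
```

The crossover sits at n = d: below it the reordered form is more expensive, above it cheaper, which is why comparisons depend on realistic sequence lengths and widths rather than asymptotics alone.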
Why this still matters
Even if the theorem-level claim does not hold under review, the thread highlights a valuable pattern: ML communities are actively stress-testing how we describe Attention bottlenecks. That matters for model design, inference engineering, and benchmark interpretation. In practice, the right takeaway is not “replace Transformers now,” but “separate geometric insight from deployment evidence.”
For practitioners, a sensible evaluation checklist is: verify reproducible code, compare against established linear/hybrid attention baselines, track wall-clock and memory behavior in addition to asymptotic notation, and require independent peer validation before architectural conclusions.
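The wall-clock item on that checklist can be sketched with a minimal timing harness. This toy NumPy single-head forward pass (sizes and helper name hypothetical, not from the thread) shows how measured time can diverge from FLOP counts as n grows:

```python
import time
import numpy as np

def time_softmax_attention(n: int, d: int, repeats: int = 3) -> float:
    """Return best-of-repeats wall-clock time for one softmax attention pass."""
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        S = Q @ K.T / np.sqrt(d)                       # n x n scores
        S = np.exp(S - S.max(axis=-1, keepdims=True))  # stable softmax
        A = S / S.sum(axis=-1, keepdims=True)
        _ = A @ V
        best = min(best, time.perf_counter() - t0)
    return best

for n in (256, 1024, 4096):
    print(f"n={n:5d}  {time_softmax_attention(n, 64):.4f}s")
```

A real evaluation would extend this with peak-memory tracking and GPU timing, but even the CPU version makes the point that hardware behavior, not just big-O, decides the winner.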
Source: Reddit post