LLM · Reddit · Mar 6, 2026 · 1 min read
A popular r/MachineLearning discussion examines an unofficial, theorem-style claim that Attention's core optimization geometry scales with d^2 (the head dimension), not n^2 (the sequence length). Community response is mixed: strong curiosity, matched by equally strong calls for peer review and reproducible evidence.
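One plausible reading of the claim (an assumption here, not something the thread confirms): although the attention score matrix QK^T is n x n, the learnable bilinear form behind it, W_Q W_K^T, is only d x d, so the optimizer effectively moves through a d^2-dimensional parameter space regardless of sequence length. A minimal NumPy sketch of that reading:

```python
# Sketch of one possible interpretation of the d^2-vs-n^2 claim.
# Assumption (not from the thread): "optimization geometry" refers to the
# learnable bilinear form W_Q @ W_K.T, which is d x d, even though the
# attention score matrix it induces is n x n.
import numpy as np

n, d = 1024, 64                        # sequence length, head dimension
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))        # token embeddings
W_Q = rng.standard_normal((d, d))      # learnable query projection
W_K = rng.standard_normal((d, d))      # learnable key projection

# Path 1: the usual n x n attention scores.
scores = (X @ W_Q) @ (X @ W_K).T       # shape (n, n): n^2 entries

# Path 2: identical scores, factored through a single d x d matrix.
M = W_Q @ W_K.T                        # shape (d, d): d^2 learnable entries
scores_factored = X @ M @ X.T

assert np.allclose(scores, scores_factored)
print(f"score matrix: {scores.shape} ({n * n} entries)")
print(f"learned bilinear form: {M.shape} ({d * d} entries)")
```

Under this reading, n^2 is the cost of evaluating attention, while d^2 is the size of the space gradient descent actually searches.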