What r/MachineLearning is actually discussing in the RBF-Attention experiment
Original: [P] I replaced Dot-Product Attention with distance-based RBF-Attention (so you don't have to...)
A project post on r/MachineLearning with 165 points and 23 comments explores a deceptively simple question: what happens if standard scaled dot-product attention is replaced with distance-based RBF attention? The author starts from a familiar criticism of dot products: attention scores are sensitive to vector magnitude, so a large key can dominate softmax even when it is not the best conceptual match. The proposed fix is to score tokens by Euclidean closeness rather than directional projection.
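The distance-based scoring idea can be sketched in a few lines. This is a minimal NumPy illustration of the concept, not the author's implementation; the temperature parameter `tau` is an assumed knob, and the unbatched single-head shapes are chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rbf_attention(Q, K, V, tau=1.0):
    """RBF-style attention: scores are negative squared Euclidean
    distances between queries and keys, instead of dot products."""
    # Pairwise squared distances, shape (n_queries, n_keys)
    d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    weights = softmax(-d2 / tau, axis=-1)
    return weights @ V
```

Because closeness rather than projection drives the score, scaling a key's magnitude up moves it *away* from most queries instead of letting it dominate the softmax, which is the behavior the author is after.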
The most useful part of the write-up is that it does not stop at the idea; it walks through the engineering consequences. A naive torch.cdist implementation blows up memory, so the score is rewritten algebraically: the negative squared distance -||q-k||^2 expands to 2(q·k) - ||q||^2 - ||k||^2, and the per-query norm term can be dropped because softmax is invariant to a constant shift, leaving 2(Q·K) - ||K||^2. That keeps part of the computation compatible with existing matrix-multiplication pipelines. Even then, PyTorch’s fused SDPA path cannot inject the key-norm penalty, so the author wrote a custom Triton kernel to make the experiment practical.
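The equivalence is easy to check numerically. The sketch below (a NumPy stand-in for the torch.cdist-style path, not the author's Triton kernel) confirms that dropping the per-query norm term leaves the softmax output unchanged:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))

# Naive path: materialize all pairwise squared distances
d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
naive = softmax(-d2)

# Rewritten path: 2(Q K^T) - ||K||^2; the -||Q||^2 term is constant
# per query row, so softmax shift invariance lets us drop it
rewritten = softmax(2.0 * Q @ K.T - (K ** 2).sum(-1))

assert np.allclose(naive, rewritten)
```

The rewritten form keeps the expensive part as a single matmul, which is what makes it compatible with existing GPU pipelines; only the key-norm bias term is nonstandard.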
- The model lost ordinary attention sinks, so the author added register tokens as dedicated places to absorb unused attention mass.
- RoPE was removed because rotational geometry no longer fits the distance-based interpretation, and replaced with additive SuSiE embeddings.
- A small TinyStories causal model converged slightly faster than the baseline, but the author explicitly says this is not a near-term FlashAttention replacement.
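The register-token fix from the list above can also be sketched. This is a hypothetical minimal version, assuming registers are simply extra learned key/value slots (here `R_k`, `R_v`, invented names) prepended to the sequence so unused attention mass has somewhere to go:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rbf_attention_with_registers(Q, K, V, R_k, R_v, tau=1.0):
    """RBF attention with register tokens: R_k/R_v are learned
    key/value slots prepended to absorb leftover attention mass."""
    K_aug = np.concatenate([R_k, K], axis=0)
    V_aug = np.concatenate([R_v, V], axis=0)
    d2 = ((Q[:, None, :] - K_aug[None, :, :]) ** 2).sum(-1)
    weights = softmax(-d2 / tau, axis=-1)
    return weights @ V_aug
```

In a real model the register slots would be trainable parameters and the outputs corresponding to them would be discarded; the sketch only shows where they enter the score computation.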
The Reddit comments pushed in the same direction. Some readers connected the work to other kernelized attention ideas, while others pointed out the hardware reality: even interesting alternatives struggle when the ecosystem is optimized around dot products. That is why this post matters. It is less a product announcement than a careful field report on how deeply a single mathematical choice is baked into the modern LLM stack.
Sources: the Reddit thread, the technical blog post, and the code repository.
Related Articles
A popular r/MachineLearning discussion examines an unofficial theorem-style claim that Attention’s core optimization geometry is d^2, not n^2. Community response is mixed: strong curiosity, but equally strong calls for peer review and reproducible evidence.
The March 20, 2026 HN discussion around Attention Residuals focused on a simple claim with large implications: replace fixed residual addition with learned depth-wise attention and recover performance with modest overhead.
David Noel Ng's follow-up post treats layer duplication as a search problem rather than a lucky trick, then ties it to multilingual hidden-state evidence that the middle of the network may host a shared reasoning space.