What r/MachineLearning is actually discussing in the RBF-Attention experiment

Original: [P] I replaced Dot-Product Attention with distance-based RBF-Attention (so you don't have to...)

LLM · Apr 1, 2026 · By Insights AI (Reddit) · 1 min read

A project post on r/MachineLearning with 165 points and 23 comments explores a deceptively simple question: what happens if standard scaled dot-product attention is replaced with distance-based RBF attention? The author starts from a familiar criticism of dot products: attention scores are sensitive to vector magnitude, so a large key can dominate softmax even when it is not the best conceptual match. The proposed fix is to score tokens by Euclidean closeness rather than directional projection.
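To make the magnitude criticism concrete, here is a minimal NumPy sketch (illustrative, not the author's code): a large-norm key with only a partial direction match dominates the dot-product softmax, while the distance-based score picks the key that actually sits closest to the query.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

q  = np.array([1.0, 0.0])   # query
k0 = np.array([5.0, 5.0])   # large magnitude, partial direction match
k1 = np.array([1.0, 0.0])   # exact match, small magnitude
K  = np.stack([k0, k1])

dot_scores = K @ q                    # [5.0, 1.0] -> k0 wins on magnitude
rbf_scores = -((q - K) ** 2).sum(-1)  # [-41.0, 0.0] -> k1 wins on closeness

print(softmax(dot_scores))  # mass concentrates on k0
print(softmax(rbf_scores))  # mass concentrates on k1
```

The negative squared Euclidean distance is used as the score here, which is the standard RBF-attention formulation up to a bandwidth constant.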

The most useful part of the write-up is that it does not stop at the idea. It walks through the engineering consequences. A naive torch.cdist implementation blows up memory, so the score is algebraically expanded: -||q-k||^2 = -||q||^2 + 2(q·k) - ||k||^2, and since -||q||^2 is constant across keys for a given query, softmax shift invariance lets it be dropped, leaving 2(Q·K^T) - ||K||^2. That keeps the dominant part of the computation compatible with existing matrix-multiplication pipelines. Even then, PyTorch's fused SDPA path cannot inject the key-norm penalty, so the author wrote a custom Triton kernel to make the experiment practical.
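The rewrite is easy to verify numerically. A small sketch (single head, no bandwidth scaling assumed): the cdist-style form materializes all pairwise differences, while the rewritten form is a single matmul plus a per-key bias, and both yield identical softmax weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 queries, dim 8
K = rng.normal(size=(6, 8))  # 6 keys

# Naive: materialize the full pairwise squared-distance tensor (cdist-style).
naive = -((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)

# Matmul-friendly form: per-row query norms dropped via softmax shift invariance.
rewritten = 2 * Q @ K.T - (K ** 2).sum(-1)

assert np.allclose(softmax(naive), softmax(rewritten))
```

The scores themselves differ by exactly ||q||^2 per row; only the softmax outputs agree, which is why the trick works inside attention but would not work for raw score comparisons across queries.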

  • The model lost ordinary attention sinks, so the author added register tokens as dedicated places to absorb unused attention mass.
  • RoPE was removed because rotational geometry no longer fits the distance-based interpretation, and replaced with additive SuSiE embeddings.
  • A small TinyStories causal model converged slightly faster than the baseline, but the author explicitly says this is not a near-term FlashAttention replacement.
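A minimal sketch of the register-token idea from the first bullet (function name, shapes, and the zero-valued register values are illustrative assumptions, not taken from the repo): a few extra learned keys and values are prepended to the sequence so that attention mass with no good match has a dedicated place to land.

```python
import numpy as np

def rbf_attention_with_registers(Q, K, V, reg_K, reg_V):
    """Distance-based attention with register tokens prepended to K/V."""
    K_all = np.concatenate([reg_K, K], axis=0)
    V_all = np.concatenate([reg_V, V], axis=0)
    scores = 2 * Q @ K_all.T - (K_all ** 2).sum(-1)  # query norm dropped
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ V_all, w

rng = np.random.default_rng(1)
d = 8
out, w = rbf_attention_with_registers(
    rng.normal(size=(5, d)),        # queries
    rng.normal(size=(7, d)),        # keys
    rng.normal(size=(7, d)),        # values
    reg_K=rng.normal(size=(2, d)),  # 2 register slots
    reg_V=np.zeros((2, d)),         # zero values: absorbed mass adds nothing
)
```

Setting the register values to zero is one possible design choice: mass routed to a register then contributes nothing to the output, mimicking what a sink token does in standard attention.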

The Reddit comments pushed in the same direction. Some readers connected the work to other kernelized attention ideas, while others pointed out the hardware reality: even interesting alternatives struggle when the ecosystem is optimized around dot products. That is why this post matters. It is less a product announcement than a careful field report on how deeply a single mathematical choice is baked into the modern LLM stack.

Sources: the Reddit thread, the technical blog post, and the code repository.


© 2026 Insights. All rights reserved.