r/MachineLearning Watches "Clip to Grok" Claim of 18x-66x Faster Generalization

Original: [P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo

Mar 20, 2026 · By Insights AI (Reddit) · 2 min read

Reddit surfaced a small but concrete grokking claim

On March 17, 2026, an r/MachineLearning post about Clip to Grok had reached 56 points and 20 comments at crawl time. The authors describe a very simple intervention: after every optimizer step, clip each decoder-layer weight row back to an L2 norm bound. In the repository README, they call it per-row weight norm clipping and position it as a way to remove grokking delay without weight decay, gradient filtering, or optimizer-specific retuning.
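The clipping step itself is simple enough to sketch. The following is a hypothetical NumPy illustration of per-row L2 norm clipping, not the repository's actual code; the function name `clip_rows` and the tolerance constant are assumptions, and the repo's implementation may differ in detail.

```python
import numpy as np

def clip_rows(W, max_norm=2.0):
    """Project each row of W back inside an L2 ball of radius max_norm.

    Rows already within the bound are left untouched; rows outside it
    are rescaled so their norm equals max_norm.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)      # per-row L2 norms
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

# Row 0 has norm 5.0 and gets rescaled to norm 2.0; row 1 (norm 0.5) is unchanged.
W = np.array([[3.0, 4.0],
              [0.3, 0.4]])
clipped = clip_rows(W)
```

Because the projection only rescales rows that exceed the bound, it is idempotent: applying it twice gives the same result as applying it once.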

The reported benchmark numbers are what drew attention. On the repository's modular arithmetic setup, the authors say a 2-layer 422k-parameter model using Lion plus clipping hits the target in 550 median steps versus 35,040 for an AdamW baseline, a 66x speedup. On an 8-layer 1.6M-parameter model, they report 1,570 median steps versus 28,905 for baseline, about 18x faster, plus zero failures across 300 edge-initialized runs. The README also says embeddings and output heads are intentionally excluded from clipping, while decoder weights and final LayerNorm are constrained after each optimizer update.

Why the post is interesting

What makes the thread worth watching is the simplicity-to-effect ratio. Machine learning is full of techniques that promise better generalization but require complicated schedules or fragile hyperparameter tuning. Here the claim is the opposite: a short post-step projection, a fixed max_norm=2.0, and robustness across optimizers were reportedly enough to change the training dynamics of grokking-style tasks. If that generalizes, it could make a traditionally slow and unstable phenomenon much easier to study.
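To make the "post-step projection" framing concrete, here is a toy training loop sketch, again in NumPy under stated assumptions: the parameter names, the stand-in gradients, and the plain SGD update are all hypothetical, while the structure (project only selected weights after every optimizer step, leaving embeddings unclipped, as the README describes) mirrors the post's claim.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": only the decoder weight is clipped; the embedding is
# excluded, matching the README's stated exclusions (names are made up).
params = {"decoder": rng.normal(size=(4, 8)) * 3.0,
          "embedding": rng.normal(size=(10, 8)) * 3.0}
CLIPPED = {"decoder"}

def project(W, max_norm=2.0):
    """Rescale any row whose L2 norm exceeds max_norm down to the bound."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

for step in range(3):
    # Stand-in for a real optimizer step (the post used Lion / AdamW).
    for name in params:
        grad = rng.normal(size=params[name].shape)
        params[name] -= 0.1 * grad
    # Post-step projection: applied after the update, only to clipped params.
    for name in CLIPPED:
        params[name] = project(params[name])
```

After the loop, every decoder row sits inside the norm-2 ball, while embedding rows remain at whatever scale training left them, which is the separation of constrained and unconstrained parameters the post describes.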

The caution is explicit in the Reddit post itself. The authors say all current results are on modular arithmetic, not frontier language-model pretraining, and that a 277M-parameter LLM test was still running on their hardware. So the honest takeaway is not that grokking has been solved universally. It is that a narrow benchmark produced unusually large gains from a cheap intervention, and the community now has code and a PDF with which to interrogate that claim, rather than just a chart on social media.


© 2026 Insights. All rights reserved.