A post in r/MachineLearning argues that duplicating a specific seven-layer block inside Qwen2-72B improved benchmark performance without changing any weights.
#transformers
A fast-rising LocalLLaMA post resurfaced David Noel Ng's write-up on duplicating a seven-layer block inside Qwen2-72B, a no-training architecture tweak that reportedly lifted scores on multiple Open LLM Leaderboard benchmarks.
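The exact block and indices from the write-up aren't restated here, but the kind of surgery described, copying a contiguous run of decoder layers and re-inserting it with no retraining, can be sketched against a Hugging Face-style Qwen2 checkpoint roughly as follows (the layer indices below are placeholders, not the ones from the post):

```python
# Minimal sketch of the "duplicate a block of decoder layers, no retraining" idea.
# START/END are illustrative placeholders, not the seven-layer block from the post.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-72B")  # any Qwen2-style decoder shows the mechanics

START, END = 20, 27  # hypothetical 7-layer block [START, END); not the write-up's indices

layers = model.model.layers                                   # nn.ModuleList of decoder blocks
dup = [copy.deepcopy(layers[i]) for i in range(START, END)]   # copy the weights rather than aliasing them

# Re-insert the copied block immediately after the original one; no weights are modified.
new_layers = list(layers[:END]) + dup + list(layers[END:])
for idx, layer in enumerate(new_layers):
    layer.self_attn.layer_idx = idx                           # keep KV-cache indexing consistent after insertion
model.model.layers = nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)              # keep the config in sync with the new depth
```

In practice, community self-merges of this sort are usually produced with mergekit's passthrough merge rather than manual module surgery like the above, but the effect on the forward pass is the same: the copied block runs twice with identical weights.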
A popular r/MachineLearning discussion examines an unofficial theorem-style claim that attention’s core optimization geometry scales as d^2, not n^2. Community response is mixed: strong curiosity, but equally strong calls for peer review and reproducible evidence.
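For readers unfamiliar with the shorthand, and assuming d means model width and n means sequence length (the post's own definitions aren't restated here), the standard accounting the debate revolves around is that trainable attention parameters scale with d^2, while the familiar n^2 term lives in the activation-side score matrix rather than in the optimized weights:

```python
# Back-of-envelope accounting behind a d^2-vs-n^2 contrast. Illustrative only:
# d = model width, n = sequence length, h = attention heads are assumed meanings,
# not definitions taken from the post; GQA and biases are ignored.
def attn_trainable_params(d: int) -> int:
    return 4 * d * d          # Q, K, V, and output projections, each d x d

def attn_score_entries(n: int, h: int) -> int:
    return h * n * n          # softmax(QK^T) is n x n per head: activations, not weights

d, n, h = 8192, 4096, 64      # example values; d matches Qwen2-72B's hidden size
print(f"trainable attention params per layer: {attn_trainable_params(d):,}")   # grows as d^2
print(f"score-matrix entries per layer:       {attn_score_entries(n, h):,}")   # grows as n^2
```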
An r/MachineLearning post surfaced AdderBoard, where community submissions report 100% accuracy on 10-digit addition with extremely small transformer designs, including hand-coded models under 100 parameters.
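None of the AdderBoard submissions are reproduced here, but the task's structure helps explain why tiny exact adders are plausible: each output digit depends only on the two digits at that position plus a single carry bit, so the state a model must track is bounded regardless of operand length. A plain-Python statement of that recurrence:

```python
# The per-digit recurrence any exact adder has to realize: each output digit depends
# only on the two input digits at that position plus one carry bit. This illustrates
# the task's structure, not any particular AdderBoard submission.
def add_by_digits(a: str, b: str, width: int = 10) -> str:
    a, b = a.zfill(width)[::-1], b.zfill(width)[::-1]   # least-significant digit first
    carry, out = 0, []
    for da, db in zip(a, b):
        s = int(da) + int(db) + carry                   # bounded local state: s in [0, 19]
        out.append(str(s % 10))
        carry = s // 10                                 # carry is a single bit
    out.append(str(carry))
    return "".join(reversed(out)).lstrip("0") or "0"

assert add_by_digits("9999999999", "1") == "10000000000"
assert add_by_digits("1234567890", "987654321") == str(1234567890 + 987654321)
```

Hand-coded entries presumably encode some version of this carry recurrence directly in their weights, which is what makes sub-100-parameter solutions conceivable.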