r/MachineLearning Debates a 1.088B-Parameter Pure SNN Language Model
Original: I scaled a pure Spiking Neural Network (SNN) to 1.088B parameters from scratch. Ran out of budget, but here is what I found [R]
What the post claimed
A research-heavy thread on r/MachineLearning drew attention by claiming a pure spiking neural network language model was trained from random initialization at 1.088B parameters, without relying on ANN-to-SNN conversion or distillation. The author, an 18-year-old independent developer, said the run had to stop at 27k steps because the training budget was exhausted, but that the model still reached a loss of 4.4. That is far from state-of-the-art language quality, yet the central claim is important: direct large-scale SNN training may be difficult, but it may not be impossible.
The post highlighted three observations from the run. First, the model reportedly maintained about 93% sparsity, with only roughly 7% of neurons firing per token. Second, structurally correct Russian text reportedly started to appear around step 25k even without explicit weighting for that language in the dataset mix. Third, once the architecture grew past 600M parameters, about 39% of activation routing shifted into a persistent memory module, which the author interpreted as the model learning that memory becomes more valuable at larger scale.
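The sparsity figure is the easiest of the three claims to pin down. A minimal sketch of how per-token firing sparsity is typically measured (the function name and binary spike-matrix layout here are illustrative assumptions, not the post's actual code):

```python
import numpy as np

def firing_sparsity(spikes: np.ndarray) -> float:
    """Fraction of (token, neuron) entries that did NOT fire.

    `spikes` is a (tokens, neurons) binary matrix: 1 means the neuron
    emitted a spike for that token, 0 means it stayed silent.
    """
    firing_rate = spikes.mean()   # fraction of entries that fired
    return 1.0 - firing_rate      # sparsity = fraction silent

# Toy example: 4 tokens, 100 neurons, exactly 7 firing per token,
# matching the post's reported ~7% firing rate.
spikes = np.zeros((4, 100), dtype=np.int8)
spikes[:, :7] = 1
print(f"sparsity = {firing_sparsity(spikes):.2f}")  # sparsity = 0.93
```

In practice the interesting question is whether this sparsity holds on hardware that can actually skip the silent 93%, which is where the neuromorphic-deployment comments below come in.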
Why researchers found it interesting
If those dynamics hold up under deeper evaluation, they matter for two reasons. The first is efficiency: sparse firing is one of the main reasons SNNs remain attractive for neuromorphic systems and memory-sensitive inference. The second is methodology: many earlier large-model SNN results lean on conversion or distillation because direct training is unstable. A post claiming random-init convergence at 1.088B parameters naturally invites attention, even if the run is incomplete.
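The conversion-versus-direct-training split comes down to the spiking nonlinearity itself. As background (this is a generic textbook update, not the post's architecture), a discrete-time leaky integrate-and-fire neuron looks like the following; the hard threshold is non-differentiable, which is why direct training usually substitutes a surrogate gradient on the backward pass:

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_th=1.0):
    """One discrete-time leaky integrate-and-fire update.

    v: membrane potentials, x: input currents. Neurons whose potential
    crosses v_th emit a binary spike and are hard-reset to 0.
    """
    v = v + (x - v) / tau                     # leak toward the input
    spikes = (v >= v_th).astype(np.float32)   # non-differentiable threshold
    v = v * (1.0 - spikes)                    # reset neurons that fired
    return v, spikes

# Drive three neurons with constant currents for three steps:
# only the strongest input accumulates enough potential to fire.
v = np.zeros(3, dtype=np.float32)
total_spikes = 0
for _ in range(3):
    v, s = lif_step(v, np.array([0.3, 0.9, 1.5], dtype=np.float32))
    total_spikes += int(s.sum())
print(total_spikes)  # 1
```

Conversion sidesteps this non-differentiability by training an ordinary ANN first; random-init training at the billion-parameter scale means confronting it directly, which is why the claim drew scrutiny.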
The author was also unusually explicit about the limits. Generation quality was described as still “janky” and nowhere near GPT-2 fluency. That framing kept the thread closer to systems research than hype.
Where the community pushed back
The comments quickly shifted from excitement to measurement. One of the strongest requests was to convert the reported loss into a cross-model comparable metric such as bits-per-byte. Others asked how the architecture would map to neuromorphic hardware like Loihi, pointed to earlier smaller-scale SNN-LLM work, and questioned whether sparsity benefits would survive real deployment costs. The result is a useful community snapshot: unconventional training results will get attention, but only if the next step is better baselines, reproducible checkpoints, and clearer evaluation than a single promising loss curve.
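The bits-per-byte request is straightforward to act on: a per-token cross-entropy loss in nats converts to bits per token by dividing by ln 2, then to bits per byte by dividing by the tokenizer's average bytes per token. A minimal sketch, where the 4-bytes-per-token ratio is a hypothetical placeholder (the real ratio depends on the tokenizer and corpus, which the post did not report):

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert per-token cross-entropy loss (nats) to bits-per-byte."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# The post's reported loss of 4.4 nats, under an assumed 4 bytes/token:
print(f"{bits_per_byte(4.4, 4.0):.2f} bpb")  # 1.59 bpb
```

Because bits-per-byte normalizes away tokenizer choice, it is the metric commenters wanted before comparing the run against conventional transformer baselines.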
Related Articles
The March 20, 2026 HN discussion around Attention Residuals focused on a simple claim with large implications: replace fixed residual addition with learned depth-wise attention and recover performance with modest overhead.
An r/LocalLLaMA post reports a from-scratch 144M-parameter Spiking Neural Network language model experiment named Nord. The author claims 97-98% inference sparsity, STDP-based online updates, and better prompt-level topic retention than GPT-2 Small on limited examples, while clearly noting current loss and benchmark limitations.
A Reddit thread surfaced Kimi's AttnRes paper, which argues that fixed residual accumulation in PreNorm LLMs dilutes deeper layers. The proposed attention-based residual path and its block variant aim to keep the gains without exploding memory cost.