r/MachineLearning Debates a 1.088B-Parameter Pure SNN Language Model
Original: I scaled a pure Spiking Neural Network (SNN) to 1.088B parameters from scratch. Ran out of budget, but here is what I found [R]
What the post claimed
A research-heavy thread on r/MachineLearning drew attention by claiming a pure spiking neural network language model reached 1.088B parameters from random initialization without relying on ANN-to-SNN conversion or distillation. The author, an 18-year-old independent developer, said the run had to stop at 27k steps because the training budget was exhausted, but that the model still converged to a loss of 4.4. That is far from state-of-the-art language quality, yet the central claim is important: direct large-scale SNN training may be difficult, but it may not be impossible.
The post highlighted three observations from the run. First, the model reportedly maintained about 93% sparsity, with only roughly 7% of neurons firing per token. Second, structurally correct Russian text reportedly started to appear around step 25k even without explicit weighting for that language in the dataset mix. Third, once the architecture grew past 600M parameters, about 39% of activation routing shifted into a persistent memory module, which the author interpreted as the model learning that memory becomes more valuable at larger scale.
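The sparsity figure is the most easily checked of the three claims. The post does not describe the neuron model or thresholds, but a standard leaky integrate-and-fire (LIF) layer makes the measurement concrete: count the fraction of neurons that emit a spike per token and report one minus the mean. A minimal sketch, assuming a hard-reset LIF neuron with made-up threshold, decay, and input statistics (none of which come from the post):

```python
import numpy as np

def lif_step(v, x, threshold=1.0, decay=0.9):
    """One leaky integrate-and-fire step: decay the membrane potential,
    add the input current, emit binary spikes where the threshold is
    crossed, and hard-reset the neurons that fired."""
    v = decay * v + x
    spikes = (v >= threshold).astype(np.float32)
    v = v * (1.0 - spikes)  # hard reset after a spike
    return v, spikes

rng = np.random.default_rng(0)
n_neurons, n_tokens = 1024, 50
v = np.zeros(n_neurons, dtype=np.float32)

firing_rates = []
for _ in range(n_tokens):
    # Stand-in input current; a real model would use projected embeddings.
    x = rng.normal(0.0, 0.5, n_neurons).astype(np.float32)
    v, s = lif_step(v, x)
    firing_rates.append(s.mean())  # fraction of neurons spiking this token

sparsity = 1.0 - float(np.mean(firing_rates))
print(f"mean firing rate: {np.mean(firing_rates):.3f}, sparsity: {sparsity:.3f}")
```

Under the author's claim, the equivalent measurement over their model would report roughly 0.93 sparsity, i.e. about 7% of neurons active per token.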
Why researchers found it interesting
If those dynamics hold up under deeper evaluation, they matter for two reasons. The first is efficiency: sparse firing is one of the main reasons SNNs remain attractive for neuromorphic systems and memory-sensitive inference. The second is methodology: many earlier large-model SNN results lean on conversion or distillation because direct training is unstable. A post claiming random-init convergence at 1.088B parameters naturally invites attention, even if the run is incomplete.
The author was also unusually explicit about the limits. Generation quality was described as still “janky” and nowhere near GPT-2 fluency. That framing kept the thread closer to systems research than hype.
Where the community pushed back
The comments quickly shifted from excitement to measurement. One of the strongest requests was to convert the reported loss into a cross-model comparable metric such as bits-per-byte. Others asked how the architecture would map onto neuromorphic hardware like Loihi, pointed to earlier smaller-scale SNN-LLM work, and questioned whether sparsity benefits would survive real deployment costs. The result is a useful community snapshot: unconventional training results draw attention, but sustaining it takes better baselines, reproducible checkpoints, and clearer evaluation than a single promising loss curve.
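The bits-per-byte request is straightforward arithmetic once the tokenizer's compression ratio is known: divide the mean cross-entropy (in nats per token) by ln 2 to get bits per token, then scale by tokens per byte. A minimal sketch, assuming the reported 4.4 is nats per token and using an illustrative 4-bytes-per-token ratio (neither assumption is confirmed in the post):

```python
import math

def loss_to_bits_per_byte(loss_nats_per_token, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) into bits-per-byte,
    the tokenizer-independent metric commenters asked for."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token * n_tokens / n_bytes

# Hypothetical numbers: 4.4 nats/token with a tokenizer averaging
# ~4 bytes per token (illustrative values, not from the post).
bpb = loss_to_bits_per_byte(4.4, n_tokens=1_000_000, n_bytes=4_000_000)
print(f"{bpb:.3f} bits per byte")  # roughly 1.59 under these assumptions
```

This is exactly why commenters asked for the conversion: the same loss value implies very different bits-per-byte depending on how many bytes each token covers, so a raw loss of 4.4 is not comparable across tokenizers.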
Related Articles
r/MachineLearning treated this less like a finished breakthrough and more like a serious challenge to the current assumptions around large-scale spike-domain training. The April 13, 2026 post reported a 1.088B pure SNN language model reaching loss 4.4 at 27K steps with 93% sparsity, while commenters pushed for more comparable metrics and longer training before drawing big conclusions.
An r/LocalLLaMA post reports a from-scratch 144M-parameter Spiking Neural Network language model experiment named Nord. The author claims 97-98% inference sparsity, STDP-based online updates, and better prompt-level topic retention than GPT-2 Small on limited examples, while clearly noting current loss and benchmark limitations.
The March 20, 2026 HN discussion around Attention Residuals focused on a simple claim with large implications: replace fixed residual addition with learned depth-wise attention and recover performance with modest overhead.