Tiny Transformers with Under 100 Parameters Achieve 100% Accuracy on 10-Digit Addition
Original: [R] Tiny transformers (<100 params) can add two 10-digit numbers to 100% accuracy
Surprising Math Ability from Tiny Models
A fascinating research project shared on r/MachineLearning (144 upvotes) demonstrates that transformer models with fewer than 100 parameters can achieve 100% accuracy when adding two 10-digit numbers. The work, published as the AdderBoard project on GitHub, challenges assumptions about model scale and arithmetic capability.
The Key: Digit Tokenization
The key to this performance is digit tokenization: each individual digit is treated as a separate token rather than the number being processed as a whole unit. This lets the model learn the carry rules of addition far more effectively, because each digit position becomes a learnable step. Community members noted that this representation choice is essential, and that floating-point arithmetic would be dramatically harder.
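To make the idea concrete, here is a minimal Python sketch of what digit tokenization and the per-position carry rule look like. It is not the AdderBoard code: the function names, the fixed width of 10, and the least-significant-digit-first ordering are all assumptions chosen for illustration. The reference adder only spells out the rule a model would have to learn; it involves no transformer.

```python
from typing import List


def tokenize(n: int, width: int = 10) -> List[int]:
    """Split a non-negative integer into `width` digit tokens.

    Assumption: digits are emitted least-significant first, so the
    carry flows in the same direction as the token sequence.
    """
    if n < 0 or n >= 10 ** width:
        raise ValueError(f"n must fit in {width} decimal digits")
    return [(n // 10 ** i) % 10 for i in range(width)]


def detokenize(tokens: List[int]) -> int:
    """Inverse of tokenize: rebuild the integer from its digit tokens."""
    return sum(d * 10 ** i for i, d in enumerate(tokens))


def add_digit_tokens(a: List[int], b: List[int]) -> List[int]:
    """Reference column-wise addition over digit tokens.

    Each step depends only on two digits and one carry bit, which is
    the small local rule that digit tokenization exposes.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    out: List[int] = []
    carry = 0
    for da, db in zip(a, b):
        total = da + db + carry
        out.append(total % 10)   # digit written at this position
        carry = total // 10      # 0 or 1, passed to the next position
    out.append(carry)            # final carry becomes the extra top digit
    return out
```

For example, `detokenize(add_digit_tokens(tokenize(9999999999), tokenize(1)))` returns `10000000000`, with the carry propagating through all ten positions. A vocabulary of ten digit tokens is enough to express every operand, whereas treating each 10-digit number as a single unit would require a distinct symbol per number.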
Why This Matters
Large language models with billions of parameters frequently make errors on simple arithmetic. The fact that a model with under 100 parameters can perfectly solve 10-digit addition highlights that scale is not the only variable that matters: how data is represented and what the model is asked to learn are also critical design choices.
Limitations and Future Work
The researchers note that while this approach works extremely well for integer addition, floating-point arithmetic presents a much harder challenge because its number representation is more complex. The work opens new directions for making AI models numerically reliable at minimal parameter cost.