Hacker News revives ATTN/11, a Transformer trained in PDP-11 assembly
Original: Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer
Why Hacker News picked it up
Hacker News picked up ATTN/11 because it turns “can a Transformer train on old hardware?” into a concrete engineering result instead of a nostalgia demo. The project is a single-layer, single-head Transformer written in PDP-11 assembly for a PDP-11/34A, and it is trained on a digit-reversal task rather than merely running inference from static weights. That makes it a real training story, not just retro packaging around modern ideas.
The README keeps the architecture deliberately compact: an encoder-only stack with d_model 16, sequence length 8, vocabulary 10, and 1,216 parameters. The data path is the familiar Transformer skeleton of embedding, self-attention, residual connection, output projection, and softmax. The author explicitly notes that this is not BERT or GPT because there is no decoder, feed-forward block, or layer norm, but it is still a genuine self-attention model built to learn a nontrivial routing task.
What made it fit 1970s hardware
The interesting engineering is below the headline. An early Fortran IV version reportedly needed 25 minutes per 100 steps and 1,500 steps to reach 100% accuracy, implying about 6.5 hours of training on real hardware. An assembly rewrite plus hand-tuned per-layer learning rates cut that to 600 steps and an estimated 2.5 hours, and the final fixed-point NN11 stack brought the optimized build down to 350 steps and about 5.5 minutes on the author’s PDP-11/34A.
The README ties that speedup to very concrete choices. The project uses plain SGD instead of Adam to avoid extra state vectors and costly square roots or divisions, relies on exp and log lookup tables for softmax and loss reporting, and adopts Q8/Q15 fixed-point math so the model fits in 32KB of core memory instead of 64KB. The resulting binary is 6,179 bytes, and the sample console log ends with 10/10 accuracy on the reversal task.
Why the project matters
ATTN/11 is not evidence that 1970s minicomputers can train modern LLMs. Its real value is narrower and more interesting: it shows which pieces of the Transformer stack are fundamental enough to survive severe hardware limits. Strip the system down to self-attention, residuals, fixed-point arithmetic, and a small algorithmic task, and training still works. That is exactly the kind of project Hacker News likes because it pulls abstraction back into mechanism and makes the core requirements of a Transformer feel physical again.
Related Articles
A Reddit post in r/MachineLearning highlights a new MIT 2026 course on flow matching and diffusion models with lecture videos, mathematically self-contained notes, and coding exercises. The updated course expands into latent spaces, diffusion transformers, and discrete diffusion language models.
A post on r/LocalLLaMA highlighted Kreuzberg v4.5, a Rust-based document intelligence framework that now adds stronger layout and table understanding. The release claims Docling-level quality with lower memory overhead and materially faster processing.
A widely discussed Hacker News thread surfaced a Rust community summary that sees AI as useful for search, review assistance, and tedious semi-structured work, but risky for learning, subtle defects, ethics, power use, and vendor concentration.