Hacker News surfaced ATTN/11, a project that trains a single-layer, single-head Transformer in PDP-11 assembly on a PDP-11/34A. The README says that careful fixed-point math, per-layer learning rates, and a 32KB memory budget cut training time from multi-hour estimates to a 5.5-minute run that reaches 10/10 accuracy on a digit-reversal task.
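The blurb doesn't spell out how the fixed-point arithmetic works on a 16-bit machine. As a rough illustration only, here is a minimal C sketch of a saturating fixed-point multiply of the kind such a port might lean on; the Q4.12 format, round-to-nearest, and the helper names (fix_mul, to_fix, to_float) are assumptions for this example, not details taken from the ATTN/11 README.

```c
/* Sketch of 16-bit fixed-point arithmetic (Q4.12 assumed, not confirmed by ATTN/11). */
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 12               /* Q4.12: 4 integer bits, 12 fractional bits */
#define ONE       (1 << FRAC_BITS) /* fixed-point representation of 1.0         */

typedef int16_t fix16;

/* Multiply two Q4.12 values in a 32-bit intermediate, round to nearest,
 * then saturate instead of wrapping on overflow. */
static fix16 fix_mul(fix16 a, fix16 b) {
    int32_t p = (int32_t)a * (int32_t)b;   /* Q8.24 intermediate product */
    p += 1 << (FRAC_BITS - 1);             /* round to nearest           */
    p >>= FRAC_BITS;                       /* back to Q4.12              */
    if (p > INT16_MAX) p = INT16_MAX;      /* saturate high              */
    if (p < INT16_MIN) p = INT16_MIN;      /* saturate low               */
    return (fix16)p;
}

/* Conversions between double and Q4.12 for I/O and debugging. */
static fix16  to_fix(double x)   { return (fix16)(x * ONE + (x >= 0 ? 0.5 : -0.5)); }
static double to_float(fix16 x)  { return (double)x / ONE; }

int main(void) {
    fix16 w = to_fix(0.75);   /* example weight */
    fix16 x = to_fix(-1.5);   /* example input  */
    printf("%f * %f = %f\n", to_float(w), to_float(x), to_float(fix_mul(w, x)));
    return 0;
}
```

With these choices, every weight and activation fits in one 16-bit word, which is the kind of constraint a 32KB memory budget forces; the actual format and rounding rules used by ATTN/11 are described only in its README.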