A LocalLLaMA Experiment Puts a Tiny LLM on a 1998 iMac G3 with 32 MB of RAM
Original post: "I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM"
A high-signal LocalLLaMA post this week described an experiment that sounds absurd until you read the implementation notes: a stock 1998 iMac G3 with 32 MB of RAM running a local language model. The project ports Karpathy's llama2.c approach to classic Mac OS and targets the original Bondi Blue iMac without hardware upgrades.
The model choice is what makes the trick possible. Instead of forcing a contemporary checkpoint onto vintage hardware, the author uses the 260K-parameter TinyStories model with a roughly 1 MB checkpoint and runs it entirely in local memory. According to the README, the app reads a prompt from prompt.txt, tokenizes it with a 512-token BPE vocabulary, executes the transformer forward pass, and writes the continuation to output.txt. On the 233 MHz PowerPC G3, 32 generated tokens take less than a second.
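Stripped to its essentials, that pipeline is just file-in, file-out. Here is a minimal C sketch of the same loop; the function names and the `generate_stub` placeholder are illustrative, not the author's code, and the stub stands in for the real tokenize-and-forward-pass step:

```c
/* Sketch of a prompt.txt -> generate -> output.txt loop, in the spirit
 * of the port described above.  All names here are hypothetical. */
#include <stdio.h>

#define MAX_PROMPT 256   /* fixed static buffers, as the port uses */
#define MAX_STEPS  32    /* the port caps generation at 32 tokens */

static char prompt_buf[MAX_PROMPT];

/* Read the whole prompt file into a fixed static buffer. */
static long read_prompt(const char *path, char *buf, long cap) {
    FILE *f = fopen(path, "rb");
    long n;
    if (!f) return -1;
    n = (long)fread(buf, 1, (size_t)(cap - 1), f);
    buf[n] = '\0';
    fclose(f);
    return n;
}

/* Placeholder for BPE tokenization plus up to MAX_STEPS transformer
 * forward-pass steps; here it just echoes the prompt. */
static void generate_stub(const char *prompt, char *out, long cap) {
    snprintf(out, (size_t)cap, "%s ...", prompt);
}

/* Write the generated continuation to the output file. */
static int write_output(const char *path, const char *text) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fputs(text, f);
    fclose(f);
    return 0;
}
```

A caller would chain `read_prompt("prompt.txt", ...)`, `generate_stub`, and `write_output("output.txt", ...)`; the fixed-size static buffers mirror the port's strategy of avoiding dynamic allocation wherever possible.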
The technical details are more interesting than the novelty headline. Because the PowerPC CPU is big-endian, the model and tokenizer files must be byte-swapped before use. Mac OS 8.5 also gives applications a tiny default memory partition, so the project expands heap space with MaxApplZone(), allocates through NewPtr(), and relies on static buffers to avoid malloc failures. The author also had to reduce the maximum sequence length from 512 to 32 and fix a grouped-query attention weight-layout bug that caused later tensors to point at the wrong memory offsets.
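The endian fix is worth seeing concretely: llama2.c checkpoints are written on little-endian machines, so every 4-byte float must be byte-swapped before the big-endian PowerPC can interpret it. A sketch of that conversion (the function names are illustrative, not the author's):

```c
/* Byte-swap little-endian float32 weights for a big-endian CPU. */
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v) {
    return ((v >> 24) & 0x000000FFu) |
           ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) |
           ((v << 24) & 0xFF000000u);
}

/* Byte-swap an array of float32 weights in place. */
static void swap_weights(float *w, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &w[i], sizeof bits);  /* avoids strict-aliasing issues */
        bits = swap32(bits);
        memcpy(&w[i], &bits, sizeof bits);
    }
}
```

Whether the swap happens at load time on the Mac or as a preprocessing step on the host before FTP transfer is an implementation choice; the post's README describes converting the files before use.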
Why this experiment matters
This is not about useful throughput or modern reasoning quality. It is a sharp illustration of how small language models can travel when the software stack is stripped down to essentials. The repo documents cross-compilation with Retro68, endian conversion, file transfer over FTP, and even the lack of a usable console on Mac OS 8.5, which forced all debugging into text files.
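The no-console constraint is easy to picture: with nowhere to print, every trace has to be appended to a file and inspected after the run. A hypothetical minimal logger in that spirit (not the author's code):

```c
/* Append a formatted debug line to a text file, since Mac OS 8.5
 * offers no usable console.  Illustrative sketch only. */
#include <stdio.h>
#include <stdarg.h>

static void debug_log(const char *path, const char *fmt, ...) {
    va_list ap;
    FILE *f = fopen(path, "ab");  /* append, so earlier lines survive a crash */
    if (!f) return;
    va_start(ap, fmt);
    vfprintf(f, fmt, ap);
    va_end(ap);
    fputc('\n', f);
    fclose(f);
}
```

Opening and closing the file on every call is slow but crash-safe, which matters when the only failure signal is whatever made it to disk before a freeze.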
- Hardware: 233 MHz PowerPC 750, 32 MB RAM, Mac OS 8.5.
- Model: TinyStories 260K, Llama 2 architecture, about 1 MB checkpoint.
- Main lesson: tiny checkpoints and careful systems work can push local inference far beyond what modern expectations suggest.
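The arithmetic behind the "about 1 MB checkpoint" figure is simple: 260K float32 parameters at 4 bytes each is roughly 1 MB, leaving ample headroom in 32 MB of RAM. As a one-line check (assuming 4-byte floats):

```c
/* Back-of-envelope checkpoint size for an n-parameter float32 model. */
#include <stddef.h>

static size_t checkpoint_bytes(size_t n_params) {
    return n_params * sizeof(float);  /* sizeof(float) == 4 on this target */
}
/* checkpoint_bytes(260000) is 1,040,000 bytes, about 1 MB */
```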
In other words, the post is less a stunt than a compact history lesson: the minimum viable LLM is much smaller than the models dominating current product conversations.
Related Articles
Ollama used a March 30, 2026 preview to move its Apple Silicon path onto MLX. The release pairs higher prefill and decode throughput with NVFP4 support and cache changes aimed at coding and agent workflows.
A popular LocalLLaMA benchmark post argued that Qwen3.5 27B hits an attractive balance between model size and throughput, using an RTX A6000, llama.cpp with CUDA, and a 32k context window to show roughly 19.7 tokens per second.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.