13 Months After the DeepSeek Moment: How Far Has Local AI Come?
Original post: "13 months since the DeepSeek moment, how far have we gone running models locally?"
13 Months of Local AI Progress
In early 2025, a Hugging Face engineer tweeted about running the frontier-level DeepSeek R1 model at Q8 quantization at roughly 5 tokens per second, a setup that required about $6,000 in hardware.
This r/LocalLLaMA post (176 upvotes) provides a striking update: you can now run a significantly more capable model at the same speed on a $600 mini PC. Specifically: Qwen3-27B at Q4 quantization runs at roughly 5 t/s on a $600 AOOSTAR mini PC.
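As a rough sanity check on why a $600 mini PC can hold such a model, weight memory at a given quantization is approximately parameter count × bits per weight ÷ 8. The sketch below uses ~4.5 bits/weight as a typical average for Q4-class GGUF quants (an assumption, not a figure from the post):

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: params * bits / 8.

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a lower bound on required RAM.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 27B dense model at ~4.5 bits/weight:
print(round(quantized_weight_gb(27, 4.5), 1))  # ~15.2 GB
```

At roughly 15 GB of weights, the model fits comfortably in the RAM of a typical 32 GB mini PC, with room left for the KV cache and the OS.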
Want More Usable Speeds?
For more practical inference speeds, Qwen3.5-35B-A3B (MoE architecture) at Q4/Q5 quantization runs at 17-20 t/s on comparable hardware. That is a practically useful speed for everyday AI assistance tasks.
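The MoE speedup has a simple back-of-the-envelope explanation: token generation on CPU-class hardware is largely memory-bandwidth-bound, and a MoE model only reads its *active* parameters per token (the "A3B" naming suggests ~3B active). The sketch below estimates an upper bound on decode speed; the ~4.5 bits/weight and ~30 GB/s effective bandwidth figures are assumptions for illustration, not numbers from the post:

```python
def est_tokens_per_sec(active_params_billion: float,
                       bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Bandwidth-bound upper bound on decode rate.

    Each generated token requires streaming the active weights from
    memory once, so t/s ~= bandwidth / bytes_read_per_token.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~3B active params at ~4.5 bits/weight, ~30 GB/s effective bandwidth:
print(round(est_tokens_per_sec(3, 4.5, 30)))  # ~18
```

An estimate in the high teens lines up with the 17-20 t/s reported, while a dense 27B model reading ~5x more bytes per token lands near the 5 t/s figure above.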
Looking Ahead
The author speculates that at this pace, a 4B model better than today's best could be running locally within a year. The trajectory from $6,000 for 5 t/s frontier inference to $600 for better-than-frontier inference in 13 months suggests that genuinely capable local AI on consumer hardware is no longer a distant prospect.
Why This Matters
The democratization of local AI goes beyond cost savings. It enables privacy-first inference without cloud dependencies, makes high-quality AI accessible in regions with limited internet infrastructure, and shifts the balance of power away from cloud AI providers. The speed of this progress is one of the most remarkable dynamics in the current AI landscape.
Related Articles
Alibaba's Qwen team has released Qwen 3.5 Small, a new small dense model in their flagship open-source series. The announcement topped r/LocalLLaMA with over 1,000 upvotes, reflecting the local AI community's enthusiasm for capable small models.
Users on r/LocalLLaMA have spotted Qwen3.5 model names appearing in Alibaba's official Qwen chat interface, signaling an imminent release of the next generation of Alibaba's open-source LLM series.
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.