Qwen 2.5 → 3 → 3.5: How Alibaba's Smallest Models Have Transformed Across Generations
Original post: "Qwen 2.5 -> 3 -> 3.5, smallest models. Incredible improvement over the generations."
Three Generations of Density Improvements
Alibaba's Qwen model family has seen extraordinary efficiency gains across generations. A community comparison post on r/LocalLLaMA (score: 681) highlighted just how much has changed from Qwen 2.5 to Qwen 3 to Qwen 3.5 in the smallest model tiers.
Qwen 3 vs. Qwen 2.5
Qwen 3 delivered roughly a 50% density improvement over Qwen 2.5: Qwen3-1.7B performs comparably to Qwen2.5-3B, Qwen3-4B to Qwen2.5-7B, and so on up the scale. In practice, users now get the same performance at roughly half the parameter count.
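A quick back-of-the-envelope check of those pairings, using the parameter counts in the model names above (this is just arithmetic on the quoted sizes, not a benchmark):

```python
# Sanity-check the "roughly half the parameters" claim using the
# size pairings quoted above (parameter counts in billions).
pairs = {
    "Qwen3-1.7B vs Qwen2.5-3B": (1.7, 3.0),
    "Qwen3-4B  vs Qwen2.5-7B ": (4.0, 7.0),
}

for label, (qwen3, qwen25) in pairs.items():
    ratio = qwen3 / qwen25
    print(f"{label}: Qwen 3 uses {ratio:.0%} of the parameters")
# -> about 57% in both cases, i.e. close to the "roughly half
#    the parameter count" figure cited in the post.
```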
Qwen 3.5 Small Series (0.8B–9B)
Qwen 3.5's small models (0.8B, 2B, 4B, 9B) are all natively multimodal with 262K context. Performance highlights include:
- The 9B model scores 81.7 on GPQA Diamond, outperforming the previous-gen 80B model (77.2)
- The 9B beats GPT-5-Nano by 13+ points on MMMU-Pro and 30+ points on document understanding
- The 2B model scores 84.5 on OCRBench and 75.6 on VideoMME, surpassing many 7B-class models
- The 4B model can handle text, images, and video on just 8GB of VRAM (see the loading sketch after this list)
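For context on the 8GB figure, here is a minimal sketch of loading a small Qwen 3.5 checkpoint with Hugging Face transformers. The model ID "Qwen/Qwen3.5-4B" is an assumption (check the actual repository name on the Hub), and the sketch is text-only; multimodal image and video inputs would go through the model's processor class instead. Note that 4B parameters in bfloat16 is about 4e9 × 2 bytes ≈ 8 GB for the weights alone, so comfortably fitting in 8GB of VRAM in practice likely implies quantization (e.g., 4-bit).

```python
# Minimal sketch: loading a hypothetical small Qwen 3.5 checkpoint
# with Hugging Face transformers. The model ID below is an assumption;
# substitute the real repository name from the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-4B"  # hypothetical ID, not verified

# Rough weight-memory estimate: 4e9 params * 2 bytes (bf16) ~= 8 GB
# for weights alone; KV cache and activations need extra headroom.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # place layers on available GPU(s)
)

prompt = "Explain the difference between dense and MoE models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```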
Why This Matters
This trajectory shows how quickly the open-source LLM ecosystem is advancing. Capabilities that once required proprietary models with 70B+ parameters are now achievable with locally runnable models. For the local AI community, Qwen 3.5 is setting a new standard for what small open-source models can do.
Related Articles
Alibaba launched Qwen3.5, a 397B-parameter open-weight multimodal model supporting 201 languages. The company claims it outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 on benchmarks, while costing 60% less than its predecessor.
Alibaba Qwen team released the Qwen 3.5 small model series (0.8B to 9B). Models run in-browser via WebGPU and show dramatic benchmark improvements over previous generations.
Alibaba launched Qwen 3.5 on February 16 under Apache 2.0, featuring 397B parameters with a sparse MoE architecture (17B active), 256K context, and native multimodal capabilities matching leading US proprietary models on key benchmarks.