Arena turns 10M model votes into a $100M AI-evaluation business
Original: Arena, the AI leaderboard everyone uses, is now a $100M business
A free leaderboard has become a price signal for the model economy. Arena, the AI model comparison service that began as a UC Berkeley research project, has reached a $100 million annualized revenue run rate eight months after launching its commercial evaluation product.
The public site is familiar to AI developers: a user enters a prompt, receives outputs from two models, and chooses the better response. TechCrunch reports that Arena’s leaderboard is now built from more than 10 million user evaluations. Those comparisons are valuable because they measure model performance in the messy, preference-driven way buyers and builders actually experience it.
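How a stream of head-to-head votes becomes a ranking can be sketched with a generic Elo-style update. This is only an illustration of the idea: the source does not describe Arena's actual rating method, and the model names, the starting rating of 1000, and the step size K are all hypothetical.

```python
from collections import defaultdict

BASE_RATING = 1000.0  # hypothetical starting score for every model
K = 32.0              # hypothetical step size for each vote


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(votes):
    """Fold (model_a, model_b, winner) votes into per-model ratings."""
    ratings = defaultdict(lambda: BASE_RATING)
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b):
            raise ValueError("winner must be one of the two compared models")
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = 1.0 if winner == model_a else 0.0
        delta = K * (s_a - e_a)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Illustrative votes only; these are not real Arena data.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-z", "model-y"),
]
ratings = rate(votes)
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.1f}")
```

With only three votes the ordering is fragile, which is one reason the scale matters: a leaderboard built from millions of comparisons averages out individual taste in a way a handful of votes cannot.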
Arena began monetizing in September 2025 with AI Evaluations, a service that sells deeper performance analytics to model labs and enterprises. The company calls the milestone ARR, though CEO Anastasios Angelopoulos clarified that customers are charged on a consumption basis rather than through a classic recurring subscription. That distinction matters for finance teams, but it does not reduce the strategic signal: model evaluation is now a large commercial category.
The growth also shows why leaderboard traffic matters. Model labs need feedback for post-training and product positioning, while enterprises need evidence that a model performs well on their own mix of writing, coding, vision, image generation and agent tasks. Arena has expanded beyond basic chat battles into specialized rankings and Agent Mode for more complex workflows.
The revenue trajectory is steep. When Arena raised a $150 million Series A in January 2026 at a $1.7 billion post-money valuation, its annualized revenue was reported at $30 million. Moving to $100 million within months suggests that evaluation data is becoming a core budget item rather than a side project. The next question is whether crowdsourced preference data will remain sufficient as enterprises demand private, domain-specific scoring for production model selection.
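The arithmetic behind those figures can be checked in a few lines. The dollar amounts come from the article; the four-month window between the January raise and the $100 million milestone is an assumption (commercial launch in September 2025 plus eight months), and the steady compounding is a simplification.

```python
SERIES_A_RUN_RATE = 30e6    # annualized revenue reported at the January 2026 raise
CURRENT_RUN_RATE = 100e6    # annualized run rate reported now
POST_MONEY = 1.7e9          # Series A post-money valuation
ASSUMED_MONTHS = 4          # assumption: January 2026 to roughly May 2026


def monthly_growth(start: float, end: float, months: int) -> float:
    """Constant monthly growth rate that compounds start into end."""
    return (end / start) ** (1.0 / months) - 1.0


growth_multiple = CURRENT_RUN_RATE / SERIES_A_RUN_RATE
multiple_at_raise = POST_MONEY / SERIES_A_RUN_RATE
# Mixes the January valuation with the later run rate; shown for scale only.
multiple_on_current = POST_MONEY / CURRENT_RUN_RATE
implied_monthly = monthly_growth(SERIES_A_RUN_RATE, CURRENT_RUN_RATE, ASSUMED_MONTHS)

print(f"Run-rate growth since the raise: {growth_multiple:.2f}x")
print(f"Valuation / run rate at the raise: {multiple_at_raise:.1f}x")
print(f"January valuation / current run rate: {multiple_on_current:.1f}x")
print(f"Implied monthly growth over {ASSUMED_MONTHS} months: {implied_monthly:.1%}")
```

Because customers pay on a consumption basis, the run rate is an annualized snapshot of recent usage rather than contracted recurring revenue, so these multiples are best read as rough indicators.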