ERNIE 5.1 hits #13 globally while cutting pretraining cost to 6%
Original: "Introducing ERNIE 5.1 Preview — now live! 🚀 Ranked #13 globally and #1 among Chinese labs on @arena's Text Arena. Top-10 worldwide across:…"
What the leaderboard tweet actually says
Benchmark brag posts are easy to ignore until they pair rank with cost compression. ERNIE 5.1 Preview did both. In its April 29 X post, Baidu's developer-facing ERNIE account said the model now ranks No. 13 globally and No. 1 among Chinese labs on LMArena's Text Arena, while cutting total parameters to about one-third of ERNIE 5.0's, active parameters to about one-half, and pretraining cost to roughly 6% of that of comparable models.
"Ranked #13 globally and #1 among Chinese labs on Text Arena."
The linked ERNIE blog adds the category-level detail: #9 in Math, #1 in Legal & Government, #4 in Business, Management & Financial Ops, and #7 in Software & IT Services. Baidu also attributes the result to decoupled fully-asynchronous reinforcement learning and scaled agentic post-training. Even if one treats vendor-written leaderboard posts cautiously, the combination of rank and compressed training cost is the signal worth tracking.
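Baidu has not published details of that RL pipeline, but "decoupled fully-asynchronous" typically describes rollout generation and gradient updates running independently rather than in lockstep. A minimal sketch of that general pattern, assuming a simple queue-based handoff; none of the names or structure below come from Baidu:

```python
# Minimal sketch of decoupled asynchronous RL: rollout workers and the
# trainer run on their own clocks, exchanging data through a bounded queue
# instead of synchronizing every step. Illustrative pattern only.
import queue
import random
import threading

rollouts: "queue.Queue[list[float]]" = queue.Queue(maxsize=64)

def rollout_worker(stop: threading.Event) -> None:
    # Generates trajectories with a (possibly stale) policy snapshot.
    while not stop.is_set():
        trajectory = [random.random() for _ in range(8)]  # stand-in rewards
        try:
            rollouts.put(trajectory, timeout=0.1)
        except queue.Full:
            continue  # trainer is behind; keep checking the stop flag

def trainer(stop: threading.Event, steps: int = 100) -> None:
    # Consumes whatever rollouts are ready; generation never waits on updates.
    for _ in range(steps):
        batch = rollouts.get()
        _ = sum(batch) / len(batch)  # a gradient update would replace this
    stop.set()

stop_flag = threading.Event()
workers = [threading.Thread(target=rollout_worker, args=(stop_flag,))
           for _ in range(4)]
for w in workers:
    w.start()
trainer(stop_flag)
for w in workers:
    w.join()
```

The design choice the phrase points at is that rollout throughput is no longer gated on the learner: stale-but-plentiful trajectories trade a little policy freshness for much higher hardware utilization.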
Why this matters beyond one Arena update
The Chinese model race is no longer only about absolute size or domestic ranking. Cost-efficient training and strong category performance matter more if labs want to refresh previews quickly and still hold their place against larger rivals. A model that reaches upper-tier Arena placement with a much smaller effective training bill changes how often a lab can iterate and how aggressively it can price API access later.
The ErnieforDevs account usually posts release and evaluation milestones for Baidu's developer stack, so this tweet fits a pattern: ship a preview, validate it in a public ranking, then point developers toward direct testing. What to watch next is whether ERNIE 5.1 Preview shows up in broader third-party benchmarks and products beyond Arena, and whether Baidu discloses enough API or deployment detail to prove the cost-performance story in real workloads. Source: ERNIE source tweet · ERNIE blog post