OpenRouter connects live GPQA and TAU-Bench scores to tool routing

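Choosing an open-weight model for an agent can no longer rest on headline quality and price figures alone. In an X post on June 28, 2026, OpenRouter said it continuously runs GPQA and TAU-Bench on most open-weight models and feeds the results into AutoExacto, its routing for tool calls.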

"OpenRouter continuously runs GPQA and TAU-Bench on most open-weight models and publishes the results publicly. This informs our AutoExacto meta-benchmark, used by default when routing tool calls. Here, @Parasail_io and @Zai_org rank first."

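The AutoExacto documentation page positions it as quality-focused routing that is applied by default to requests that include tools. The earlier Exacto was a hand-curated list of endpoints, whereas AutoExacto re-evaluates throughput, tool-call telemetry, and benchmark scores roughly every five minutes. Variance between providers is large right after a new model is released, so the aim is to automatically demote endpoints that have not yet stabilized.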
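The re-evaluation described above can be pictured with a minimal sketch. OpenRouter has not published AutoExacto's formula here, so everything below is an assumption for illustration: the `EndpointStats` fields, the 0.2/0.4/0.4 weights, and the `min_samples` threshold are hypothetical. The sketch only shows the general idea of blending the three signals and discounting endpoints that have too little recent traffic to be trusted.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    """Hypothetical per-endpoint snapshot; field names are illustrative only."""
    provider: str
    throughput_tps: float     # tokens per second over the recent window
    tool_call_success: float  # fraction of well-formed tool calls, 0..1
    benchmark_score: float    # normalized benchmark score, 0..1
    samples: int              # recent requests observed for this endpoint


def score(e: EndpointStats, max_tps: float, min_samples: int = 50) -> float:
    """Blend the three signals, then discount endpoints with little data."""
    tps = min(e.throughput_tps / max_tps, 1.0) if max_tps > 0 else 0.0
    # Assumed weights: quality signals count more than raw speed.
    raw = 0.2 * tps + 0.4 * e.tool_call_success + 0.4 * e.benchmark_score
    # An endpoint with few observations is treated as not yet stable.
    confidence = min(e.samples / min_samples, 1.0)
    return raw * confidence


def rank(endpoints: list[EndpointStats]) -> list[EndpointStats]:
    """Return endpoints ordered best first; rerun on each refresh cycle."""
    max_tps = max((e.throughput_tps for e in endpoints), default=0.0)
    return sorted(endpoints, key=lambda e: score(e, max_tps), reverse=True)


if __name__ == "__main__":
    established = EndpointStats("provider-a", 100.0, 0.95, 0.80, samples=500)
    brand_new = EndpointStats("provider-b", 150.0, 0.99, 0.85, samples=5)
    for e in rank([brand_new, established]):
        print(e.provider)
```

In this toy data the newly launched endpoint has better raw numbers but only five observations, so it ranks below the established one until more traffic accumulates.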

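The GLM 5.2 page linked from the tweet also matters. It puts a 1M-token context window, pricing of $0.94 input and $3 output per 1M tokens, per-provider performance, uptime, benchmarks, and app activity on a single screen. The model catalog is shifting from a plain list toward something closer to an operations-monitoring view.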
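To make the listed prices concrete, here is a small worked calculation using the $0.94 input and $3 output per 1M tokens quoted above. The request sizes are made-up examples, and the function name is hypothetical; real bills may also depend on provider-specific pricing not covered here.

```python
INPUT_USD_PER_M = 0.94   # listed input price per 1M tokens
OUTPUT_USD_PER_M = 3.00  # listed output price per 1M tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-million-token prices."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


if __name__ == "__main__":
    # Example agent turn: 200,000 tokens of context in, 4,000 tokens out.
    print(f"${request_cost_usd(200_000, 4_000):.2f}")
    # Filling the whole 1M-token context costs $0.94 in input alone.
    print(f"${request_cost_usd(1_000_000, 0):.2f}")
```

The first example comes to about $0.20 per turn, which shows why long-context agent loops are dominated by input cost at these prices.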

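The next thing to watch is how closely public benchmark rankings track actual tool-call reliability. If GPQA, TAU-Bench, JSON validity, schema conformance, and uptime keep being published side by side, model selection will look less like reading a static leaderboard and more like traffic engineering.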
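One way a reader could check that agreement, once both sets of numbers are public, is a rank correlation. The sketch below computes Spearman's rho in plain Python. The scores in the usage example are invented purely for illustration and are not OpenRouter data.

```python
def ranks(values: list[float]) -> list[float]:
    """Average ranks (1 = smallest), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Invented numbers: benchmark score vs. observed tool-call success rate.
    benchmark = [0.80, 0.70, 0.60, 0.50]
    tool_success = [0.90, 0.95, 0.85, 0.80]
    print(round(spearman(benchmark, tool_success), 3))
```

A value near 1 would mean the public ranking is a good proxy for tool-call reliability, while a value near 0 would mean the two need to be tracked separately.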

Open-weight model gap is 3 to 6 months: OpenRouter's four-model roundup

OpenRouter organized June's open-weight models around four: DeepSeek V4 Flash, GLM 5.2, MiniMax M3, and NVIDIA Nemotron 3 Ultra. The deciding factors are a 79.0% SWE-bench Verified score, an Intelligence Index of 51, 1M context, and low serving cost.

#openrouter #open-weight #llm

Four open-weight models mark a new phase of quality and price competition, moving from cheap inference to production agent infrastructure

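The contest among open-weight LLMs is shifting from unit-price comparison to the design of agent implementations. OpenRouter highlighted four models for June 2026, citing DeepSeek V4 Flash's 79.0% on SWE-bench, GLM 5.2's AA Index of 51, and 1M context as concrete examples.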

#openrouter #open-weight #benchmarks

OpenRouter Benchmarks API lets agents look up the latest model rankings at runtime

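Model selection is becoming a runtime routing problem rather than a static leaderboard lookup. OpenRouter showed that its Benchmarks API can return live scores, including those from Artificial Analysis and Design Arena, and that GLM-5.2 ranks near the top for coding and design.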

#openrouter #benchmarks #glm-5.2
