OpenRouter links live GPQA and TAU-Bench scores to tool-call routing
Choosing an open-weight model for agents now depends on more than headline quality or price. In a June 28 X post, OpenRouter said it continuously runs GPQA and TAU-Bench on most open-weight models and uses those results inside AutoExacto, its routing system for tool calls.
"OpenRouter continuously runs GPQA and TAU-Bench on most open-weight models and publishes the results publicly. This informs our AutoExacto meta-benchmark, used by default when routing tool calls. Here, @Parasail_io and @Zai_org rank first."
The linked AutoExacto post describes a quality-weighted router that is on by default for requests with tools. Unlike the earlier Exacto mode, which depended on hand-curated endpoint lists, AutoExacto re-evaluates providers roughly every five minutes across throughput, tool-call telemetry, and benchmark scores. OpenRouter says this matters most during the first week of a model launch, when provider variance can be high while serving stacks catch up.
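To make the idea of a quality-weighted router concrete, here is a minimal sketch in Python. It is not OpenRouter's implementation: the post names the three signal families (throughput, tool-call telemetry, benchmark scores) but not how they are combined, so the `ProviderStats` fields, the `WEIGHTS` values, and the linear scoring formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    """Hypothetical per-provider snapshot; field names are illustrative."""
    name: str
    throughput_tps: float     # observed tokens per second
    tool_call_success: float  # fraction of well-formed tool calls, 0..1
    benchmark_score: float    # normalized benchmark score, 0..1


# Assumed weights. The source does not publish how signals are weighted.
WEIGHTS = {"throughput": 0.2, "tool_calls": 0.4, "benchmark": 0.4}


def quality_score(p: ProviderStats, max_tps: float) -> float:
    """Combine the three signals into one score via a weighted sum."""
    # Normalize throughput against the fastest provider in the snapshot.
    tps_norm = p.throughput_tps / max_tps if max_tps > 0 else 0.0
    return (
        WEIGHTS["throughput"] * tps_norm
        + WEIGHTS["tool_calls"] * p.tool_call_success
        + WEIGHTS["benchmark"] * p.benchmark_score
    )


def rank_providers(providers: list[ProviderStats]) -> list[ProviderStats]:
    """Return providers ordered from highest to lowest quality score."""
    max_tps = max((p.throughput_tps for p in providers), default=0.0)
    return sorted(providers, key=lambda p: quality_score(p, max_tps), reverse=True)
```

In a setup like this, a scheduler would rebuild the snapshot on each evaluation cycle (roughly every five minutes, per the post) and call `rank_providers` again, so a provider whose tool-call success drops would fall in the ordering without anyone editing a list by hand.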
The GLM 5.2 page linked from the post shows why this kind of live data is becoming product infrastructure. It lists a 1M-token context window, pricing of $0.94 per 1M input tokens and $3 per 1M output tokens, provider-level performance, uptime, benchmark rankings, and app activity in one place. That turns a model page into a live operations view for teams deciding where to route agent traffic.
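As a rough illustration of what the listed prices mean per request, the sketch below computes cost from token counts. The two rates come from the figures above; the function name and the example token counts are hypothetical, and real billing may include factors this ignores (caching discounts, provider-specific pricing).

```python
# Listed GLM 5.2 prices in USD per 1M tokens, as quoted above.
INPUT_USD_PER_M = 0.94
OUTPUT_USD_PER_M = 3.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical agent step: 200,000 tokens of context in, 4,000 tokens out.
# 200,000 * 0.94 / 1e6 = 0.188 and 4,000 * 3.00 / 1e6 = 0.012, about $0.20 total.
cost = request_cost_usd(200_000, 4_000)
```

At these rates, filling the full 1M-token context once would cost about $0.94 in input tokens alone, which is why long-context agent loops are sensitive to how often the context is resent.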
What to watch next is whether public benchmark ranks predict production tool-call reliability. If OpenRouter keeps exposing GPQA, TAU-Bench, JSON validity, schema matching, and provider uptime together, model selection will look less like a static leaderboard and more like traffic engineering.
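For readers unfamiliar with the last two signals, the sketch below shows one simple way JSON validity and schema matching could be measured for a single tool call. It is an assumption-laden illustration, not OpenRouter's telemetry: `check_tool_call` is a hypothetical helper, and it checks only required keys and top-level primitive types rather than implementing full JSON Schema validation.

```python
import json

# Maps a small subset of JSON Schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value, type_name: str) -> bool:
    """Check a value against one type name; bool is not accepted as a number."""
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, TYPE_MAP[type_name])


def check_tool_call(raw_arguments: str, schema: dict) -> dict:
    """Report whether arguments parse as JSON and fit a simplified schema."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return {"json_valid": False, "schema_match": False}
    if not isinstance(args, dict):
        return {"json_valid": True, "schema_match": False}

    has_required = all(key in args for key in schema.get("required", []))
    types_ok = all(
        _matches(args[key], spec["type"])
        for key, spec in schema.get("properties", {}).items()
        if key in args and spec.get("type") in TYPE_MAP
    )
    return {"json_valid": True, "schema_match": has_required and types_ok}
```

Aggregating results like these per provider over many requests would give the kind of tool-call reliability rate that could sit alongside benchmark scores and uptime.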