OpenRouter links live GPQA and TAU-Bench scores to tool-call routing

Original: OpenRouter ties live GPQA and TAU-Bench scores to tool-call routing View original →

LLM | Jun 30, 2026 | By Insights AI (Twitter) | 1 min read

Choosing an open-weight model for agents now depends on more than headline quality or price. In a June 28 X post, OpenRouter said it continuously runs GPQA and TAU-Bench on most open-weight models and uses those results inside AutoExacto, its routing system for tool calls.

"OpenRouter continuously runs GPQA and TAU-Bench on most open-weight models and publishes the results publicly. This informs our AutoExacto meta-benchmark, used by default when routing tool calls. Here, @Parasail_io and @Zai_org rank first."

The linked AutoExacto post describes a quality-weighted router that is enabled by default for requests that include tools. Unlike the earlier Exacto mode, which depended on hand-curated endpoint lists, AutoExacto re-evaluates providers roughly every five minutes on throughput, tool-call telemetry, and benchmark scores. OpenRouter says this matters most during the first week of a model launch, when provider variance can be high while serving stacks catch up.
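To make the idea of a quality-weighted ranking concrete, here is a minimal sketch in Python. It is not OpenRouter's implementation: the `ProviderStats` fields, the weights, and the provider names are all hypothetical, chosen only to show how throughput, tool-call telemetry, and benchmark scores could be folded into one score.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    throughput_tps: float     # observed tokens per second
    tool_call_success: float  # share of tool calls that succeeded, 0..1
    benchmark_score: float    # normalized benchmark result, 0..1


# Hypothetical weights; the source does not publish AutoExacto's weighting.
WEIGHTS = {"throughput": 0.2, "tool_calls": 0.5, "benchmark": 0.3}


def quality_score(p: ProviderStats, max_tps: float) -> float:
    """Combine the three signals into a single weighted score."""
    norm_tps = p.throughput_tps / max_tps if max_tps > 0 else 0.0
    return (
        WEIGHTS["throughput"] * norm_tps
        + WEIGHTS["tool_calls"] * p.tool_call_success
        + WEIGHTS["benchmark"] * p.benchmark_score
    )


def rank_providers(providers: list[ProviderStats]) -> list[ProviderStats]:
    """Return providers ordered from highest to lowest score."""
    if not providers:
        return []
    max_tps = max(p.throughput_tps for p in providers)
    return sorted(providers, key=lambda p: quality_score(p, max_tps), reverse=True)


# Made-up numbers: provider-b is faster, provider-a is more reliable on tools.
providers = [
    ProviderStats("provider-a", 120.0, 0.98, 0.80),
    ProviderStats("provider-b", 200.0, 0.80, 0.78),
]
ranked = rank_providers(providers)
print([p.name for p in ranked])
```

With these illustrative weights, the slower but more reliable provider ranks first. A router working this way would rerun the ranking whenever fresh measurements arrive, which the post puts at roughly every five minutes.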

The GLM 5.2 page linked from the X post shows how this data is becoming product infrastructure. In one place it lists a 1M-token context window, pricing of $0.94 per 1M input tokens and $3 per 1M output tokens, provider-level performance, uptime, benchmark rankings, and app activity. That turns a model page into a live operations view for teams deciding where to route agent traffic.
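The listed prices translate directly into a per-request cost. The sketch below uses the two rates quoted above; the token counts are hypothetical and stand in for a long-context agent call.

```python
# Rates quoted on the model page, in USD per 1M tokens.
INPUT_USD_PER_M = 0.94
OUTPUT_USD_PER_M = 3.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


# Hypothetical request: 200k tokens of context in, 4k tokens out.
cost = request_cost_usd(200_000, 4_000)
print(f"${cost:.2f}")
```

For that example the input side costs about $0.188 and the output side $0.012, so the request comes to roughly $0.20.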

What to watch next is whether public benchmark ranks predict production tool-call reliability. If OpenRouter keeps exposing GPQA, TAU-Bench, JSON validity, schema matching, and provider uptime together, model selection will look less like a static leaderboard and more like traffic engineering.
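JSON validity and schema matching are the two signals in that list that a team can measure on its own traffic. A minimal sketch, assuming tool-call arguments arrive as a JSON string and using a hypothetical `WEATHER_SCHEMA` that maps argument names to expected types (a real deployment would more likely validate against a full JSON Schema):

```python
import json

# Hypothetical tool schema: argument name -> expected Python type.
WEATHER_SCHEMA = {"city": str, "days": int}


def check_tool_call(raw_arguments: str, schema: dict) -> tuple[bool, bool]:
    """Return (json_valid, schema_match) for one tool call's argument string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False, False  # not parseable, so it cannot match the schema
    if not isinstance(args, dict):
        return True, False   # valid JSON, but not an argument object
    matches = set(args) == set(schema) and all(
        isinstance(args[key], expected) for key, expected in schema.items()
    )
    return True, matches


good = check_tool_call('{"city": "Seoul", "days": 3}', WEATHER_SCHEMA)
missing = check_tool_call('{"city": "Seoul"}', WEATHER_SCHEMA)
broken = check_tool_call('{"city": ', WEATHER_SCHEMA)
print(good, missing, broken)
```

Counting how often each flag is true per provider gives two rates that can sit alongside uptime and benchmark scores when comparing endpoints.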
