OpenRouter links live GPQA and TAU-Bench scores to tool-call routing

Choosing an open-weight model for agents now depends on more than headline quality or price. In a June 28 X post, OpenRouter said it continuously runs GPQA and TAU-Bench on most open-weight models and uses those results inside AutoExacto, its routing system for tool calls.

"OpenRouter continuously runs GPQA and TAU-Bench on most open-weight models and publishes the results publicly. This informs our AutoExacto meta-benchmark, used by default when routing tool calls. Here, @Parasail_io and @Zai_org rank first."

The linked AutoExacto post describes a quality-weighted router that is on by default for requests with tools. Unlike the earlier Exacto mode, which depended on hand-curated endpoint lists, AutoExacto re-evaluates providers roughly every five minutes across throughput, tool-call telemetry, and benchmark scores. OpenRouter says this matters most during the first week of a model launch, when provider variance can be high while serving stacks catch up.
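A minimal sketch of what a quality-weighted ranking over those three signals could look like. The source does not describe AutoExacto's actual formula, so the weights, field names, provider names, and sample numbers below are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    name: str
    throughput_tps: float     # observed tokens per second (hypothetical metric)
    tool_call_success: float  # share of well-formed tool calls, 0..1
    benchmark_score: float    # blended benchmark score, 0..1


# Hypothetical weights; the real system's weighting is not published in the source.
WEIGHTS = {"throughput": 0.2, "tool_calls": 0.5, "benchmark": 0.3}


def quality_score(p: ProviderStats, max_tps: float) -> float:
    """Weighted sum of normalized throughput, tool-call success, and benchmark score."""
    tps_norm = p.throughput_tps / max_tps if max_tps > 0 else 0.0
    return (
        WEIGHTS["throughput"] * tps_norm
        + WEIGHTS["tool_calls"] * p.tool_call_success
        + WEIGHTS["benchmark"] * p.benchmark_score
    )


def rank_providers(providers: list[ProviderStats]) -> list[ProviderStats]:
    """Return providers sorted from highest to lowest quality score."""
    max_tps = max((p.throughput_tps for p in providers), default=0.0)
    return sorted(providers, key=lambda p: quality_score(p, max_tps), reverse=True)


# Made-up sample data for three imaginary providers.
providers = [
    ProviderStats("provider-a", 120.0, 0.97, 0.80),
    ProviderStats("provider-b", 200.0, 0.85, 0.78),
    ProviderStats("provider-c", 90.0, 0.99, 0.81),
]
ranked = rank_providers(providers)
print([p.name for p in ranked])
```

In a setup like this, re-running `rank_providers` on fresh statistics every few minutes would let the ordering shift as a provider's serving stack stabilizes after a launch.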

The GLM 5.2 page linked from the post shows why live evaluation data is becoming product infrastructure. It lists a 1M-token context window, pricing of $0.94 per 1M input tokens and $3 per 1M output tokens, provider-level performance, uptime, benchmark rankings, and app activity in one place. That turns a model page into a live operations view for teams deciding where to route agent traffic.
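To make the listed prices concrete, here is a small cost estimate using the $0.94 input and $3 output rates quoted above. The function name and the token counts in the example are illustrative assumptions, and real prices may differ by provider.

```python
# Prices quoted on the GLM 5.2 page, in USD per 1M tokens.
INPUT_USD_PER_MTOK = 0.94
OUTPUT_USD_PER_MTOK = 3.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its input and output token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical agent step: 200k tokens of context in, 10k tokens out.
cost = estimate_cost_usd(200_000, 10_000)
print(f"${cost:.3f}")  # prints $0.218
```

At these rates a long-context agent step is dominated by input tokens, which is why the input price matters as much as the output price for tool-heavy workloads.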

What to watch next is whether public benchmark ranks predict production tool-call reliability. If OpenRouter keeps exposing GPQA, TAU-Bench, JSON validity, schema matching, and provider uptime together, model selection will look less like a static leaderboard and more like traffic engineering.
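As a rough illustration of what JSON validity and schema matching mean for a single tool call, the sketch below checks a raw arguments string against a tool's parameter schema. It covers only a simplified subset of JSON Schema (required keys and basic types), the helper names are hypothetical, and it is not a description of how OpenRouter computes these metrics.

```python
import json

# Mapping from JSON Schema type names to Python types (simplified subset).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _type_ok(value, type_name) -> bool:
    """Check one value against a JSON Schema type name."""
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown or missing type: not checked in this sketch
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(value, expected)


def check_tool_call(raw_arguments: str, schema: dict) -> tuple[bool, bool]:
    """Return (json_valid, schema_match) for a tool call's raw argument string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return (False, False)
    if not isinstance(args, dict):
        return (True, False)
    props = schema.get("properties", {})
    if any(key not in args for key in schema.get("required", [])):
        return (True, False)
    for key, value in args.items():
        # Unknown arguments and type mismatches both count as schema failures.
        if key not in props or not _type_ok(value, props[key].get("type")):
            return (True, False)
    return (True, True)


# Hypothetical tool schema and model outputs.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city"],
}
print(check_tool_call('{"city": "Oslo", "days": 3}', schema))
print(check_tool_call('{"city": "Oslo", "days": "3"}', schema))
print(check_tool_call('{"city": "Oslo"', schema))
```

Aggregating these two booleans per provider over many requests would give the kind of tool-call telemetry that can sit next to benchmark scores and uptime.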
