IBM's VAKRA benchmark exposes where tool agents fail

Original: Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

LLM · Apr 17, 2026 · By Insights AI · 2 min read

Agent benchmarks often miss the thing that makes agents hard: the answer is only part of the work. An agent must pick tools, pass the right arguments, retrieve the right evidence, obey constraints and ground its final response in what actually happened. IBM Research’s VAKRA analysis, published on Hugging Face on April 15, 2026, is aimed directly at that gap.

VAKRA is a tool-grounded, executable benchmark for evaluating agents in enterprise-like environments. It gives agents access to more than 8,000 locally hosted APIs backed by real databases across 62 domains, plus domain-aligned document collections. Tasks can require reasoning chains of three to seven steps that mix structured API interaction with unstructured retrieval under natural-language tool-use constraints. The point is not just to see whether a model can answer, but whether it can get there through a valid execution trace.
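To make the shape of such a task concrete, here is a minimal sketch of what a tool-grounded task instance might look like. The field names, dataclasses, and API names (`sales_db.*`) are assumptions for illustration, not the published VAKRA schema; the point is that a task pairs a natural-language goal and constraints with an executable tool-call trajectory.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str              # API endpoint name (hypothetical)
    args: dict             # arguments passed to the tool
    output: object = None  # result recorded after execution

@dataclass
class TaskInstance:
    goal: str              # natural-language task
    constraints: list      # natural-language tool-use policies
    trajectory: list = field(default_factory=list)

# A hypothetical two-step instance; real VAKRA chains can run longer.
task = TaskInstance(
    goal="Find the top-selling region in Q1 and report its returns rate",
    constraints=["Use only the sales_db APIs for revenue figures"],
    trajectory=[
        ToolCall("sales_db.revenue_by_region", {"quarter": "Q1"}),
        ToolCall("sales_db.returns_rate", {"region": "EMEA"}),
    ],
)
print(len(task.trajectory))  # → 2
```

Under this framing, grading the agent means replaying its predicted trajectory against the live APIs, not string-matching the final answer.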

The benchmark is split across four capabilities. API chaining with Business Intelligence APIs has 2,077 test instances across 54 domains. Tool selection with Dashboard APIs has 1,597 instances across 17 domains, with each domain exposing between 6 and 328 tools and an average of 116 tools. Multi-hop reasoning adds 869 instances across 38 domains. The final section, multi-hop multi-source reasoning and policy adherence, has 644 instances across 41 domains and combines APIs, document retrievers, dialog context and source-use policies.
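The split sizes above can be tallied directly; the dict layout here is purely illustrative, not an official data format, but the instance and domain counts are the ones reported by IBM Research.

```python
# (instances, domains) per VAKRA split, as stated in the article
splits = {
    "API chaining (Business Intelligence APIs)": (2077, 54),
    "Tool selection (Dashboard APIs)":           (1597, 17),
    "Multi-hop reasoning":                       (869, 38),
    "Multi-hop multi-source + policy":           (644, 41),
}

total_instances = sum(n for n, _ in splits.values())
print(total_instances)  # → 5187 test instances overall
```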

The evaluator reflects how real failures happen. VAKRA executes predicted tool calls in the same environment as the ground truth and compares intermediate tool outputs, not only the final text. For the policy-heavy section, it first checks policy adherence, then tool-call trajectory, then the final response. That design allows alternative but valid tool paths while still penalizing missing evidence, wrong arguments, hallucinated parameters or answers not grounded in the retrieved results.
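The staged check order described above (policy first, then tool-call trajectory, then the final response) can be sketched as follows. Every function and dict key here is hypothetical; VAKRA's actual evaluator executes predicted calls in the same live environment as the ground truth, whereas this toy version just compares pre-recorded outputs.

```python
def evaluate(pred, gold, policies):
    # Stage 1: policy adherence is checked first for the policy-heavy split.
    for policy in policies:
        if not policy(pred["calls"]):
            return {"pass": False, "stage": "policy"}
    # Stage 2: compare executed tool outputs rather than call syntax,
    # so alternative-but-valid tool paths can still pass.
    pred_outputs = [c["output"] for c in pred["calls"]]
    gold_outputs = [c["output"] for c in gold["calls"]]
    if not all(o in pred_outputs for o in gold_outputs):
        return {"pass": False, "stage": "trajectory"}
    # Stage 3: the final answer must be grounded in retrieved evidence.
    grounded = any(str(o) in pred["answer"] for o in gold_outputs)
    return {"pass": grounded, "stage": "final"}

pred = {"calls": [{"output": 42}], "answer": "The value is 42."}
gold = {"calls": [{"output": 42}]}
result = evaluate(pred, gold, policies=[lambda calls: len(calls) <= 3])
print(result)  # → {'pass': True, 'stage': 'final'}
```

Even this toy version shows why the ordering matters: a policy violation short-circuits the check, so a model cannot earn partial credit for a trajectory it was not allowed to take.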

The findings are a reality check for agent claims. IBM Research says models perform poorly on VAKRA overall. GPT-OSS-120B was strongest in the Business Intelligence API segment, largely from better tool schema understanding. Gemini-3-flash-preview led the tested models across error categories in Dashboard API tool selection. Yet all models degrade as hop depth increases, and policy constraints introduce another failure mode: models either violate the constraint or fail to retrieve enough information. VAKRA’s message is blunt: being able to call tools is not the same as reliable, end-to-end agent behavior.




© 2026 Insights. All rights reserved.