IBM's VAKRA benchmark exposes where tool agents fail

Original: Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

LLM · Apr 17, 2026 · By Insights AI · 2 min read

Agent benchmarks often miss the thing that makes agents hard: the answer is only part of the work. An agent must pick tools, pass the right arguments, retrieve the right evidence, obey constraints and ground its final response in what actually happened. IBM Research’s VAKRA analysis, published on Hugging Face on April 15, 2026, is aimed directly at that gap.

VAKRA is a tool-grounded, executable benchmark for evaluating agents in enterprise-like environments. It gives agents access to more than 8,000 locally hosted APIs backed by real databases across 62 domains, plus domain-aligned document collections. Tasks can require reasoning chains of three to seven steps that mix structured API interaction with unstructured retrieval under natural-language tool-use constraints. The point is not just to see whether a model can answer, but whether it can get there through a valid execution trace.
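To make the shape of such a task concrete, here is a minimal sketch of what a tool-grounded task instance might look like. The field names, dataclasses, and API names (`sales_db.*`) are assumptions for illustration, not the published VAKRA schema; the point is that a task pairs a natural-language goal and constraints with an executable tool-call trajectory.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str              # API endpoint name (hypothetical)
    args: dict             # arguments passed to the tool
    output: object = None  # result recorded after execution

@dataclass
class TaskInstance:
    goal: str              # natural-language task
    constraints: list      # natural-language tool-use policies
    trajectory: list = field(default_factory=list)

# A hypothetical two-step instance; real VAKRA chains can run longer.
task = TaskInstance(
    goal="Find the top-selling region in Q1 and report its returns rate",
    constraints=["Use only the sales_db APIs for revenue figures"],
    trajectory=[
        ToolCall("sales_db.revenue_by_region", {"quarter": "Q1"}),
        ToolCall("sales_db.returns_rate", {"region": "EMEA"}),
    ],
)
print(len(task.trajectory))  # → 2
```

Under this framing, grading the agent means replaying its predicted trajectory against the live APIs, not string-matching the final answer.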

The benchmark is split across four capabilities. API chaining with Business Intelligence APIs has 2,077 test instances across 54 domains. Tool selection with Dashboard APIs has 1,597 instances across 17 domains, with each domain exposing between 6 and 328 tools and an average of 116 tools. Multi-hop reasoning adds 869 instances across 38 domains. The final section, multi-hop multi-source reasoning and policy adherence, has 644 instances across 41 domains and combines APIs, document retrievers, dialog context and source-use policies.
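The split sizes above can be tallied directly; the dict layout here is purely illustrative, not an official data format, but the instance and domain counts are the ones reported by IBM Research.

```python
# (instances, domains) per VAKRA split, as stated in the article
splits = {
    "API chaining (Business Intelligence APIs)": (2077, 54),
    "Tool selection (Dashboard APIs)":           (1597, 17),
    "Multi-hop reasoning":                       (869, 38),
    "Multi-hop multi-source + policy":           (644, 41),
}

total_instances = sum(n for n, _ in splits.values())
print(total_instances)  # → 5187 test instances overall
```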

The evaluator reflects how real failures happen. VAKRA executes predicted tool calls in the same environment as the ground truth and compares intermediate tool outputs, not only the final text. For the policy-heavy section, it first checks policy adherence, then tool-call trajectory, then the final response. That design allows alternative but valid tool paths while still penalizing missing evidence, wrong arguments, hallucinated parameters or answers not grounded in the retrieved results.
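The staged check order described above (policy first, then tool-call trajectory, then the final response) can be sketched as follows. Every function and dict key here is hypothetical; VAKRA's actual evaluator executes predicted calls in the same live environment as the ground truth, whereas this toy version just compares pre-recorded outputs.

```python
def evaluate(pred, gold, policies):
    # Stage 1: policy adherence is checked first for the policy-heavy split.
    for policy in policies:
        if not policy(pred["calls"]):
            return {"pass": False, "stage": "policy"}
    # Stage 2: compare executed tool outputs rather than call syntax,
    # so alternative-but-valid tool paths can still pass.
    pred_outputs = [c["output"] for c in pred["calls"]]
    gold_outputs = [c["output"] for c in gold["calls"]]
    if not all(o in pred_outputs for o in gold_outputs):
        return {"pass": False, "stage": "trajectory"}
    # Stage 3: the final answer must be grounded in retrieved evidence.
    grounded = any(str(o) in pred["answer"] for o in gold_outputs)
    return {"pass": grounded, "stage": "final"}

pred = {"calls": [{"output": 42}], "answer": "The value is 42."}
gold = {"calls": [{"output": 42}]}
result = evaluate(pred, gold, policies=[lambda calls: len(calls) <= 3])
print(result)  # → {'pass': True, 'stage': 'final'}
```

Even this toy version shows why the ordering matters: a policy violation short-circuits the check, so a model cannot earn partial credit for a trajectory it was not allowed to take.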

The findings are a reality check for agent claims. IBM Research says models perform poorly on VAKRA overall. GPT-OSS-120B was strongest in the Business Intelligence API segment, largely from better tool schema understanding. Gemini-3-flash-preview led the tested models across error categories in Dashboard API tool selection. Yet all models degrade as hop depth increases, and policy constraints introduce another failure mode: models either violate the constraint or fail to retrieve enough information. VAKRA’s message is blunt: being able to call tools is not the same as reliable, end-to-end agent behavior.




© 2026 Insights. All rights reserved.