IBM VAKRA measures where tool agents break, inside an executable environment

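Agent benchmarks are hard because the answer alone does not capture the whole job. An agent has to select tools, pass correct arguments, retrieve evidence, respect constraints, and write a final response grounded in the results it actually obtained. The VAKRA analysis that IBM Research published on Hugging Face on April 15, 2026 addresses this gap head-on.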

VAKRA is a tool-grounded executable benchmark that evaluates agents in enterprise-like environments. It provides 8,000+ locally hosted APIs backed by real databases across 62 domains, along with domain-aligned document collections. Tasks require 3-7 step reasoning chains and combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. The benchmark checks not only the final answer but also whether the agent followed a valid execution trace.

The benchmark is organized around four capabilities. API chaining, which uses Business Intelligence APIs, contains 2,077 test instances across 54 domains. Tool selection with Dashboard APIs has 1,597 instances across 17 domains, with between 6 and 328 tools per domain and 116 on average. Multi-hop reasoning has 869 instances across 38 domains. The last capability, multi-hop multi-source reasoning and policy adherence, has 644 instances across 41 domains and combines APIs, document retrievers, dialog context, and source-use policies.
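To put the four segments side by side, the figures quoted above can be tallied in a few lines. This is only a convenience sketch: the tuple layout and variable names are made up for illustration, and the total assumes the four test sets do not share instances, which the article does not state.

```python
# Figures copied from the paragraph above: (capability, domains, test instances).
capabilities = [
    ("API chaining (Business Intelligence APIs)", 54, 2077),
    ("Tool selection (Dashboard APIs)", 17, 1597),
    ("Multi-hop reasoning", 38, 869),
    ("Multi-hop multi-source reasoning and policy adherence", 41, 644),
]

# Assumption: the four test sets are disjoint, so instance counts can be added.
total_instances = sum(count for _, _, count in capabilities)

# Share of the combined test set per capability, in percent.
shares = {name: 100 * count / total_instances for name, _, count in capabilities}

# The per-capability domain counts add up to more than 62,
# so the same domain must appear under more than one capability.
domain_count_sum = sum(domains for _, domains, _ in capabilities)

for name, domains, count in capabilities:
    print(f"{name}: {count} instances, {domains} domains, {shares[name]:.1f}% of total")
print(f"total instances: {total_instances}")
```

Under that assumption the segments add up to 5,187 test instances, with API chaining the largest and the policy-adherence segment the smallest.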

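The evaluator is also built around real failures. VAKRA executes the predicted tool calls in the same environment as the ground truth and compares intermediate tool outputs, not just the final text. In sections that include policies, it runs a waterfall evaluation in the order policy adherence, tool-call trajectory, final response. This lets it accept a task solved correctly through a different tool path while still catching missing evidence, wrong arguments, hallucinated parameters, and answers that are not grounded.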
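The waterfall idea can be sketched roughly as follows. This is not VAKRA's evaluator: the `ToolCall` and `Verdict` types, the `waterfall_evaluate` function, the toy environment, and both checker functions are hypothetical stand-ins, and the matching rule (every reference output must appear among the predicted outputs) is a simplifying assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Verdict:
    passed: bool
    failed_stage: Optional[str]  # None when every stage passes


def waterfall_evaluate(
    predicted: List[ToolCall],
    reference: List[ToolCall],
    execute: Callable[[ToolCall], Any],
    policy_ok: Callable[[List[ToolCall]], bool],
    response_ok: Callable[[str, List[Any]], bool],
    final_response: str,
) -> Verdict:
    # Stage 1: policy adherence. A violation stops the evaluation here.
    if not policy_ok(predicted):
        return Verdict(False, "policy")

    # Stage 2: trajectory. Run both traces in the same environment and
    # compare what the tools returned, not how the calls were written.
    predicted_outputs = [execute(call) for call in predicted]
    reference_outputs = [execute(call) for call in reference]
    if any(out not in predicted_outputs for out in reference_outputs):
        return Verdict(False, "trajectory")

    # Stage 3: the final response must reflect the retrieved evidence.
    if not response_ok(final_response, reference_outputs):
        return Verdict(False, "response")
    return Verdict(True, None)


# Hypothetical toy environment: two tools that reach the same data.
ORDERS = {"acme": 3, "globex": 5}


def toy_execute(call: ToolCall) -> Any:
    if call.name == "count_orders":
        return ORDERS.get(call.arguments.get("customer"))
    if call.name == "lookup_orders":
        return ORDERS.get(call.arguments.get("name"))
    return None  # unknown tool yields no evidence


def toy_policy(calls: List[ToolCall]) -> bool:
    return all(call.name != "forbidden_tool" for call in calls)


def toy_response(text: str, outputs: List[Any]) -> bool:
    return all(str(out) in text for out in outputs)


reference = [ToolCall("count_orders", {"customer": "acme"})]
alternate_path = [ToolCall("lookup_orders", {"name": "acme"})]
wrong_argument = [ToolCall("count_orders", {"customer": "globex"})]

ok = waterfall_evaluate(alternate_path, reference, toy_execute,
                        toy_policy, toy_response, "Acme has 3 orders")
bad = waterfall_evaluate(wrong_argument, reference, toy_execute,
                         toy_policy, toy_response, "Acme has 3 orders")
```

In this toy run, `ok` passes even though the agent used a different tool, because the executed output matches the reference. `bad` fails at the trajectory stage: its final text is correct, but the call with the wrong argument never produced the supporting evidence.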

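The results read as a reality check on agent claims. IBM Research states that models perform poorly on VAKRA. In the Business Intelligence API segment, GPT-OSS-120B did best, owing to its strong understanding of tool schemas. In Dashboard API tool selection, Gemini-3-flash-preview led across the error categories among the tested models. But performance drops for every model as hop depth increases, and policy constraints add yet another class of failure. Being able to make a tool call is still not the same as reliable end-to-end enterprise agent behavior.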
