IBM VAKRA measures where tool agents break down, using an executable environment

Agent benchmarks are hard because "did the model get the right answer" is not enough to describe real work. The VAKRA analysis that IBM Research published on Hugging Face on April 15, 2026 tackles this problem head-on. VAKRA is an executable benchmark that evaluates whether an agent can reason and act using APIs and documents in an enterprise-like environment.

The scale is substantial. VAKRA provides 8,000+ locally hosted APIs built on real databases and domain-aligned document collections spanning 62 domains. Tasks typically require 3-7 step reasoning chains and mix structured API interaction with unstructured retrieval under natural-language tool-use constraints. Rather than grading only the final answer, VAKRA also evaluates the tool-call trajectory and the intermediate results.

The benchmark is split into four capabilities. API chaining using Business Intelligence APIs covers 2,077 test instances across 54 domains, and tool selection using Dashboard APIs covers 1,597 instances across 17 domains. Multi-hop reasoning has 869 instances across 38 domains, and multi-hop multi-source reasoning and policy adherence has 644 instances across 41 domains. In the Dashboard API tasks in particular, each domain exposes between 6 and 328 tools, with an average of 116.
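To make the split easier to compare, the reported figures can be tallied in a few lines. This is only a sketch built from the numbers quoted above; the variable names are illustrative and not part of VAKRA.

```python
# (capability, domains, test instances) as quoted in this article.
CAPABILITIES = [
    ("API chaining (Business Intelligence APIs)", 54, 2077),
    ("Tool selection (Dashboard APIs)", 17, 1597),
    ("Multi-hop reasoning", 38, 869),
    ("Multi-hop multi-source reasoning and policy adherence", 41, 644),
]

# Sum the test instances across all four capabilities.
total_instances = sum(instances for _, _, instances in CAPABILITIES)

for name, domains, instances in CAPABILITIES:
    share = instances / total_instances
    print(f"{name}: {domains} domains, {instances} instances ({share:.1%})")

print(f"Total test instances: {total_instances}")
```

The four splits add up to 5,187 test instances. The per-capability domain counts add up to 150, more than the 62 domains overall, which suggests that many domains appear in more than one capability.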

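The evaluation is also designed to catch the failures agent developers know well. The VAKRA evaluator executes the predicted tool calls in the same environment and compares the results with the ground-truth tool responses. For Capability 4, it uses a waterfall pipeline that verifies policy adherence first, then checks the tool sequence and the final response. Because of this structure, the evaluator can judge whether a model retrieved all the required information even when it solved the task along a different path.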
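The waterfall idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TOOLS` registry stands in for the locally hosted environment, the policy is reduced to a hypothetical set of forbidden tool names, and the final answer is compared by normalized string match. The actual VAKRA evaluator may differ on each of these points.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs

def call(name, **kwargs):
    """Build a ToolCall from keyword arguments."""
    return ToolCall(name, tuple(sorted(kwargs.items())))

# Hypothetical stand-in for the benchmark environment.
TOOLS = {
    "get_customer": lambda customer_id: {"id": customer_id, "region": "EU"},
    "list_orders": lambda region: [{"order": 1, "region": region}],
}

def execute(tool_call):
    """Run a tool call in the shared environment; report failures as data."""
    try:
        return TOOLS[tool_call.name](**dict(tool_call.args))
    except (KeyError, TypeError) as exc:
        return {"error": type(exc).__name__}

def evaluate(predicted, gold, final_answer, gold_answer,
             forbidden_tools=frozenset()):
    # Stage 1: policy adherence is checked before anything else.
    if any(c.name in forbidden_tools for c in predicted):
        return "policy_violation"
    # Stage 2: execute both trajectories and compare tool responses,
    # so a different call order or path still passes if every
    # ground-truth response was retrieved.
    predicted_responses = [execute(c) for c in predicted]
    for gold_call in gold:
        if execute(gold_call) not in predicted_responses:
            return "missing_tool_result"
    # Stage 3: only now is the final response compared.
    if final_answer.strip().lower() != gold_answer.strip().lower():
        return "wrong_final_answer"
    return "pass"

gold = [call("get_customer", customer_id=7), call("list_orders", region="EU")]
reordered = list(reversed(gold))
print(evaluate(reordered, gold, "1 order", "1 order"))
print(evaluate(gold[:1], gold, "1 order", "1 order"))
```

Comparing executed responses rather than call names is what lets an alternative path pass, while a trajectory that skips a required lookup fails at stage 2 even if the final answer happens to be right.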

The results are not encouraging. IBM Research summarized that models perform poorly on VAKRA. In the Business Intelligence API segment, GPT-OSS-120B was the strongest thanks to its grasp of tool schemas, and in Dashboard API tool selection, Gemini-3-flash-preview led the tested models across error categories. But performance drops as hop depth increases, and most models falter further once policy constraints are introduced. The conclusion is clear: there is still a wide gap between handling one or two tool calls well and delivering end-to-end reliability across APIs, documents, dialog context, and policy requirements together.
