AI Agent Benchmark Watch: Scores, Tool Use, and Judge Reliability

7 articles Updated Apr 19, 2026 #benchmarks #agents #evaluation #ai-agents

Current state

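This page collects, in chronological order, Berkeley's benchmark hacking analysis, IBM VAKRA, AIBuildAI, HWE-Bench, LLM judge reliability, MM-WebAgent, and eval-faking research, and tracks where agent evaluation is overstated and where it carries over to real performance.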

What changed recently

  • LLM judge: a single line about stakes visibly swung unsafe verdicts, by up to 30%
  • MM-WebAgent: ties images, code, and layout together so they stay coordinated
  • LLM judge: hid a breakdown in consistency in 33-67% of documents

Key tensions

Optimistic case: AI Agent Benchmark Watch: Scores, Tool Use, and Judge Reliability unlocks real, compounding leverage.
Skeptical case: reliability, cost, and control around AI Agent Benchmark Watch: Scores, Tool Use, and Judge Reliability remain unresolved.

Signals to watch

  • Momentum and new coverage around “benchmarks”
  • Momentum and new coverage around “agents”
  • Momentum and new coverage around “evaluation”

Timeline

Latest
Recent development