AI Agent Benchmark Watch: Scores, Tool Use, and Judge Reliability

7 articles Updated Apr 19, 2026 #benchmarks #agents #evaluation #ai-agents

Current state

This page groups Berkeley's benchmark hacking analysis, IBM VAKRA, AIBuildAI, HWE-Bench, LLM judge reliability, MM-WebAgent, and eval-faking research in chronological order, tracking where agent evaluation is inflated and where it carries over to real performance.

What changed recently

LLM judge: one line about stakes swung unsafe verdicts noticeably, by up to 30%
MM-WebAgent ties images, code, and layout together so they stop drifting apart
LLM judge hid consistency breakdowns in 33-67% of documents

Key tensions

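Optimistic case: agents are posting strong results on harder, more realistic tasks, including a 70.7% fix rate on HWE-Bench and a 63.1% medal rate on MLE-Bench.

Skeptical case: reliability, cost, and control remain unresolved. Benchmark scores can be produced without solving the task, and LLM judges shift with prompt framing and hide per-document inconsistency.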

Signals to watch

Momentum and new coverage around “benchmarks”
Momentum and new coverage around “agents”
Momentum and new coverage around “evaluation”

Timeline

Latest

LLM Apr 19, 2026 1 min read

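LLM judge: one line about stakes swung unsafe verdicts noticeably, by up to 30%

A new arXiv preprint reports that a single line hinting at the consequences of the evaluation result was enough to make LLM judges more lenient. The finding exposes a weakness in automated safety and quality benchmarks.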

#llm-evals #ai-safety #benchmarks
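The summary above does not say whether "30%" is an absolute change in percentage points or a change relative to the baseline rate. A minimal sketch of how the two readings differ, assuming you have judge verdicts for the same items with and without the stakes line; `unsafe_rate`, `verdict_shift`, and the `"unsafe"` label are hypothetical names for illustration, not taken from the preprint:

```python
def unsafe_rate(verdicts):
    """Share of verdicts labeled "unsafe" (label name is an assumption)."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(v == "unsafe" for v in verdicts) / len(verdicts)


def verdict_shift(baseline, with_stakes):
    """Return (absolute, relative) change in the unsafe rate.

    absolute: difference in rates, e.g. -0.15 means 15 percentage points fewer.
    relative: absolute change divided by the baseline rate, or None when the
              baseline rate is zero and the ratio is undefined.
    """
    base = unsafe_rate(baseline)
    primed = unsafe_rate(with_stakes)
    absolute = primed - base
    relative = absolute / base if base else None
    return absolute, relative
```

With a baseline of 10 unsafe verdicts out of 20 and 7 out of 20 after adding the stakes line, the absolute shift is 15 percentage points while the relative shift is 30%, so it is worth checking which one a headline figure refers to.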

Recent development

LLM Apr 18, 2026 1 min read

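MM-WebAgent ties images, code, and layout together so they stop drifting apart

MM-WebAgent targets the reason AI-generated web pages tend to remain a collection of plausible-looking pieces. With hierarchical planning, self-reflection, a benchmark, and released code and data, it offers a framework for measuring multimodal page coherence beyond code-only evaluation.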

#web-agents #multimodal #aigc

Recent development

LLM Research Apr 17, 2026 1 min read

LLM judge hid consistency breakdowns in 33-67% of documents

A new arXiv paper shows that per-document instability in LLM judges hides behind low average error rates. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction set width tracked absolute error closely.

#llm #evaluation #benchmarks
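A directed 3-cycle means the judge prefers A over B, B over C, and C over A for the same document, so no consistent ranking exists. A minimal sketch of how such cycles can be counted, assuming each document's pairwise judgments are stored as a set of `(winner, loser)` pairs; the function names and data layout are hypothetical and not the paper's implementation:

```python
from itertools import combinations


def has_directed_3cycle(prefs):
    """Return True if the (winner, loser) pairs contain a directed 3-cycle."""
    items = sorted({x for pair in prefs for x in pair})
    for a, b, c in combinations(items, 3):
        # A cycle over three items can run in either rotational direction.
        forward = (a, b) in prefs and (b, c) in prefs and (c, a) in prefs
        backward = (b, a) in prefs and (c, b) in prefs and (a, c) in prefs
        if forward or backward:
            return True
    return False


def cyclic_fraction(prefs_by_doc):
    """Fraction of documents with at least one directed 3-cycle."""
    if not prefs_by_doc:
        return 0.0
    flagged = sum(has_directed_3cycle(p) for p in prefs_by_doc.values())
    return flagged / len(prefs_by_doc)
```

The check enumerates every triple of candidates, which is fine for the handful of summaries typically compared per document; an aggregate error rate can stay low even when `cyclic_fraction` is high, which is the gap the article describes.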

Recent development

LLM Apr 17, 2026 2 min read

HWE-Bench finds agents fix 70.7% of real hardware bugs

HWE-Bench moves LLM agent evaluation from isolated HDL tasks to repository-scale hardware repairs. The best agent solved 70.7% overall, but performance fell below 65% on complex SoC-level projects.

#agents #hardware #benchmarks

Recent development

LLM Apr 17, 2026 2 min read

AIBuildAI reaches 63.1% medal rate for model-building agents

A new arXiv paper puts a hierarchical agent system at the top of MLE-Bench with a 63.1% medal rate. The result matters because the agent handles design, coding, debugging, training, and tuning from a task description plus data.

#agents #automl #benchmarks

Recent development

LLM Apr 17, 2026 1 min read

IBM VAKRA measures where tool agents break down in an executable environment

IBM Research's VAKRA moves agent benchmarking from static Q&A to an executable tool environment. It covers 62 domains, 8,000+ locally hosted APIs, and 3-7 step reasoning chains, and the results suggest that agent reliability still struggles to get beyond tool-demo level.

#agents #benchmarks #ibm

Recent development

AI Hacker News Apr 12, 2026 1 min read

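Why Berkeley says AI agent benchmark numbers are hard to trust

UC Berkeley researchers report that, after auditing eight major AI agent benchmarks, they could produce near-perfect scores without solving the actual problems. The core message is to examine evaluation design and attack resistance before trusting leaderboard numbers.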

#benchmarks #ai-agents #evaluation
