#agent-evals

LLM Reddit Apr 4, 2026 1 min read

r/LocalLLaMA spotlights YC-Bench results for long-horizon agent evaluation

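YC-Bench, which drew attention on `r/LocalLLaMA`, is a long-horizon agent benchmark that has a model run a startup for one year. The headline results: only 3 of the 12 models reliably finished above their starting capital, and GLM-5 came close to Claude Opus 4.6 at a much lower cost.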

#yc-bench #agent-evals #long-horizon

LLM Mar 28, 2026 2 min read

OpenAI moves to integrate agent security testing into Frontier with Promptfoo acquisition

OpenAI announced its acquisition of Promptfoo on March 9, 2026. The company said it will integrate Promptfoo's agent security testing and evaluation technology into OpenAI Frontier to address enterprise risks such as prompt injection, jailbreaks, data leaks, and tool misuse from the development stage onward.

#openai #promptfoo #ai-security