#evals

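HN did not read “AI cybersecurity is not proof of work” as a simple anti-hype piece. The core debate was whether more GPUs and longer sampling are a sufficient condition for finding bugs, or whether model capability and the threat model are the real bottleneck.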

#ai-security #cybersecurity #llm

AI Mar 28, 2026 1 min read

Google DeepMind releases harmful manipulation eval toolkit after nine multinational studies

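Google DeepMind said on March 26, 2026 that it has released a public toolkit for measuring harmful manipulation by AI systems. The company said the toolkit draws on nine studies with more than 10,000 participants across the UK, US, and India, and that the results also feed into safety evaluations of models such as Gemini 3 Pro.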

#google #deepmind #ai-safety

LLM Mar 24, 2026 2 min read

Google DeepMind releases cognitive framework for measuring AGI progress

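Google DeepMind has published a cognitive taxonomy for evaluating progress toward AGI and launched a Kaggle hackathon to turn it into actual benchmarks. The core proposal is to compare AI against a human baseline on each of 10 cognitive abilities rather than relying on a single headline score.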

#deepmind #agi #benchmarks

LLM Hacker News Mar 23, 2026 2 min read

DSPy adoption is slow, so why do teams keep rebuilding the same LLM patterns?

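A post by Skylar Payne that drew attention on Hacker News argues that as AI systems grow, teams end up reimplementing DSPy's core patterns. The HN discussion also pointed to DSPy's Python-centricity, the place of prompt optimization, and the cost of designing evals as practical reasons for slow adoption.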

#dspy #llm-engineering #hacker-news

LLM Hacker News Mar 12, 2026 1 min read

The gap between passing SWE-bench and mergeable code, as seen by Hacker News

METR's March 10, 2026 note estimates that roughly half of recent agent-written SWE-bench Verified PRs would not get past maintainer review even though they pass the tests. HN read this as a warning that benchmark scores still cannot stand in for scope control, code quality, and repo fit.

#swe-bench #coding-agents #evals