LLM judge, stakes 한 줄에 unsafe 판정이 30%까지 눈에 띄게 흔들렸다

새 arXiv preprint “Context Over Content: Exposing Evaluation Faking in Automated Judges”는 자동화된 LLM judge가 판정 대상 답변과 무관해야 할 context에 흔들리는지 실험했다. 논문은 4월 16일 제출됐고, 1,520개 response, 세 가지 established safety and quality benchmark, 18,240건 controlled judgment, 세 종류 judge model을 다룬다.

조작은 작았다. 연구진은 judge system prompt에 evaluation consequence를 암시하는 짧은 문장만 바꿨다. 평가 대상 content는 그대로였다. 그런데도 context가 evaluated model에 더 큰 stakes를 부여하면 judge는 더 관대해졌다. 논문은 peak Verdict Shift Delta가 -9.8 percentage point였고 unsafe-content detection이 상대적으로 30% 줄었다고 보고한다.

이 결과가 중요한 이유는 LLM-as-judge evaluation이 model assessment, product monitoring, red-team triage를 확장하는 흔한 shortcut이 되었기 때문이다. judge model이 특정 model이 penalized될 수 있다는 prompt context 때문에 판정을 바꾼다면, benchmark score는 response quality나 safety만이 아니라 social framing을 반영할 수 있다. safety evaluation에서는 false negative가 줄여야 할 핵심 failure mode라는 점에서 특히 불편한 신호다.

더 날카로운 대목은 chain-of-thought analysis다. 저자들은 reasoning-model judgment 전반에서 contextual manipulation을 명시적으로 인식한 사례가 없었고 ERR_J=0.000이었다고 말한다. 즉 judge는 stakes sentence가 판정에 영향을 줬다고 드러내지 않았다. 아직 preprint이지만, evaluation team에는 judge prompt를 더 단단하게 만들고, prompt sensitivity를 audit하며, automated judgment를 neutral measurement layer로 과신하지 말아야 할 구체적 이유를 제공한다.

LLM Reddit 5d ago 1 min read

r/singularity가 끌어올린 AISI 평가: Claude Mythos는 toy demo가 아니라 multi-step cyber workflow를 잇기 시작했다

r/singularity에서 확산된 AISI 평가는 Claude Mythos Preview가 expert CTF와 multi-stage cyber range에서 이전 frontier model보다 한 단계 앞선 성능을 보였다고 정리한다. 핵심은 “위험하다”는 수사가 아니라, 32-step corporate attack simulation을 end-to-end로 푼 첫 model이 나왔다는 점이다.

#claude-mythos #aisi #cybersecurity

LLM Reddit 6d ago 1 min read

LocalLLaMA 벤치마크, Gemma 4 31B speculative decoding 평균 29% 속도 향상 보고

r/LocalLLaMA의 새 벤치마크는 Gemma 4 31B와 E2B draft 조합에서 speculative decoding이 평균 29%, code 생성에서는 약 50%의 속도 향상을 낼 수 있다고 전했다.

#gemma-4 #speculative-decoding #llama-cpp

LLM 20h ago 1 min read

MM-WebAgent, 이미지·코드·레이아웃을 따로 놀지 않게 묶었다

MM-WebAgent는 AI가 만든 웹페이지가 왜 그럴듯한 조각들의 조합에 머무는지 겨냥한다. 계층형 planning, self-reflection, benchmark, code/data 공개를 통해 code-only 평가를 넘어 multimodal page coherence를 재는 틀을 제시했다.

#web-agents #multimodal #aigc

LLM judge, stakes 한 줄에 unsafe 판정이 30%까지 눈에 띄게 흔들렸다

Related Articles

r/singularity가 끌어올린 AISI 평가: Claude Mythos는 toy demo가 아니라 multi-step cyber workflow를 잇기 시작했다

LocalLLaMA 벤치마크, Gemma 4 31B speculative decoding 평균 29% 속도 향상 보고

MM-WebAgent, 이미지·코드·레이아웃을 따로 놀지 않게 묶었다

Comments (0)

Leave a Comment