#reward-models - Insights

LLM · Apr 19, 2026 · 1 min read

LLM judges miss unsafe answers 30% more often when stakes are named

A new arXiv preprint reports that LLM judges became meaningfully more lenient when prompts spelled out the consequences of the evaluation, exposing a weak point in automated safety and quality benchmarks.

#llm-evals #ai-safety #benchmarks