A new arXiv preprint reports that LLM judges became meaningfully more lenient when prompts framed evaluation consequences, exposing a weak point in automated safety and quality benchmarks.
#llm-evals
RSS FeedLLM Apr 19, 2026 1 min read
AI Reddit Apr 10, 2026 2 min read
A high-scoring LocalLLaMA thread amplified AISLE's claim that smaller open or low-cost models reproduced much of the vulnerability analysis Anthropic highlighted for Mythos. The central Reddit pushback was that reasoning over an isolated vulnerable function is very different from autonomously finding that bug inside a large codebase.
LLM Mar 12, 2026 2 min read
NIST says AI 800-3 gives evaluators a clearer statistical framework by separating benchmark accuracy from generalized accuracy and by introducing generalized linear mixed models for uncertainty estimation. The February 19, 2026 report argues that many current benchmark comparisons hide assumptions that can distort procurement, development, and policy decisions.