#llm-evals

LLM Apr 19, 2026 1 min read

LLM judges miss unsafe answers 30% more when stakes are named

A new arXiv preprint reports that LLM judges became meaningfully more lenient when prompts framed evaluation consequences, exposing a weak point in automated safety and quality benchmarks.

#llm-evals #ai-safety #benchmarks

AI Reddit Apr 10, 2026 2 min read

Reddit Tests the Claim That Mythos-Style Security Work Needs Frontier Models

A high-scoring LocalLLaMA thread amplified AISLE's claim that smaller open or low-cost models reproduced much of the vulnerability analysis Anthropic highlighted for Mythos. The central Reddit pushback was that reasoning over an isolated vulnerable function is very different from autonomously finding that bug inside a large codebase.

#cybersecurity #mythos #open-models

LLM Mar 12, 2026 2 min read

NIST AI 800-3 formalizes benchmark and generalized accuracy for AI evaluations

NIST says AI 800-3 gives evaluators a clearer statistical framework by separating benchmark accuracy from generalized accuracy and by introducing generalized linear mixed models for uncertainty estimation. The February 19, 2026 report argues that many current benchmark comparisons hide assumptions that can distort procurement, development, and policy decisions.

#nist #llm-evals #benchmarks