A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing an estimate that at least 16.4% of its test cases are flawed. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
#evaluation
A Reddit post in r/singularity links to METR's new productivity update, revisiting the widely cited 2025 result that AI slowed experienced open-source developers. The new signal points toward a possible speedup, but METR stresses major selection-bias limitations.
A high-signal Hacker News discussion points to research arguing that LLM guardrails can behave very differently across languages, with reported score shifts of 36-53% when only the language of the policy text changes.
A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved average pass rate substantially, while self-generated skills showed no average gain.
OpenAI published a framework for safety alignment based on instruction hierarchy and uncertainty-aware behavior. In the company’s reported tests, refusal on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions exceeding a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.
NIST’s CAISI released draft guidance NIST AI 800-2 for automated language-model benchmark evaluations and opened comments through March 31, 2026. The draft focuses on objective setting, execution methodology, and analysis/reporting quality.
A LocalLLaMA discussion of the SWE-rebench January runs reports tightly clustered top-tier results, with Claude Code leading on pass@1 and pass@5 while open models narrow the gap.
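The pass@1 and pass@5 numbers in leaderboards like this are typically computed with the standard unbiased pass@k estimator (probability that at least one of k samples, drawn from n attempts with c successes, solves the task). A minimal sketch, assuming SWE-rebench follows that convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k
    samples drawn from n attempts (c of them correct) passes."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a model solves a task in 3 of 10 sampled attempts.
print(pass_at_k(10, 3, 1))  # 0.3 (equals c/n for k=1)
print(pass_at_k(10, 3, 5))  # ~0.917
```

For k=1 the estimator reduces to the raw success rate c/n; larger k rewards models that succeed at least occasionally across many samples.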