A new paper discussed in r/MachineLearning argues that unofficial model-access providers can quietly substitute models and distort both research and production results.
#evaluation
OpenAI said it will acquire Promptfoo and fold its security and evaluation stack into OpenAI Frontier. The company said Promptfoo will remain open source and current customers will continue to receive support.
OpenAI said it published a new Chain-of-Thought controllability evaluation suite and research paper. The company reports that GPT-5.4 Thinking showed limited ability to obscure its reasoning, supporting chain-of-thought monitoring as a practical safety mechanism.
In a January 21, 2026 engineering post, Anthropic explained how it repeatedly redesigned a take-home performance test as Claude models improved. The company describes how Opus 4 and Opus 4.5 changed the evaluation baseline and forced process-level updates.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
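The post's harness is not public, but the "verify after every edit" idea is easy to sketch: run a verifier immediately after each proposed edit and only commit edits that pass. The sketch below is illustrative, with stand-in `apply`/`run_tests` logic (here, "verification" is just checking that the patched text still compiles as Python); the names and the compile-based check are assumptions, not the post's actual setup.

```python
def run_tests(code: str) -> bool:
    """Stand-in verifier: 'tests pass' means the text compiles as Python.
    A real harness would run the project's test suite instead."""
    try:
        compile(code, "<patch>", "exec")
        return True
    except SyntaxError:
        return False

def verified_edit_loop(code: str, edits: list[tuple[str, str]]) -> str:
    """Apply each proposed edit, keeping it only if verification still passes."""
    for old, new in edits:
        candidate = code.replace(old, new)
        if run_tests(candidate):   # verify immediately after the edit
            code = candidate       # commit: this becomes the new good state
        # else: discard the edit and continue from the last good state
    return code

program = "x = 1\nprint(x)\n"
edits = [
    ("x = 1", "x = 2"),       # good edit: committed
    ("print(x)", "print(x"),  # bad edit (unbalanced paren): discarded
    ("x = 2", "x = 3"),       # good edit, applies to the committed state
]
result = verified_edit_loop(program, edits)  # -> "x = 3\nprint(x)\n"
```

The key design point the post reports on: discarding a failing edit immediately keeps the agent anchored to a known-good state, rather than letting errors compound across a long edit sequence.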
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing findings that at least 16.4% of its test cases are flawed. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
A Reddit post in r/singularity links METR’s new productivity update, revisiting the widely cited 2025 result that AI slowed experienced open-source developers. The new signal points toward possible speedup, but METR stresses major selection-bias limitations.
A high-signal Hacker News discussion points to research arguing that LLM guardrails can behave very differently across languages, with reported score shifts of 36–53% when only the policy language changes.
A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved the average pass rate substantially, while self-generated skills showed no average gain.
OpenAI published a framework for safety alignment based on instruction hierarchy and uncertainty-aware behavior. In the company’s reported tests, refusal on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions in which the model exceeded a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.
NIST’s CAISI released draft guidance NIST AI 800-2 for automated language-model benchmark evaluations and opened comments through March 31, 2026. The draft focuses on objective setting, execution methodology, and analysis/reporting quality.