A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing an estimate that at least 16.4% of its test cases are flawed. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
#evaluation
A Reddit post in r/singularity links to METR's new productivity update, revisiting the widely cited 2025 result that AI slowed experienced open-source developers. The new signal points toward a possible speedup, but METR stresses major selection-bias limitations.
A high-signal Hacker News discussion points to research arguing that LLM guardrails can behave very differently across languages, with reported score shifts of 36-53% when only the language of the policy text changes.
A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved average pass rate substantially, while self-generated skills showed no average gain.
OpenAI published a framework for safety alignment based on instruction hierarchy and uncertainty-aware behavior. In the company’s reported tests, refusal on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions exceeding a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.
NIST’s CAISI released draft guidance NIST AI 800-2 for automated language-model benchmark evaluations and opened comments through March 31, 2026. The draft focuses on objective setting, execution methodology, and analysis/reporting quality.
A LocalLLaMA discussion of the SWE-rebench January runs reports tightly clustered top-tier results, with Claude Code leading on pass@1 and pass@5 while open models narrow the gap.
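The pass@1 and pass@5 numbers in leaderboards like this are typically computed with the standard unbiased pass@k estimator (probability that at least one of k samples, drawn from n attempts with c successes, solves the task). A minimal sketch, assuming SWE-rebench follows that convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k
    samples drawn from n attempts (c of them correct) passes."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a model solves a task in 3 of 10 sampled attempts.
print(pass_at_k(10, 3, 1))  # 0.3 (equals c/n for k=1)
print(pass_at_k(10, 3, 5))  # ~0.917
```

For k=1 the estimator reduces to the raw success rate c/n; larger k rewards models that succeed at least occasionally across many samples.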