HN piled in because this was bigger than just another benchmark refresh. OpenAI said SWE-bench Verified is no longer a trustworthy frontier coding signal, and the thread immediately shifted to contamination, saturated leaderboards, and whether public coding evals can stay clean at all.
#evals
HN liked the premise of a fresh benchmark, then immediately started arguing about whether single-shot scoring tells the truth about coding models.
Anthropic’s new agent-market experiment matters because it turns model quality into money. In a 69-person office marketplace, Claude agents closed 186 deals worth just over $4,000, and Opus-backed users got better prices without realizing it.
Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.
HN treated “AI cybersecurity is not proof of work” as a serious argument about search, model capability, and security asymmetry. The thread pushed past hype into a harder question: when an LLM flags a bug, did it understand the exploit path or just sample a suspicious pattern?
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer checkpoints as often as every five hours. Cursor's research post says the loop trains on billions of production tokens from real user interactions, runs evals including CursorBench before deployment, and has already shown gains in edit persistence, dissatisfied follow-ups, and latency.
Google DeepMind said on March 26, 2026 that it is releasing a public toolkit to measure harmful manipulation by AI systems. The company says the work spans nine studies with more than 10,000 participants and now informs safety evaluations for models including Gemini 3 Pro.
Google DeepMind has published a cognitive taxonomy for evaluating progress toward AGI and paired it with a Kaggle hackathon to build new benchmarks. The framework maps AI systems against human baselines across 10 cognitive abilities instead of relying on a single headline score.
A Hacker News thread around Skylar Payne's DSPy post argues that teams often rebuild DSPy-style LLM engineering patterns as systems mature, even though unfamiliar abstractions, Python fit, and eval design still slow direct adoption.
On March 9, 2026, OpenAI said it plans to acquire Promptfoo and integrate its AI security tooling into OpenAI Frontier. The move pushes security testing, red-teaming, and governance closer to the default workflow for enterprise agents.
OpenAI introduced a new evaluation suite and research paper on Chain-of-Thought controllability. The company says GPT-5.4 Thinking shows low ability to obscure its reasoning, which supports continued use of CoT monitoring as a safety signal.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.