#evals

LLM Reddit Jun 12, 2026 1 min read

Papers with Code now has to track “papers without code”

The r/MachineLearning thread captured a practical benchmark problem: closed models dominate eval tables even when their results are not reproducible in the old Papers with Code sense.

#benchmarks #open-source #leaderboards

LLM Hacker News Jun 10, 2026 1 min read

FrontierCode Asks Whether an AI Patch Would Actually Get Merged

HN latched onto a practical shift in coding evals: correctness is no longer enough if the patch would fail human review.

#coding-agents #benchmark #evals

AI Hacker News Jun 4, 2026 1 min read

A $1,500 LLM hacking test exposes the gap between capability, guardrails, and harnesses

HN focused less on the leaderboard and more on how refusals, tool loops, and account permissions shaped the result.

#llm-security #pentesting #firebase

LLM Hacker News Apr 28, 2026 2 min read

HN thinks the SWE-bench story is about contamination, not bragging rights

HN treated OpenAI's post less as benchmark housekeeping and more as an obituary for a famous coding leaderboard. The thread cared far more about flawed tests and contamination than about who happened to top the chart first.

#openai #swe-bench #evals

AI Hacker News Apr 26, 2026 2 min read

HN Greets LamBench With Curiosity, Then Starts Arguing About One-Shot Scoring

HN liked the premise of a fresh benchmark, then immediately started arguing about whether single-shot scoring tells the truth about coding models.

#benchmarks #lambda-calculus #evals

LLM Hacker News Apr 24, 2026 3 min read

HN Sees Anthropic's Claude Code Postmortem as a Product-Layer Failure, Not a Model Collapse

Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.

#anthropic #claude-code #postmortem

AI Hacker News Apr 18, 2026 2 min read

HN asked whether AI bug hunting is really just more tokens

HN treated “AI cybersecurity is not proof of work” as a serious argument about search, model capability, and security asymmetry. The thread pushed past hype into a harder question: when an LLM flags a bug, did it understand the exploit path or just sample a suspicious pattern?

#ai-security #cybersecurity #llm

AI Mar 28, 2026 2 min read

Google DeepMind publishes a harmful-manipulation eval toolkit after nine multi-country studies

Google DeepMind said on March 26, 2026 that it is releasing a public toolkit to measure harmful manipulation by AI systems. The company says the work spans nine studies with more than 10,000 participants and now informs safety evaluations for models including Gemini 3 Pro.

#google #deepmind #ai-safety

LLM Mar 24, 2026 2 min read

Google DeepMind Proposes a Cognitive Framework for Measuring AGI Progress

Google DeepMind has published a cognitive taxonomy for evaluating progress toward AGI and paired it with a Kaggle hackathon to build new benchmarks. The framework maps AI systems against human baselines across 10 cognitive abilities instead of relying on a single headline score.

#deepmind #agi #benchmarks

LLM Hacker News Mar 23, 2026 2 min read

Why Teams Rebuild DSPy Patterns Even as Adoption Lags

A Hacker News thread around Skylar Payne's DSPy post argues that teams often rebuild DSPy-style LLM engineering patterns as systems mature, even though unfamiliar abstractions, Python fit, and eval design still slow direct adoption.

#dspy #llm-engineering #hacker-news

LLM Mar 12, 2026 2 min read

OpenAI says GPT-5.4 Thinking shows low chain-of-thought controllability in new safety study

OpenAI introduced a new evaluation suite and research paper on Chain-of-Thought controllability. The company says GPT-5.4 Thinking shows low ability to obscure its reasoning, which supports continued use of CoT monitoring as a safety signal.

#openai #reasoning #safety

LLM Hacker News Mar 12, 2026 1 min read

Hacker News Focuses on the Gap Between SWE-bench Passes and Mergeable Code

METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.

#swe-bench #coding-agents #evals
