#swe-bench

LLM May 27, 2026 2 min read

Benchmark audit finds 25.7% flawed tasks and shifts agent rankings

The weak point in model leaderboards may be the tasks, not only the models. A new arXiv paper reports critical issues in more than 25.7% of evaluated benchmark tasks and shows ranking shifts after filtering flawed items.

#benchmarks #swe-bench #agents

LLM Hacker News Apr 28, 2026 2 min read

HN thinks the SWE-bench story is about contamination, not bragging rights

HN treated OpenAI's post less as benchmark housekeeping and more as an obituary for a famous coding leaderboard. The thread cared far more about flawed tests and contamination than about who happened to top the chart first.

#openai #swe-bench #evals

LLM Reddit Apr 27, 2026 2 min read

LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks

LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.

#swe-bench #benchmarks #contamination

AI X/Twitter Apr 17, 2026 2 min read

Qwen3.6-35B-A3B opens 35B MoE weights with 3B active parameters

Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.

#qwen #open-weights #moe

LLM Hacker News Apr 8, 2026 2 min read

Hacker News Sees GLM-5.1 Push Further Into Long-Horizon Agentic Engineering

Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal-Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.

#glm-5.1 #agentic-coding #swe-bench

LLM Hacker News Mar 14, 2026 2 min read

Hacker News Debates Whether LLM Coding Progress Has Stalled on Maintainer Merge Rates

A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.

#swe-bench #coding-agents #evaluation

LLM Hacker News Mar 12, 2026 1 min read

Hacker News Focuses on the Gap Between SWE-bench Passes and Mergeable Code

METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.

#swe-bench #coding-agents #evals

LLM Reddit Mar 4, 2026 2 min read

LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard

A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.

#swe-bench #coding-agents #qwen

LLM Reddit Feb 27, 2026 2 min read

OpenAI Pauses SWE-bench Verified Evaluations After 16.4% Flaw Finding

A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement suggests that SWE-bench Verified scores deserve more caution when they inform production decisions about coding models.

#openai #swe-bench #benchmark

LLM Hacker News Feb 17, 2026 1 min read

HN Spotlight: New arXiv Study Questions Whether AGENTS.md Helps Coding Agents

A Hacker News discussion highlights arXiv:2602.11988, which finds that repository context files like AGENTS.md often reduced coding-agent task success while increasing inference cost by more than 20%.

#coding-agents #agents-md #swe-bench

LLM Reddit Feb 14, 2026 1 min read

SWE-rebench January 2026 Snapshot Highlights a Tight Race in Coding Agents

A LocalLLaMA discussion of SWE-rebench January runs reports close top-tier results, with Claude Code leading pass@1 and pass@5 while open models narrow the gap.

#benchmark #coding-agents #swe-bench