LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing the contamination and flawed tests laid out in hard numbers, at a scale that makes the old bragging rights look shaky.
#swe-bench
HN piled in because this was bigger than another benchmark refresh. OpenAI said SWE-bench Verified is no longer a trustworthy frontier coding signal, and the thread immediately shifted to contamination, saturated leaderboards, and whether public coding evals can stay clean at all.
Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.
Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
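The post gives no code, but the idea is easy to sketch. Below is a minimal, hypothetical version of a verify-after-every-edit loop in Python: `propose_edit` and `apply_edit` stand in for the model call and patch application (names assumed, not from the post), and the checker shells out to pytest as a placeholder for whatever test command the real harness runs.

```python
import subprocess

def run_checks(repo_dir: str) -> tuple[bool, str]:
    """Run the repo's checks and return (passed, output).

    pytest is a placeholder; a real harness would invoke the
    project's own test runner (cargo test, npm test, ...).
    """
    proc = subprocess.run(
        ["pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True, timeout=600,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve(task: str, repo_dir: str, propose_edit, apply_edit,
          max_steps: int = 20) -> bool:
    """Verify-after-every-edit loop: run the checks after each
    single edit and feed failures straight back to the model,
    instead of batching edits and checking once at the end."""
    feedback = ""
    for _ in range(max_steps):
        edit = propose_edit(task, feedback)  # model call (assumed interface)
        apply_edit(repo_dir, edit)
        ok, output = run_checks(repo_dir)
        if ok:
            return True
        feedback = output[-4000:]  # truncate logs for the next prompt
    return False
```

The design choice doing the work is checking after every single edit, so the model sees failures while the faulty change is still the most recent context.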
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement is a prompt to weigh coding-model benchmark scores more skeptically when making production decisions.
A Hacker News discussion highlights arXiv:2602.11988, which finds that repository context files like AGENTS.md often reduced coding-agent task success while increasing inference cost by more than 20%.
A LocalLLaMA discussion of SWE-rebench January runs reports a tight top tier, with Claude Code leading on both pass@1 and pass@5 while open models narrow the gap.
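For readers skimming the leaderboard, pass@k is the probability that at least one of k sampled attempts solves a task. The standard unbiased estimator (Chen et al., 2021) is sketched below; whether SWE-rebench computes its numbers exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n total samples, c of them
    correct: 1 minus the chance that all k drawn samples fail."""
    if n - c < k:  # fewer failures than k: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 3 correct ->
# pass@1 = 0.3, pass@5 ~= 0.917
```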