#coding-agents

LLM Reddit 43m ago 2 min read

LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks

LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.

#swe-bench #benchmarks #contamination

LLM 2h ago 2 min read

GitHub Copilot gets GPT-5.5, but the 7.5x multiplier changes the math

GitHub is rolling GPT-5.5 into Copilot across IDEs, CLI, mobile, github.com, and the cloud agent, turning OpenAI's latest model into a daily coding option instead of a release-note headline. The catch is a 7.5x premium request multiplier, and Business or Enterprise admins must explicitly enable access.

#github #copilot #gpt-5-5

LLM Hacker News 8h ago 2 min read

HN Turns on SWE-bench Verified as Contamination Overtakes the Score

HN piled in because this was bigger than another benchmark refresh. OpenAI said SWE-bench Verified is no longer a trustworthy frontier coding signal, and the thread immediately shifted to contamination, saturated leaderboards, and whether public coding evals can stay clean at all.

#swe-bench #evals #coding-agents

LLM 22h ago 2 min read

Cursor puts GPT-5.5 atop CursorBench at 72.8% and halves price

Why it matters: public coding benchmarks are getting less useful at the frontier, so a fresh product-side score can move developer attention fast. Cursor says GPT-5.5 is now its top model on CursorBench at 72.8% and is discounting usage by 50% through May 2.

#cursor #gpt-5-5 #benchmarks

LLM Hacker News 2d ago 3 min read

HN Reads Zed's Parallel Agents Launch as a Bet on Worktrees, Not Just More AI Panels

Hacker News liked that Zed did more than add extra agents to a sidebar. The thread focused on worktree isolation, repo scoping, and whether Zed found a more usable shape for multi-agent coding than the usual terminal pile-up. By crawl time on April 25, 2026, the post had 278 points and 160 comments.

#zed #coding-agents #worktrees

LLM Reddit 3d ago 2 min read

LocalLLaMA Rallies Around a Qwen3.6 Result That Puts the Scaffold on Trial

What energized LocalLLaMA was not just another Qwen score jump. It was the claim that changing the agent scaffold moved the same family of local models from 19% to 45% to 78.7%, making benchmark comparisons feel less settled than many assumed.

#qwen #coding-agents #benchmarks

LLM Hacker News 3d ago 3 min read

HN Sees Anthropic's Claude Code Postmortem as a Product-Layer Failure, Not a Model Collapse

Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.

#anthropic #claude-code #postmortem

LLM Twitter 3d ago 2 min read

Kimi K2.6 scales agent swarms to 300 workers and 4,000 coordinated steps

Why it matters: Moonshot is turning “agent swarm” from a demo phrase into an execution claim with real scale numbers. The Kimi post says one run can coordinate 300 sub-agents across 4,000 steps and return 100-plus files instead of chat transcripts.

#moonshot #kimi #agent-swarm

LLM Hacker News 3d ago 2 min read

HN Fixates on “Over-Editing”: When Coding Models Rewrite More Than the Bug

HN latched onto a pain every heavy coding-tool user knows: the bug is tiny, but the diff balloons anyway. A new write-up turns that annoyance into a measurable benchmark and argues that better prompting and RL can make models edit with more restraint.

#coding-agents #minimal-editing #code-review

LLM 4d ago 2 min read

Qwen3.6-Max-Preview pushes coding benchmarks, but stays cloud-only

Alibaba’s April 22 Qwen3.6-Max-Preview post claims top scores across six coding benchmarks and clear gains over Qwen3.6-Plus. The caveat is just as important: this is a hosted proprietary preview, not a new open-weight Qwen release.

#qwen #alibaba #coding-agents

LLM 4d ago 2 min read

Copilot pauses sign-ups as agent workloads break plan math

GitHub has paused new Copilot Pro, Pro+, and Student sign-ups after agentic workflows pushed compute demand beyond the old plan structure. The sharper signal is economic: token-based session and weekly limits now matter separately from premium request counts.

#github #copilot #coding-agents

LLM Hacker News 4d ago 2 min read

Kimi K2.6 turned HN’s model debate toward open-weight coding agents

HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12-hour and 13-hour coding cases drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.

#kimi #coding-agents #open-weights