#coding-agents

LLM Hacker News 18h ago 1 min read

FrontierCode Asks Whether an AI Patch Would Actually Get Merged

HN latched onto a practical shift in coding evals: correctness is no longer enough if the patch would fail human review.

#coding-agents #benchmark #evals

LLM X/Twitter Jun 2, 2026 1 min read

Composer 2.5 enters Grok Build as xAI’s long-task coding model

xAI says Composer 2.5 is now available inside Grok Build. The post, which describes the model as strong at complex instructions and long-running tasks, drew more than 640K views as coding-agent competition tightens.

#xai #grok #coding-agents

LLM May 29, 2026 1 min read

Devin hits $492M run-rate as Cognition bets on independent agents

Cognition is arguing that coding agents do not have to collapse into model-lab features. It raised more than $1B at a $26B valuation, with Devin’s run-rate revenue reaching $492M.

#cognition #devin #coding-agents

LLM X/Twitter May 28, 2026 2 min read

DeepSWE’s 113 tasks put GPT-5.5 at 70% and Claude Opus 4.7 at 54%

DeepSWE reframes coding-agent evaluation with 113 original tasks across 91 repositories. Its first board gives GPT-5.5 a 70.0% pass@1 score, versus 54.2% for Claude Opus 4.7.

#deepswe #coding-agents #benchmark

LLM X/Twitter May 26, 2026 1 min read

Grok V9-Medium completes 1.5T training, release due in 2-3 weeks

xAI’s next Grok foundation model is moving from training into fine-tuning at 1.5T parameters, three times the size of the current 0.5T production model. Musk says Cursor data was added and public release is 2 to 3 weeks away.

#xai #grok #model-training

LLM May 2, 2026 1 min read

OpenAI Open-Sources Symphony, a Coding Agent Orchestration Layer

OpenAI has released Symphony, an open-source specification that turns issue trackers like Linear into a control plane for autonomous coding agents. The system assigns a Codex agent per task and handles CI, rebasing, and PR management without human oversight.

#openai #open-source #coding-agents

LLM Hacker News Apr 30, 2026 2 min read

HN cared less about the launch copy than the 128B and 256K math behind Mistral Medium 3.5

Hacker News paid attention to Mistral Medium 3.5 because the size-to-capability tradeoff looked real: a 128B dense model with a 256K context window, open weights, and self-hosting claims that do not immediately drift into fantasy. The launch also tied the model to remote coding agents in Vibe and a new Work mode in Le Chat.

#mistral #open-weights #coding-agents

LLM Reddit Apr 30, 2026 2 min read

LocalLLaMA locks onto one word in Mistral Medium 3.5: dense

LocalLLaMA latched onto one detail immediately: dense 128B. Mistral Medium 3.5 drew attention because it tries to bundle reasoning, coding, and agent work into a model people can still imagine self-hosting.

#mistral #llm #open-weights

LLM Reddit Apr 29, 2026 2 min read

675 comments later, LocalLLaMA is still arguing about whether local coding LLMs are worth it

This was not just another “local models are bad” rant. The thread blew up because it mixed a blunt reality check with a serious counterargument: some of the pain comes from small models, but a lot of it may come from the harness wrapped around them.

#local-llm #coding-agents #developer-tools

LLM Hacker News Apr 29, 2026 1 min read

Dirac’s 65.2% TerminalBench run turned HN toward the harness, not just the model

HN jumped straight to a sharper question than the score itself: was this a model win or a harness win? Dirac’s 65.2% TerminalBench run turned into a broader argument about context curation, AST-guided search, and why coding agents still live or die on tooling decisions.

#coding-agents #benchmark #terminalbench

LLM Reddit Apr 28, 2026 2 min read

LocalLLaMA sees 38.2% as the moment local coding stops feeling theoretical

The spark in LocalLLaMA was not the raw score alone. The post landed because a 38.2% Terminal-Bench 2.0 result for Qwen 3.6-27B was framed as roughly late-2025 frontier quality, giving air-gapped and privacy-sensitive coding teams a real decision to make.

#qwen #terminal-bench #local-llms

LLM Hacker News Apr 28, 2026 2 min read

HN likes EvanFlow for the parts it refuses to automate

HN did not read EvanFlow as another shiny agent wrapper so much as a set of brakes for agentic coding. Checkpoints, integration contracts, and explicit no-auto-commit rules drew more attention than the TDD label itself.

#claude-code #tdd #coding-agents