HN latched onto a practical shift in coding evals: correctness is no longer enough if the patch would fail human review.
xAI says Composer 2.5 is now available inside Grok Build. The post, which drew more than 640K views as coding-agent competition tightens, describes the model as strong at complex instructions and long-running tasks.
Cognition is arguing that coding agents do not have to collapse into model-lab features. It raised more than $1B at a $26B valuation, with Devin’s run-rate revenue reaching $492M.
DeepSWE reframes coding-agent evaluation with 113 original tasks across 91 repositories. Its first board gives GPT-5.5 a 70.0% pass@1 score, versus 54.2% for Claude Opus 4.7.
xAI’s next Grok foundation model is moving from training into fine-tuning at 1.5T parameters, three times the size of the current 0.5T production model. Musk says Cursor data was added and public release is 2 to 3 weeks away.
OpenAI has released Symphony, an open-source specification that turns issue trackers like Linear into a control plane for autonomous coding agents. The system assigns a Codex agent per task and handles CI, rebasing, and PR management without human oversight.
Hacker News paid attention to Mistral Medium 3.5 because the size-to-capability tradeoff looked real: a 128B dense model with a 256K context window, open weights, and self-hosting claims that do not immediately drift into fantasy. The launch also tied the model to remote coding agents in Vibe and a new Work mode in Le Chat.
LocalLLaMA latched onto one detail immediately: dense 128B. Mistral Medium 3.5 drew attention because it tries to bundle reasoning, coding, and agent work into a model people can still imagine self-hosting.
This was not just another “local models are bad” rant. The thread blew up because it mixed a blunt reality check with a serious counterargument: some of the pain comes from small models, but a lot of it may come from the harness wrapped around them.
HN jumped straight to a sharper question than the score itself: was this a model win or a harness win? Dirac’s 65.2% TerminalBench run turned into a broader argument about context curation, AST-guided search, and why coding agents still live or die on tooling decisions.
The spark in LocalLLaMA was not the raw score alone. The post landed because a 38.2% Terminal-Bench 2.0 result for Qwen 3.6-27B was framed as roughly late-2025 frontier quality, putting air-gapped and privacy-heavy coding teams into a new decision zone.
HN did not read EvanFlow as another shiny agent wrapper so much as a set of brakes for agentic coding. Checkpoints, integration contracts, and explicit no-auto-commit rules drew more attention than the TDD label itself.