OpenAI is pitching GPT-5.5 as more than a routine model refresh. With 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, and a claim that it keeps GPT-5.4-level latency, the company is resetting expectations for long-running coding agents.
LLM
RSS FeedLocalLLaMA upvoted this because a 27B open model suddenly looked competitive on agent-style work, not because everyone agreed on the benchmark. The thread stayed lively precisely because the result felt important and a little suspicious at the same time.
r/MachineLearning did not reward this post for frontier performance. It took off because a 7.5M-parameter diffusion LM trained on tiny Shakespeare on an M2 Air made a usually intimidating idea feel buildable.
LocalLLaMA was not impressed by another TTS clip so much as by a build log. The post that took off showed Qwen3-TTS running locally in real time, quantized through llama.cpp, with extra alignment work to make subtitles and lip sync behave.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
Sakana AI is trying to sell orchestration itself as a model product, not just a prompt hack around other APIs. In its beta table, fugu-ultra posts 54.2 on SWEPro and 95.1 on GPQAD while shipping behind an OpenAI-compatible API.
r/MachineLearning paid attention because the benchmark did not just crown a winner. It argued that many teams are overpaying for document extraction, then backed that claim with repeated runs, cost-per-success numbers, and a leaderboard where several cheaper models outperformed pricey defaults.
What energized LocalLLaMA was not just another Qwen score jump. It was the claim that changing the agent scaffold moved the same family of local models from 19% to 45% to 78.7%, making benchmark comparisons feel less settled than many assumed.
Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.
LocalLLaMA upvoted this because it felt like real plumbing, not another benchmark screenshot. The excitement was about DeepSeek open-sourcing faster expert-parallel communication and reusable GPU kernels.
Why it matters: an open-weight 27B dense model is now being pitched against much larger coding systems on real agent tasks. Qwen’s own model card lists SWE-bench Verified at 77.2 for Qwen3.6-27B versus 76.2 for Qwen3.5-397B-A17B, with Apache 2.0 licensing.
Why it matters: Moonshot is turning “agent swarm” from a demo phrase into an execution claim with real scale numbers. The Kimi post says one run can coordinate 300 sub-agents across 4,000 steps and return 100-plus files instead of chat transcripts.