OpenAI moved Codex computer use onto Windows, changing where coding agents can act. The May 29 post passed 950,000 views and says ChatGPT mobile can start, review, and redirect work running on a Windows machine.
Google’s I/O 2026 AI story is about distribution as much as models. Gemini 3.5 Flash is now generally available across API, Antigravity, Android Studio, enterprise tools, Search, and the Gemini app, while Gemini Omni Flash brings video generation into the same push.
Cognition is arguing that coding agents do not have to collapse into model-lab features. It raised more than $1B at a $26B valuation, with Devin’s run-rate revenue reaching $492M.
Claude Opus 4.8 now has a fast mode that runs the same model at roughly 2.5x speed. Anthropic says the mode is three times cheaper than before, shifting the cost equation for long agent sessions.
Claude Opus 4.8 is showing its strongest early signal in agentic work, not only coding. Artificial Analysis says the model scored 1890 on GDPval-AA, 121 points ahead of GPT-5.5 xhigh.
LocalLLaMA readers noticed the infrastructure lesson: Zai claimed 15% more GPU inference throughput and 40.6% lower first-token P99 latency with the same GPUs, model, and software stack.
LocalLLaMA readers quickly turned the story into an operator checklist: check Starlette, FastAPI, vLLM, LiteLLM, MCP servers, and anything exposed to the Internet.
The Reddit thread zeroed in on a hard lesson for AI-written kernels: verifier success can miss optimizer- and data-dependent numerical failures.
HN readers focused less on the version number and more on whether same-price upgrades, cheaper fast mode, and Claude Code dynamic workflows will show up in real agent sessions.
Mistral is turning Le Chat into Vibe, a combined work and coding agent. The launch adds Work Mode, remote Code Mode, a VS Code extension, CLI updates, and paid plans starting at $14.99 per month.
DeepSWE reframes coding-agent evaluation with 113 original tasks across 91 repositories. Its first board gives GPT-5.5 a 70.0% pass@1 score, versus 54.2% for Claude Opus 4.7.
The weak point in model leaderboards may be the tasks, not only the models. A new arXiv paper reports critical issues in more than 25.7% of evaluated benchmark tasks and shows ranking shifts after filtering flawed items.