Open-weight coding models that can run locally are still scarce. Poolside has pushed Laguna XS.2 into that lane with a 33B total / 3B active MoE that fits on a single GPU, and its technical note claims 44.5% on SWE-bench Pro.
LocalLLaMA got animated because the post promised something people can feel immediately: less reasoning drag. A user claims a small GBNF constraint cut Qwen3.6's token burn hard enough to speed up long tasks without wrecking benchmark scores.
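The thread didn't publish the grammar itself, so the sketch below only illustrates the mechanism: a GBNF constraint that forces an answer-first output shape, loaded through llama-cpp-python's grammar support. The rule names, model path, and prompt are all placeholders, not anything from the post.

```python
# Minimal sketch: constraining output shape with a GBNF grammar.
# The grammar is illustrative -- the actual constraint from the thread
# was not published.
from llama_cpp import Llama, LlamaGrammar

GBNF = r"""
root      ::= answer "\n" rationale
answer    ::= "ANSWER: " line
rationale ::= "WHY: " line
line      ::= [^\n]+
"""

llm = Llama(model_path="qwen-model.gguf")   # placeholder path
grammar = LlamaGrammar.from_string(GBNF)

# The grammar forces the model to commit to an answer before any
# explanation, so it can't wander through open-ended reasoning first.
out = llm(
    "Q: What does MoE stand for?\nA:",
    grammar=grammar,
    max_tokens=128,
)
print(out["choices"][0]["text"])
```

If the claimed speedup is real, it comes from the constraint cutting how many tokens the model emits on long tasks, not from any change to per-token decode speed.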
HN latched onto the money leak before the bug itself. A report that Claude Managed Agents append a malware-check reminder to every file read, then sometimes refuse to edit the code anyway, turned into a broader argument about opaque token spend and whether agent harnesses deserve more scrutiny.
The community liked this post for the same reason it immediately started arguing with it: it had real numbers. Q4_K_M came out looking like the practical sweet spot, but commenters quickly pushed on error bars, KV-cache settings, and whether the reported scores made sense at all.
This was not just another “local models are bad” rant. The thread blew up because it mixed a blunt reality check with a serious counterargument: some of the pain comes from small models, but a lot of it may come from the harness wrapped around them.
HN jumped straight to a sharper question than the score itself: was this a model win or a harness win? Dirac’s 65.2% TerminalBench run turned into a broader argument about context curation, AST-guided search, and why coding agents still live or die on tooling decisions.
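The thread never pinned down what Dirac's AST-guided search actually does, so here is a generic toy of the idea in Python: resolve symbols through the parse tree instead of grepping raw text, so the agent's context only fills with real definitions rather than comments and string matches. Everything here is a sketch, not Dirac's harness.

```python
# Toy AST-guided code search: find real definitions of a function,
# ignoring mentions in strings, comments, and call sites.
import ast
import pathlib

def find_function_defs(repo_root: str, name: str):
    """Yield (file, line) for every `def name(...)` in the repo,
    using the parse tree rather than a plain-text grep."""
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser can't handle
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if node.name == name:
                    yield path, node.lineno

for hit in find_function_defs(".", "handle_request"):
    print(*hit)
```

The tooling argument in the thread is essentially about this kind of precision: an agent that retrieves only true definitions wastes far fewer context tokens than one fed grep output.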
Anthropic is no longer pitching Claude as a chatbot that sits beside creative software. On April 28, 2026, it pushed Claude into Adobe, Blender, Autodesk, Ableton, Splice, and other tools, turning connectors into a serious product wedge.
Why it matters: OpenAI is moving deeper into enterprise infrastructure, not just model APIs. On April 28, 2026, it put GPT-5.5 on Amazon Bedrock, extended Codex to AWS, and launched Bedrock Managed Agents in limited preview.
Why it matters: FP8 inference only pays off if the accuracy collapse is fixable. vLLM says a two-level accumulation change lifted 128k needle-in-a-haystack accuracy from 13% to 89% while preserving FP8 decode speed.
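vLLM's note is about kernel internals, but the numerical idea is easy to demo. The sketch below simulates it with float16 standing in for FP8 (NumPy has no FP8 dtype): one long low-precision accumulation drifts, while short low-precision chunks folded into a float32 outer accumulator stay close to the reference. The chunk size and dtypes are illustrative, not vLLM's actual kernel parameters.

```python
# Two-level accumulation: low-precision inner chunks, high-precision
# outer running total. Simulated with fp16/fp32 since NumPy lacks FP8.
import numpy as np

def two_level_sum(x: np.ndarray, chunk: int = 128) -> np.float32:
    total = np.float32(0.0)                 # high-precision outer level
    for i in range(0, x.size, chunk):
        part = np.float16(0.0)              # low-precision inner level
        for v in x[i : i + chunk].astype(np.float16):
            part = np.float16(part + v)     # short low-precision run
        total = np.float32(total + np.float32(part))
    return total

x = np.random.default_rng(0).standard_normal(131_072).astype(np.float32)

naive = np.float16(0.0)
for v in x.astype(np.float16):
    naive = np.float16(naive + v)           # one long fp16 run drifts

print("fp32 reference :", x.sum(dtype=np.float32))
print("one-level fp16 :", float(naive))
print("two-level      :", float(two_level_sum(x)))
```

The pattern matters at 128k context because attention sums get very long: the longer the single low-precision run, the worse the rounding drift, which is consistent with the accuracy collapse vLLM reported before the fix.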
LocalLLaMA latched onto a very concrete claim: if a 27B model fits entirely in VRAM across two mismatched cards, even a weak second GPU can be better than spilling into system RAM for long-context decoding.
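None of the exact numbers were in the post, but the claim is easy to sanity-check with rough arithmetic. The sketch below assumes a Q4-style quant (~4.5 bits/weight), an fp16 KV cache, and a plausible 27B-class shape; all of these are assumptions to swap for real values for a given model and card pair.

```python
# Back-of-envelope: does a 27B model plus a long KV cache fit across,
# say, a 16 GB card and an 8 GB card? All constants are assumptions.
GIB = 1024**3

params      = 27e9
bytes_per_w = 0.56            # ~4.5 bits/weight for a Q4-style quant
weights_gib = params * bytes_per_w / GIB

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
layers, kv_heads, head_dim = 46, 8, 128   # plausible 27B-class shape
ctx = 32_768
kv_gib = 2 * layers * kv_heads * head_dim * 2 * ctx / GIB

need = weights_gib + kv_gib
print(f"weights ~{weights_gib:.1f} GiB, KV@{ctx} ~{kv_gib:.1f} GiB, "
      f"total ~{need:.1f} GiB vs 16+8=24 GiB VRAM")
```

At numbers like these the weights plus cache land under the combined VRAM but well over a single 16 GB card, which is the whole argument: even a slow second GPU keeps decoding on-device instead of paging through system RAM.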
r/singularity loved the premise immediately: a 13B model trapped at a 1930 knowledge cutoff. The upvotes came from the mix of novelty and real research value, because Talkie is not just a gimmick chat partner but a clean lab for studying what models learn without the modern web.
Hacker News was drawn less to the travel flex than to the hard limits: battery drain near 1% per minute, uncomfortable thermals, long-context slowdown, and the familiar feeling that local models still need babysitting on real work.