DeepSeek V4 Pro tied with GPT-5.2 on FoodTruck Bench, a 30-day agentic benchmark using 34 tools, arriving roughly 10 weeks after GPT-5.2 was tested at approximately 17x lower cost.
#benchmark
RSS FeedA benchmark comparing vision agents (browser-use) to structured API agents on the same admin panel found vision agents cost roughly 45x more — and failed to complete the task without a 14-step explicit walkthrough.
Poolside AI released Laguna XS.2 on April 28, 2026 under Apache 2.0 — a 33B total/3B active MoE model purpose-built for agentic coding, scoring 68.2% on SWE-bench Verified and deployable on a single consumer GPU.
Released April 29, 2026 under Modified MIT license, Mistral Medium 3.5 consolidates the company's chat, reasoning, and coding models into one 128B dense open-weight model with 256K context, scoring 77.6% on SWE-bench Verified.
A peer-reviewed study published in Science tested OpenAI's o1 on 76 real ER triage cases and found it achieved exact or near-exact diagnoses 67% of the time, versus 55% and 50% for two attending physicians who received identical patient data.
The latest ARC-AGI-3 scores show GPT-5.5 High at 0.43% and Claude Opus 4.7 at 0.18% — the most powerful models today remain effectively at zero on this AGI benchmark.
LocalLLaMA treated this less as a speed chart and more as a question about completion quality under a messy real prompt. On the same MacBook Pro M5 Max, Qwen 3.6 27B wrote more and faster, but Gemma 4 31B finished the game logic with far fewer tokens.
Anthropic put hard numbers on Claude's biology capability claims instead of vague lab hype. In 99 real-data bioinformatics problems, the company says experts were stumped on 23 and recent Claude models solved roughly 30% of that hardest slice.
Multimodal agents still pay a tax for chaining separate vision, audio, and text models. NVIDIA says Nemotron 3 Nano Omni collapses that stack into a 30B model with 256K context and up to 9.2x higher effective video system capacity at the same responsiveness target.
The community liked this post for the same reason it immediately started arguing with it: it had real numbers. Q4_K_M came out looking like the practical sweet spot, but commenters quickly pushed on error bars, KV-cache settings, and whether the reported scores made sense at all.
HN jumped straight to a sharper question than the score itself: was this a model win or a harness win? Dirac’s 65.2% TerminalBench run turned into a broader argument about context curation, AST-guided search, and why coding agents still live or die on tooling decisions.
Why it matters: FP8 inference only pays off if the accuracy collapse is fixable. vLLM says a two-level accumulation change lifted 128k needle-in-a-haystack accuracy from 13% to 89% while preserving FP8 decode speed.