A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching the 40% figure cited for Claude Opus 4.6.
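The post's loop idea can be sketched in a few lines. This is a minimal illustration, not the poster's actual harness: `propose_edit`, `apply_edit`, and `run_tests` are hypothetical callbacks standing in for the model's edit proposal, the patch application, and the repo's test suite.

```python
# Minimal sketch of a "verify after every edit" agent loop.
# propose_edit / apply_edit / run_tests are hypothetical stand-ins,
# not functions from the post.

def verify_after_edit(state, propose_edit, apply_edit, run_tests, max_attempts=3):
    """Apply one edit at a time; keep it only if verification still passes."""
    for _ in range(max_attempts):
        edit = propose_edit(state)
        candidate = apply_edit(state, edit)
        if run_tests(candidate):   # verify immediately after every edit
            return candidate       # keep the verified state
        # otherwise discard the edit and let the model try again
    return state                   # give up, return the last verified state


# Toy usage: "state" is a number, an edit increments it,
# and verification requires the state to stay below 3.
final = verify_after_edit(
    state=0,
    propose_edit=lambda s: 1,
    apply_edit=lambda s, e: s + e,
    run_tests=lambda s: s < 3,
)
print(final)  # → 1
```

The point of the pattern is that a failed edit is rolled back before the next one is attempted, so errors never compound across edits.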
OpenAI released GPT-5.3 Instant on March 3, 2026, a faster, cheaper derivative of GPT-5.3 for day-to-day use. The company says it matches GPT-4.1 latency and pricing while cutting unnecessary refusals, improving instruction following and web synthesis, and delivering double-digit hallucination reductions across internal evaluations.
Google introduced Gemini 3.1 Flash-Lite on March 3, 2026 as its fastest and most cost-efficient model in the Gemini 3 family. The model ships in Google AI Studio and Vertex AI with a 1 million-token context window and API-level reasoning budget controls.
A demo running Qwen 3.5 0.8B entirely in the browser using WebGPU and Transformers.js scored 440 on r/LocalLLaMA. No server, no API key, no installation required — just a modern browser with GPU access.
A widely-shared r/LocalLLaMA comparison of Qwen's smallest models across three generations (score: 681) reveals extraordinary efficiency gains. The Qwen 3.5 9B now outperforms the previous-generation 80B on several benchmarks, while the 2B handles video understanding better than many 7B models.
Anthropic's Claude Opus 4.6 independently solved a directed Hamiltonian cycle decomposition problem that computer science legend Donald Knuth had spent weeks working on. Knuth documented the achievement in a formal Stanford paper, marking one of the first times a top-tier computer scientist has formally credited an LLM with solving a genuine research problem.
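For readers unfamiliar with the problem class: a directed Hamiltonian cycle visits every vertex exactly once and returns to its start. The decomposition question Knuth posed is far harder than merely finding one cycle, and its details are not in the summary; this brute-force sketch only illustrates the underlying object.

```python
# Illustration only: brute-force search for a directed Hamiltonian cycle
# (visit every vertex exactly once, then return to the start vertex).
# Knuth's actual problem concerns decompositions and is much harder.

def hamiltonian_cycle(adj):
    """adj: dict vertex -> set of successors. Returns a cycle as a vertex list, or None."""
    vertices = list(adj)
    start = vertices[0]

    def extend(path, seen):
        if len(path) == len(vertices):
            # close the cycle back to the start if that edge exists
            return path if start in adj[path[-1]] else None
        for nxt in adj[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    return extend([start], {start})


# Directed 4-cycle with a chord: 0→1→2→3→0 is Hamiltonian.
example = {0: {1}, 1: {2}, 2: {3, 0}, 3: {0}}
print(hamiltonian_cycle(example))  # → [0, 1, 2, 3]
```

Deciding whether such a cycle exists is NP-complete, which is why even partial progress on structured instances is research-worthy.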
Alibaba Qwen team released the Qwen 3.5 small model series (0.8B to 9B). Models run in-browser via WebGPU and show dramatic benchmark improvements over previous generations.
Chinese AI lab DeepSeek plans to release its flagship V4 model this week: a 1-trillion-parameter native multimodal model built around Huawei Ascend chips, deliberately bypassing Nvidia and AMD.
Anthropic launched Claude Cowork plugins that embed Claude natively into Microsoft Excel, PowerPoint, Slack, Gmail, and Google Drive—enabling autonomous cross-app workflows for enterprise users.
Researchers have demonstrated that transformer models with fewer than 100 parameters can add two 10-digit numbers with 100% accuracy using digit tokenization, challenging assumptions about the minimum complexity needed for arithmetic reasoning.
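The summary doesn't specify the exact tokenizer or training tricks (such as digit reversal), but the core of digit tokenization is simple: each digit becomes its own token, so column-wise carries line up for the model instead of being hidden inside multi-digit BPE tokens. A rough sketch of the input format:

```python
# Sketch of digit-level tokenization, the input format the result relies on.
# The paper's exact tokenizer and any digit-ordering tricks are not described
# in the summary; this only shows the general idea.

def digit_tokenize(n):
    """Split an integer into one token per digit, e.g. 1234 -> ['1','2','3','4']."""
    return list(str(n))

def detokenize(tokens):
    """Join digit tokens back into an integer."""
    return int("".join(tokens))

a, b = 9876543210, 1234567890
# The model would see the problem as a flat sequence of single-digit tokens:
prompt = digit_tokenize(a) + ["+"] + digit_tokenize(b) + ["="]
# ...and emit the answer digit by digit:
answer_tokens = digit_tokenize(a + b)
print(answer_tokens)  # → ['1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '0']
```

With this representation, addition reduces to a local per-column operation plus a carry, which is plausibly why so few parameters suffice.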
Inception Labs has released Mercury 2, the first production-ready diffusion language model for reasoning. Running at over 1,000 tokens per second on Blackwell GPUs, it is dramatically faster and cheaper than leading autoregressive competitors.