A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving 2.5x inference speedup and 262k context on 48GB of memory.
LLM
Google has released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, achieving up to 3x inference speedup through speculative decoding without any loss in output quality.
OpenAI launched GPT-5.5 Instant as ChatGPT's new default model, replacing GPT-5.3 Instant. The update delivers 52.5% fewer hallucinations on high-stakes topics like medicine, law, and finance, along with more concise responses and enhanced personalization using Gmail and past conversations.
DeepSeek V4 Pro tied with GPT-5.2 on FoodTruck Bench, a 30-day agentic benchmark using 34 tools. It matched that score at approximately 17x lower cost, roughly 10 weeks after GPT-5.2 was tested.
GPT-5.5 Instant is now ChatGPT's default model, replacing GPT-5.3 Instant. The update cuts hallucinated claims by 52.5% on high-stakes medical, legal, and financial prompts, and adds Gmail-based personalization and memory-source transparency.
Sakana AI released KAME, a tandem speech-to-speech architecture that pairs a low-latency front-end S2S model with a back-end LLM via an oracle stream, achieving MT-Bench 6.43 with near-zero response latency and eliminating the typical 2.1-second pipeline delay.
Poolside AI released Laguna XS.2 on April 28, 2026 under Apache 2.0 — a 33B total/3B active MoE model purpose-built for agentic coding, scoring 68.2% on SWE-bench Verified and deployable on a single consumer GPU.
Released April 29, 2026 under a Modified MIT license, Mistral Medium 3.5 consolidates the company's chat, reasoning, and coding models into one 128B dense open-weight model with 256K context, scoring 77.6% on SWE-bench Verified.
Anthropic unveiled Claude Opus 4.7 and ten pre-built financial services AI agents at an invite-only Wall Street briefing on May 5, alongside full Microsoft 365 integration and a Moody's data partnership covering 600 million businesses.
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, most token generation speed gaps between llama.cpp and vLLM are expected to close.
DeepClaude keeps Claude Code's complete agent loop — file editing, bash, subagent spawning — while routing API calls to DeepSeek V4 Pro or other backends, cutting output token costs from $15/M to $0.87/M.
Andrej Karpathy shared highlights from his Sequoia Ascent 2026 fireside chat, arguing that LLMs open genuinely new categories of functionality, not just faster versions of what already existed.