Anthropic introduced Claude Sonnet 4.6 on February 17, 2026, adding a beta 1M token context window while keeping API pricing at $3/$15 per million tokens. The company says the new default model improves coding, computer use, and long-context reasoning enough to cover more work that previously pushed users toward Opus-class models.
LLM
An r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
A Hacker News submission highlighted Andrej Karpathy's Autoresearch repo, a minimal setup where an AI agent edits one training file, runs fixed 5-minute experiments, and keeps only changes that improve `val_bpb`.
A Hacker News post surfaced Unsloth's Qwen3.5 local guide, which lays out memory targets, reasoning-mode controls, and llama.cpp commands for running 27B and 35B-A3B models on local hardware.
Mistral has launched Mistral 3, a new open multimodal family with dense 14B, 8B, and 3B models under Apache 2.0, plus a larger Mistral Large 3. The company says the lineup was trained from scratch and tuned for both Blackwell NVL72 systems and single-node 8xA100 or 8xH100 deployments.
A merged llama.cpp PR adds MCP server selection, tool calls, prompts, resources, and an agentic loop to the WebUI stack, moving local inference closer to full agent workflows.
A high-scoring LocalLLaMA post highlights Open WebUI’s Open Terminal: a Docker or bare-metal execution layer that lets local models run commands, edit files, and return artifacts through chat.
LocalLLaMA users are tracking llama.cpp’s merged autoparser work, which analyzes model templates to support reasoning and tool-call formats with less custom parser code.
Running Nvidia PersonaPlex 7B in Swift on Apple Silicon moves local voice agents closer to real time
An HN post on a Swift/MLX port of Nvidia PersonaPlex 7B shows how chunking, buffering, and interrupt handling matter as much as raw model quality for local speech-to-speech agents.
Google’s February Gemini update packages Gemini 3.1 Pro, Deep Think, Nano Banana 2, Veo templates, and new Canvas tools into one release. The drop shows Google pushing the Gemini app as a front end for reasoning, image, music, and video workflows rather than a plain chat surface.
OpenAI and Amazon said AWS customers will get a Stateful Runtime Environment in Amazon Bedrock for production-grade agent workflows. The announcement moves agent execution closer to managed AWS infrastructure with persistent state, governance, and long-running workflow support.
Google DeepMind said on March 3, 2026, that Gemini 3.1 Flash-Lite delivers faster performance at a lower price than Gemini 2.5 Flash. Google is rolling the model out in preview via Google AI Studio and Vertex AI for high-volume, latency-sensitive workloads.