The thread split between the convenience of “local LLM in Chrome” and corrections about WebGPU acceleration, model identity, and browser-controlled limits.
A viral LocalLLaMA post describes how Qwen3.6 35B A3B transformed complex workflows by pairing Codex for task execution with skill documentation and feeding those skills to the pi agent, automating VPS management, PDF conversion, and more.
A community user achieved 110 tokens/second running Qwen3.6 35B A3B on an RTX 4070 Super 12GB via ik_llama.cpp, a llama.cpp fork whose CPU offload optimizations reportedly let it significantly outperform upstream llama.cpp's Multi-Token Prediction implementation.
The popular text-generation-webui project, rebranded as TextGen, has relaunched as a no-install native desktop app for Windows, Linux, and macOS. Built on a minimal Electron integration, it positions itself as a fully open-source alternative to LM Studio.
A LocalLLaMA user built a 768GB RAM system using discontinued Intel Optane Persistent Memory from the secondhand market, running the 1-trillion-parameter Kimi K2.5 model locally at over 4 tokens per second.
NVIDIA AI has released Star Elastic, an architecture that packs 30B, 23B, and 12B reasoning models into a single checkpoint, enabling zero-shot slicing to switch dynamically between model scales without separate downloads.
A LocalLLaMA user shares their config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and achieving a draft acceptance rate above 80%.
A LocalLLaMA user has shared a detailed guide for running Qwen3.6 27B with Multi-Token Prediction support in llama.cpp, achieving a 2.5x inference speedup and 262K context on 48GB of memory.
AMD's Ryzen AI Max Pro 495 (Gorgon Halo) has leaked with 192GB of unified memory, up 50% from the 128GB in the current Strix Halo. The upgrade would enable running significantly larger AI models locally without discrete GPU memory limits.
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, most token generation speed gaps between llama.cpp and vLLM are expected to close.
A local LLM researcher achieved 95.7% on SimpleQA using Qwen3.6-27B with agentic search on a single consumer GPU.