A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
LLM
vLLM said NVIDIA used the framework for the first MLPerf vision-language benchmark submission built on Qwen3-VL. NVIDIA’s accompanying blog places that result inside a broader Blackwell Ultra push that claims up to 2.7x throughput gains and more than 60% lower token cost on the same infrastructure for some workloads.
A high-scoring LocalLLaMA thread treated merged PR #19378 as a meaningful step toward more practical multi-GPU inference in llama.cpp. The catch is that the new <code>--split-mode tensor</code> path is still explicitly experimental, strongest today on CUDA, and still rough on ROCm and Vulkan.
A Hacker News discussion focused on SkyPilot's argument that coding agents work better when they read papers and competing implementations before editing code. In the reported llama.cpp experiments, that research-first loop produced 5 viable optimizations and improved TinyLlama text generation by 15% on x86 and 5% on ARM for about $29.
On April 9, 2026, Google DeepMind said on X that Gemma 4 crossed 10M downloads in its first week and that the Gemma family overall has topped 500M downloads. Google positions Gemma 4 as an open model family built for reasoning, agentic workflows, and efficient deployment on local hardware.
On April 8, 2026, Anthropic highlighted a new engineering post describing Managed Agents, its hosted service for long-running agent work on the Claude Platform. Anthropic says the system separates session, harness, and sandbox layers so agents can recover more cleanly from failure and connect to customer infrastructure with fewer assumptions.
On April 9, 2026, OpenAI said on X that it is introducing a new $100/month ChatGPT Pro tier aimed at heavier Codex use. OpenAI says the existing $200 Pro tier will remain the highest-usage option while Plus usage is being rebalanced toward more sessions across a week.
A high-scoring LocalLLaMA post argued that merging llama.cpp PR #21534 finally cleared the known Gemma 4 issues in current master. The community focus was not just the fix itself, but the operational details around tokenizer correctness, chat templates, memory flags, and the warning to avoid CUDA 13.2.
A Hacker News discussion grew around public <code>vercel-plugin</code> hooks that route consent through Claude context, record Bash commands in base telemetry, and store a persistent device ID. The dispute is less about a confirmed exploit than about disclosure, scope, and plugin boundaries in agent tools.
Google DeepMind introduced Gemma 4 on X as a family of open models designed to run on developers’ own hardware. Its April 2, 2026 developer post ties that launch to on-device agentic workflows, support for more than 140 languages, and deployment paths through AICore, AI Edge Gallery, and LiteRT-LM.
A LocalLLaMA post argues that recent llama.cpp fixes justify refreshed Gemma 4 GGUF downloads, especially for users relying on local inference pipelines.
A LocalLLaMA thread highlighted Hugging Face's decision to move Safetensors under the PyTorch Foundation, keeping compatibility intact while shifting governance to a neutral home.