Meta says custom silicon is critical to scaling next-generation AI and has published a roadmap update for its MTIA family. The company says it accelerated development enough to release four generations in two years as model architectures keep changing faster than traditional chip cycles.
#inference
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A new llama.cpp change turns <code>--reasoning-budget</code> into a real sampler-side limit instead of a template stub. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, especially for local Qwen 3.5 deployments.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.
Microsoft says Fireworks AI is now part of Microsoft Foundry, bringing high-performance, low-latency open-model inference to Azure. The launch emphasizes day-zero access to leading open models, custom-model deployment, and enterprise controls in one place.
A Launch HN thread pulled RunAnywhere’s MetalRT and RCLI into focus, centering attention on a low-latency STT-LLM-TTS stack that runs on Apple Silicon without cloud APIs.
A fast-rising LocalLLaMA post resurfaced David Noel Ng's write-up on duplicating a seven-layer block inside Qwen2-72B, a no-training architecture tweak that reportedly lifted multiple Open LLM Leaderboard benchmarks.
A Launch HN thread pushed RunAnywhere's RCLI into view as an Apple Silicon-first macOS voice AI stack that combines STT, LLM, TTS, local RAG, and 38 system actions without relying on cloud APIs.
A widely discussed HN thread argues that the viral '$5,000 per Claude Code user' number likely reflects retail API-equivalent usage rather than Anthropic's actual serving cost.
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
A r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
A well-received PSA on r/LocalLLaMA argues that convenience layers such as Ollama and LM Studio can change model behavior enough to distort evaluation. The more durable lesson from the thread is reproducibility: hold templates, stop tokens, sampling, runtime versions, and quantization constant before judging a model.