Google is adding Flex and Priority service tiers to the Gemini API so developers can choose lower-cost synchronous inference for background work or higher-assurance routing for critical traffic. The change gives agent builders a cleaner way to separate cost and reliability without splitting architectures across multiple APIs.
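With tiers like these, the routing decision can live in application code rather than in separate stacks. A minimal sketch, assuming a simple per-request criticality flag; the tier names come from the announcement, but the pick_tier helper and the way the tier is attached to a call are illustrative placeholders, not the shipped SDK surface.

```python
# Sketch: one code path that routes requests to different service tiers.
# The tier names ("flex", "priority") come from the announcement; how a tier
# is attached to a request is an illustrative assumption, not the real SDK.

from dataclasses import dataclass


@dataclass
class InferenceRequest:
    prompt: str
    critical: bool  # e.g. live agent step vs. background summarization


def pick_tier(req: InferenceRequest) -> str:
    # Latency-sensitive traffic pays for the higher-assurance tier;
    # background work takes the cheaper synchronous tier.
    return "priority" if req.critical else "flex"


def run(req: InferenceRequest) -> dict:
    tier = pick_tier(req)
    # Placeholder for the actual API call; in a real client the tier would be
    # passed wherever the SDK exposes it once the feature ships.
    return {"tier": tier, "prompt": req.prompt}


if __name__ == "__main__":
    print(run(InferenceRequest("summarize yesterday's logs", critical=False)))
    print(run(InferenceRequest("answer the live support ticket", critical=True)))
```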
#inference
Cloudflare moved Workers AI into larger-model territory on March 19, 2026, by adding Moonshot AI’s Kimi K2.5. The company is pitching a single stack for durable agent execution, large-context inference, and lower-cost open-model deployment.
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
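For anyone who wants to validate that kind of claim locally, the check is simple in principle: compare token ids, not decoded strings. A minimal harness sketch, assuming two generate callables with matching signatures; the function names are placeholders, not the MLX runtime's actual API.

```python
# Sketch: how a "bit-for-bit identical to greedy" claim can be checked.
# generate_greedy and generate_dflash are placeholders for whatever the
# runtime exposes; the only requirement is that both return token ids.

from typing import Callable, List, Sequence


def outputs_match(
    prompts: Sequence[str],
    generate_greedy: Callable[[str, int], List[int]],
    generate_dflash: Callable[[str, int], List[int]],
    max_tokens: int = 256,
) -> bool:
    for prompt in prompts:
        baseline = generate_greedy(prompt, max_tokens)
        accelerated = generate_dflash(prompt, max_tokens)
        if baseline != accelerated:  # exact token-id equality, not string similarity
            print(f"mismatch on {prompt!r}: {baseline[:8]} vs {accelerated[:8]}")
            return False
    return True
```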
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
A high-scoring LocalLLaMA thread treated merged PR #19378 as a meaningful step toward more practical multi-GPU inference in llama.cpp. The catch is that the new --split-mode tensor path is still explicitly experimental, strongest today on CUDA, and rough on ROCm and Vulkan.
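For readers who want to experiment, a rough launch sketch, assuming a multi-GPU CUDA box; the --split-mode tensor value is the one named in the thread, while the model path, offload count, and port are placeholders.

```python
# Sketch: launching llama-server with the experimental tensor split mode.
# Only the "tensor" value is new here; the other flags are standard llama.cpp
# server options, and the model path is a placeholder.

import subprocess

cmd = [
    "llama-server",
    "-m", "models/model.gguf",     # placeholder GGUF path
    "--split-mode", "tensor",      # the new, still-experimental mode
    "-ngl", "999",                 # offload all layers to the GPUs
    "--port", "8080",
]

# CUDA is reportedly the most solid backend for this mode today; expect
# rough edges on ROCm and Vulkan.
subprocess.run(cmd, check=True)
```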
On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
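The VRAM side of that debate is easy to put numbers on. A back-of-envelope sketch, with every architecture figure (layer count, KV heads, head dim, bits per weight) an illustrative assumption rather than Qwen3.5 27B's real config:

```python
# Back-of-envelope VRAM math behind the dense-vs-MoE debate. All architecture
# numbers below are illustrative assumptions, not a specific model's config.

def weights_gib(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, one cache entry per layer per token.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30


# Dense ~27B model, ~4.5-bit quant, 32K context, assumed GQA layout.
dense = weights_gib(27, 4.5) + kv_cache_gib(layers=48, kv_heads=8,
                                            head_dim=128, ctx=32_768)

# An MoE with similar total params but only a few billion active still keeps
# *all* expert weights resident, so its weight footprint is unchanged; what it
# buys is compute per token, not VRAM.
print(f"dense 27B @ ~4.5bpw + 32K KV cache: ~{dense:.1f} GiB")
```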
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.
A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
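The "lossless" part follows from the standard greedy draft-and-verify rule rather than from anything DFlash-specific: the target model only ever commits tokens it would have chosen itself, so output matches plain greedy decoding regardless of draft quality. A minimal sketch of that acceptance loop (sequential target calls for clarity; a real implementation verifies the whole draft block in one forward pass):

```python
# Sketch of the generic greedy draft-and-verify loop behind lossless
# speculative decoding. This is not DFlash's block-diffusion drafting itself,
# just the acceptance rule that makes any such draft model output-preserving.

from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token id given a prefix


def speculative_greedy(prompt: List[int], target: NextToken, draft: NextToken,
                       k: int = 4, max_new: int = 32) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft cheaply proposes k tokens.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest prefix it agrees with, then
        # contributes one token of its own (the correction / extension).
        # A real runtime scores all k positions in a single forward pass.
        ctx = list(seq)
        for t in proposal:
            if t != target(ctx):
                break
            ctx.append(t)
        ctx.append(target(ctx))
        seq = ctx
    return seq[: len(prompt) + max_new]
```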
A LocalLLaMA explainer argues that Gemma 4 E2B/E4B gain their efficiency from Per-Layer Embeddings. The key point is that many of those parameters behave more like large token lookup tables than always-active compute-heavy layers, which changes the inference trade-off.
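A minimal sketch of the idea with a toy layer; the dimensions and the simple additive merge are illustrative assumptions, not Gemma 4's actual design.

```python
# Sketch of the Per-Layer Embeddings idea: each layer owns a token-indexed
# lookup table whose rows are gathered per token and folded into the hidden
# state. Sizes and the additive merge are illustrative, not Gemma's exact layout.

import torch
import torch.nn as nn


class PerLayerEmbeddingBlock(nn.Module):
    def __init__(self, vocab: int, d_model: int, d_ple: int):
        super().__init__()
        self.ple = nn.Embedding(vocab, d_ple)    # lookup table, not always-active compute
        self.proj = nn.Linear(d_ple, d_model, bias=False)
        self.ffn = nn.Sequential(                # stand-in for the usual layer compute
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Only the rows for token_ids are touched, so this table can sit in
        # slower memory and be streamed, unlike the dense FFN weights.
        h = h + self.proj(self.ple(token_ids))
        return h + self.ffn(h)


block = PerLayerEmbeddingBlock(vocab=32_000, d_model=512, d_ple=64)
ids = torch.randint(0, 32_000, (1, 16))
out = block(torch.randn(1, 16, 512), ids)   # (1, 16, 512)
```

The table costs vocab x d_ple parameters per layer that are gathered row by row, so they add memory rather than per-token matmul work, which is the trade-off the explainer is pointing at.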
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.
On March 17, 2026, NVIDIADC described Groq 3 LPX on X as a new rack-scale low-latency inference accelerator for the Vera Rubin platform. NVIDIA’s March 16 press release and technical blog say LPX brings 256 LPUs, 128 GB of on-chip SRAM, and 640 TB/s of scale-up bandwidth into a heterogeneous inference path with Vera Rubin NVL72 for agentic AI workloads.