Google is adding Flex and Priority service tiers to the Gemini API so developers can choose lower-cost synchronous inference for background work or higher-assurance routing for critical traffic. The change gives agent builders a cleaner way to separate cost and reliability without splitting architectures across multiple APIs.
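With tiers like these, the routing decision can live in application code rather than in separate stacks. A minimal sketch, assuming a simple per-request criticality flag; the tier names come from the announcement, but the pick_tier helper and the way the tier is attached to a call are illustrative placeholders, not the shipped SDK surface.

```python
# Sketch: one code path that routes requests to different service tiers.
# The tier names ("flex", "priority") come from the announcement; how a tier
# is attached to a request is an illustrative assumption, not the real SDK.

from dataclasses import dataclass


@dataclass
class InferenceRequest:
    prompt: str
    critical: bool  # e.g. live agent step vs. background summarization


def pick_tier(req: InferenceRequest) -> str:
    # Latency-sensitive traffic pays for the higher-assurance tier;
    # background work takes the cheaper synchronous tier.
    return "priority" if req.critical else "flex"


def run(req: InferenceRequest) -> dict:
    tier = pick_tier(req)
    # Placeholder for the actual API call; in a real client the tier would be
    # passed wherever the SDK exposes it once the feature ships.
    return {"tier": tier, "prompt": req.prompt}


if __name__ == "__main__":
    print(run(InferenceRequest("summarize yesterday's logs", critical=False)))
    print(run(InferenceRequest("answer the live support ticket", critical=True)))
```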
#inference
Cloudflare moved Workers AI into larger-model territory on March 19, 2026, by adding Moonshot AI’s Kimi K2.5. The company is pitching a single stack for durable agent execution, large-context inference, and lower-cost open-model deployment.
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
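For anyone who wants to validate that kind of claim locally, the check is simple in principle: compare token ids, not decoded strings. A minimal harness sketch, assuming two generate callables with matching signatures; the function names are placeholders, not the MLX runtime's actual API.

```python
# Sketch: how a "bit-for-bit identical to greedy" claim can be checked.
# generate_greedy and generate_dflash are placeholders for whatever the
# runtime exposes; the only requirement is that both return token ids.

from typing import Callable, List, Sequence


def outputs_match(
    prompts: Sequence[str],
    generate_greedy: Callable[[str, int], List[int]],
    generate_dflash: Callable[[str, int], List[int]],
    max_tokens: int = 256,
) -> bool:
    for prompt in prompts:
        baseline = generate_greedy(prompt, max_tokens)
        accelerated = generate_dflash(prompt, max_tokens)
        if baseline != accelerated:  # exact token-id equality, not string similarity
            print(f"mismatch on {prompt!r}: {baseline[:8]} vs {accelerated[:8]}")
            return False
    return True
```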
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
A high-scoring LocalLLaMA thread treated merged PR #19378 as a meaningful step toward more practical multi-GPU inference in llama.cpp. The catch is that the new --split-mode tensor path is still explicitly experimental, strongest today on CUDA, and rough on ROCm and Vulkan.
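For readers who want to experiment, a rough launch sketch, assuming a multi-GPU CUDA box; the --split-mode tensor value is the one named in the thread, while the model path, offload count, and port are placeholders.

```python
# Sketch: launching llama-server with the experimental tensor split mode.
# Only the "tensor" value is new here; the other flags are standard llama.cpp
# server options, and the model path is a placeholder.

import subprocess

cmd = [
    "llama-server",
    "-m", "models/model.gguf",     # placeholder GGUF path
    "--split-mode", "tensor",      # the new, still-experimental mode
    "-ngl", "999",                 # offload all layers to the GPUs
    "--port", "8080",
]

# CUDA is reportedly the most solid backend for this mode today; expect
# rough edges on ROCm and Vulkan.
subprocess.run(cmd, check=True)
```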
On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
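The VRAM side of that debate is easy to put numbers on. A back-of-envelope sketch, with every architecture figure (layer count, KV heads, head dim, bits per weight) an illustrative assumption rather than Qwen3.5 27B's real config:

```python
# Back-of-envelope VRAM math behind the dense-vs-MoE debate. All architecture
# numbers below are illustrative assumptions, not a specific model's config.

def weights_gib(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, one cache entry per layer per token.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30


# Dense ~27B model, ~4.5-bit quant, 32K context, assumed GQA layout.
dense = weights_gib(27, 4.5) + kv_cache_gib(layers=48, kv_heads=8,
                                            head_dim=128, ctx=32_768)

# An MoE with similar total params but only a few billion active still keeps
# *all* expert weights resident, so its weight footprint is unchanged; what it
# buys is compute per token, not VRAM.
print(f"dense 27B @ ~4.5bpw + 32K KV cache: ~{dense:.1f} GiB")
```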
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.
A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
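The "lossless" part follows from the standard greedy draft-and-verify rule rather than from anything DFlash-specific: the target model only ever commits tokens it would have chosen itself, so output matches plain greedy decoding regardless of draft quality. A minimal sketch of that acceptance loop (sequential target calls for clarity; a real implementation verifies the whole draft block in one forward pass):

```python
# Sketch of the generic greedy draft-and-verify loop behind lossless
# speculative decoding. This is not DFlash's block-diffusion drafting itself,
# just the acceptance rule that makes any such draft model output-preserving.

from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token id given a prefix


def speculative_greedy(prompt: List[int], target: NextToken, draft: NextToken,
                       k: int = 4, max_new: int = 32) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft cheaply proposes k tokens.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest prefix it agrees with, then
        # contributes one token of its own (the correction / extension).
        # A real runtime scores all k positions in a single forward pass.
        ctx = list(seq)
        for t in proposal:
            if t != target(ctx):
                break
            ctx.append(t)
        ctx.append(target(ctx))
        seq = ctx
    return seq[: len(prompt) + max_new]
```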
A LocalLLaMA explainer argues that Gemma 4 E2B/E4B gain their efficiency from Per-Layer Embeddings. The key point is that many of those parameters behave more like large token lookup tables than always-active compute-heavy layers, which changes the inference trade-off.
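A minimal sketch of the idea with a toy layer; the dimensions and the simple additive merge are illustrative assumptions, not Gemma 4's actual design.

```python
# Sketch of the Per-Layer Embeddings idea: each layer owns a token-indexed
# lookup table whose rows are gathered per token and folded into the hidden
# state. Sizes and the additive merge are illustrative, not Gemma's exact layout.

import torch
import torch.nn as nn


class PerLayerEmbeddingBlock(nn.Module):
    def __init__(self, vocab: int, d_model: int, d_ple: int):
        super().__init__()
        self.ple = nn.Embedding(vocab, d_ple)    # lookup table, not always-active compute
        self.proj = nn.Linear(d_ple, d_model, bias=False)
        self.ffn = nn.Sequential(                # stand-in for the usual layer compute
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Only the rows for token_ids are touched, so this table can sit in
        # slower memory and be streamed, unlike the dense FFN weights.
        h = h + self.proj(self.ple(token_ids))
        return h + self.ffn(h)


block = PerLayerEmbeddingBlock(vocab=32_000, d_model=512, d_ple=64)
ids = torch.randint(0, 32_000, (1, 16))
out = block(torch.randn(1, 16, 512), ids)   # (1, 16, 512)
```

The table costs vocab x d_ple parameters per layer that are gathered row by row, so they add memory rather than per-token matmul work, which is the trade-off the explainer is pointing at.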
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.
On March 17, 2026, NVIDIADC described Groq 3 LPX on X as a new rack-scale low-latency inference accelerator for the Vera Rubin platform. NVIDIA’s March 16 press release and technical blog say LPX brings 256 LPUs, 128 GB of on-chip SRAM, and 640 TB/s of scale-up bandwidth into a heterogeneous inference path with Vera Rubin NVL72 for agentic AI workloads.