#inference

LLM X Mar 26, 2026 2 min read

Vercel launches unified reporting for AI Gateway usage across providers, users, and pricing tiers

Vercel said on March 25, 2026 that its Custom Reporting API for AI Gateway is now in beta for Pro and Enterprise plans. Vercel's blog says teams can query cost, token usage, and request volume across AI Gateway traffic, including BYOK requests, and break results down by model, provider, user ID, tags, and credential type.

#vercel #ai-gateway #cost-observability

LLM Hacker News Mar 26, 2026 2 min read

A ground-up quantization guide clarifies where LLM cost really lives

ngrok’s March 25, 2026 explainer lays out how quantization can make LLMs roughly 4x smaller and 2x faster, and what the real 4-bit versus 8-bit tradeoff looks like. Hacker News drove the post to 247 points and 46 comments, reopening the discussion around memory bottlenecks and the economics of local inference.
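The size claim comes down to simple arithmetic, sketched below. This is an illustration, not a calculation from the ngrok post: the 7B parameter count and the 16-bit baseline are assumptions, and the helper name `weight_footprint_gb` is hypothetical. It also ignores the small overhead that real quantization formats add for scales and zero points.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a hypothetical 7B-parameter model stored at 16 bits per weight.
PARAMS = 7e9

sizes = {bits: weight_footprint_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in sizes.items():
    ratio = sizes[16] / gb
    print(f"{bits:>2}-bit: {gb:.1f} GB ({ratio:.0f}x smaller than 16-bit)")
```

Under these assumptions, going from 16-bit to 4-bit weights shrinks storage by a factor of four, and 8-bit gives a factor of two. Any speedup depends on memory bandwidth and kernel support, which this sketch does not model.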

#quantization #llm #inference

LLM Reddit Mar 26, 2026 2 min read

Intel’s Arc Pro B70/B65 lands squarely in the local LLM conversation

A LocalLLaMA thread about Intel’s Arc Pro B70 and B65 reached 213 upvotes and 133 comments. Intel says the B70 is available from March 25, 2026 with a suggested starting price of $949, while the B65 follows in mid-April.

#intel #gpu #vram

LLM Hacker News Mar 26, 2026 2 min read

TurboQuant pushes KV cache compression into the center of LLM systems design

Google Research introduced TurboQuant on March 24, 2026 as a compression approach for KV cache and vector search bottlenecks. Hacker News pushed the post to 491 points and 129 comments, reflecting how central memory efficiency has become for long-context inference.

#quantization #kv-cache #inference

LLM Mar 25, 2026 2 min read

AWS and Cerebras plan a disaggregated inference stack for Amazon Bedrock

AWS and Cerebras said on March 13, 2026 that they are building a high-speed inference offering for Amazon Bedrock. The design assigns prefill work to AWS Trainium and decode work to Cerebras CS-3 systems.

#aws #cerebras #inference

LLM Reddit Mar 24, 2026 1 min read

LocalLLaMA highlights FlashAttention-4 gains on Blackwell and the limits for everyday GPUs

A technical LocalLLaMA thread translated the FlashAttention-4 paper into practical deployment guidance. It emphasized large gains on Blackwell and faster Python-based kernel development, and noted that most A100 and consumer-GPU users cannot yet get the full benefits.

#flashattention #inference #gpu

LLM X Mar 23, 2026 1 min read

Cloudflare brings Kimi K2.5 to Workers AI and tunes the stack for agents

Cloudflare said on X on March 19 that Kimi K2.5 is now available on Workers AI. The launch pairs a frontier open-source model with platform features aimed at lowering latency and cost for agent workloads.

#cloudflare #workers-ai #kimi-k2.5

LLM X Mar 23, 2026 2 min read

Cloudflare brings Kimi K2.5 to Workers AI and shows how it cut internal agent costs

Cloudflare said on March 20, 2026 that Kimi K2.5 is now available on Workers AI so developers can run agents end-to-end on its platform. The linked Cloudflare blog says the model ships with a 256K context window, multi-turn tool calling, vision, and structured outputs, and that one internal agent workload cut costs by 77% after the switch.

#cloudflare #workers-ai #kimi-k2.5

LLM Hacker News Mar 23, 2026 2 min read

Flash-MoE Shows 397B Qwen Inference on a 48GB MacBook Pro

A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.

#llm #mixture-of-experts #metal

LLM Mar 22, 2026 2 min read

Google rolls out Gemini 3.1 Flash-Lite preview for high-volume, cost-sensitive LLM workloads

Google has introduced Gemini 3.1 Flash-Lite in preview through Google AI Studio and Vertex AI. The company is positioning it as the fastest and most cost-efficient model in the Gemini 3 family for large-scale inference jobs.

#google #gemini #llm

LLM X Mar 22, 2026 2 min read

Cloudflare brings Kimi K2.5 to Workers AI and says an internal security-review agent cut costs by 77%

Cloudflare said on March 20, 2026 that Kimi K2.5 is available on Workers AI so developers can build end-to-end agents on Cloudflare’s platform. Its launch post says the model brings a 256K context window, multi-turn tool calling, vision inputs, and structured outputs, and that an internal security-review agent processing 7B tokens per day cut costs by 77% after the switch.

#cloudflare #workers-ai #kimi-k2-5

LLM Hacker News Mar 22, 2026 2 min read

Hacker News Flags Mamba-3 as an Inference-First State Space Model Push

Together AI and collaborators introduced Mamba-3 as an inference-first state space model. Hacker News traction centered on faster prefill+decode latency, a stronger recurrence design, and open-sourced high-performance kernels.

#mamba #ssm #inference