QVAC SDK 0.12.0 adds TurboQuant as an opt-in KV-cache compression feature for local LLMs. The company says it can cut runtime context memory by up to 5x and put 262K-token 4B-model contexts within reach of 8GB consumer GPUs.
LLM
xAI says Composer 2.5 is now available inside Grok Build. The post describes it as strong at complex instructions and long-running tasks, and it has drawn more than 640K views as coding-agent competition tightens.
OpenAI frontier models and Codex are now generally available on Amazon Bedrock. The post drew more than 1.2M views and points to a broader AWS path for enterprise AI, including future Daybreak security capabilities.
NVIDIA is packaging a 550B-parameter MoE model with agent tooling instead of treating the model as a standalone release. The pitch is concrete: up to 5x faster inference, up to 30% lower cost, and availability beginning June 4.
The LocalLLaMA post drew attention because the headline number is practical: a reported 47% reduction in KV VRAM for RDNA3 users experimenting outside CUDA.
The HN reaction centered on the README as much as the code: a small engine that turns vLLM concepts into a guided implementation path.
The HN discussion focused less on funding theater and more on whether a multi-model gateway can stay defensible as AI workloads move into production.
Anthropic’s May 29 platform notes move Claude Managed Agents deeper into AWS operations. Webhooks, multiagent orchestration, and self-hosted sandboxes are now available on Claude Platform on AWS, with new IAM actions and a managed policy for self-hosted execution.
NVIDIA is targeting the hidden cost of LLM serving experiments. Its DynoSim post says the Rust simulator can screen deployment choices before GPU validation, with a blog example replaying 23,608 requests about 1,500x faster than real time.
xAI is turning Grok Build from a CLI-backed experience into a public API beta. The headline number is pricing: $1 per million input tokens and $2 per million output tokens for agentic coding workloads.
Liquid AI's new LFM2.5 8B-A1B MoE model delivers 253 tokens/s on M5 Max, runs in under 6GB of memory on mobile, and achieves 18,500 output tokens/s on H100, all while outperforming similarly sized dense models on key benchmarks.
The expensive part of LLM inference is often the experiment itself. NVIDIA says DynoSim replayed a 23,608-request trace on an Apple M4 MacBook Air in 2.41 seconds, about 1,500x faster than the 60.1-minute serving window it modeled.
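The DynoSim speedup figure can be sanity-checked from the numbers quoted above. The sketch below is plain arithmetic on those reported values (23,608 requests, 2.41 seconds, 60.1 minutes); the variable names are illustrative and nothing here calls DynoSim itself.

```python
# Figures as reported in the item above; variable names are illustrative.
requests = 23_608               # requests in the replayed trace
replay_seconds = 2.41           # reported simulator wall-clock time
modeled_window_minutes = 60.1   # serving window the trace represents

# Convert the modeled window to seconds so both durations share a unit.
modeled_window_seconds = modeled_window_minutes * 60

# Speedup is the modeled duration divided by the time taken to replay it.
speedup = modeled_window_seconds / replay_seconds

# Request rates, for context: the simulated replay versus the modeled window.
replay_rate = requests / replay_seconds
modeled_rate = requests / modeled_window_seconds

print(f"speedup: {speedup:.0f}x")
print(f"replay rate: {replay_rate:.0f} requests/s")
print(f"modeled rate: {modeled_rate:.2f} requests/s")
```

The ratio works out to roughly 1,496x, which is consistent with the "about 1,500x" figure in the summary.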