Cloudflare is turning AutoRAG into AI Search, a retrieval primitive agents can create and query from Workers. The open beta adds BM25 plus vector search, built-in storage and index, metadata boosting, and cross-instance search with concrete free and paid limits.
LLM
LocalLLaMA did not just vent about weaker models; the thread turned the feeling into questions about provider routing, quantization, peak-time behavior, and how to prove a silent downgrade. The evidence is not settled, but the anxiety is real.
HN did not just ask whether Claude Opus 4.7 scores higher; it asked whether the product behavior is stable enough to build around. The thread quickly moved into adaptive thinking, tokenizer costs, safety filters, and bruised trust after recent Claude complaints.
Cloudflare is trying to make model choice less sticky: AI Gateway now routes Workers AI calls to 70+ models across 12+ providers through one interface. For agent builders, the important part is not the catalog alone but spend controls, retry behavior, and failover in workflows that may chain ten inference calls for one task.
PrismML is testing whether smaller open models can stay useful by changing the weight format, not only the architecture. Ternary Bonsai ships 8B, 4B and 1.7B models at 1.58 bits, with the 8B variant listed at 1.75GB.
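As a rough check on those numbers, the sketch below works out the information-theoretic cost of a three-valued weight and the raw size of a model stored at a given bit width. The function names are hypothetical, and treating "8B" as exactly 8e9 parameters is an assumption; the calculation ignores whatever else the released file contains, so it is not expected to reproduce the listed 1.75GB exactly.

```python
import math


def bits_per_ternary_weight() -> float:
    """Information content of one weight restricted to three values."""
    return math.log2(3)


def raw_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Size in decimal gigabytes if every parameter used bits_per_weight bits."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Assumption: "8B" is taken as exactly 8e9 parameters.
    print(f"bits per ternary weight: {bits_per_ternary_weight():.3f}")
    print(f"8B params at 1.58 bits: {raw_size_gb(8e9, 1.58):.2f} GB")
```

Under these assumptions the raw figure comes out slightly below the listed file size, which is consistent with a file that also stores some data at higher precision, though the teaser does not say what accounts for the difference.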
LocalLLaMA upvoted this because it turns a messy GGUF choice into a measurable tradeoff. The post compares community Qwen3.5-9B quants against a BF16 baseline using mean KLD, then the comments push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
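For readers unfamiliar with the metric, here is a minimal sketch of mean KLD: the KL divergence between the baseline's and the quantized model's next-token distributions, averaged over token positions. The function names and the toy logits are illustrative assumptions, not the post's actual tooling or data.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(baseline_logits, quant_logits):
    """Average KL(baseline || quant) over token positions."""
    klds = [
        kl_divergence(softmax(b), softmax(q))
        for b, q in zip(baseline_logits, quant_logits)
    ]
    return sum(klds) / len(klds)


if __name__ == "__main__":
    # Toy logits for two token positions over a three-token vocabulary.
    baseline = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    quant = [[1.9, 1.1, 0.2], [0.4, 0.7, 2.8]]
    print(f"mean KLD: {mean_kld(baseline, quant):.6f}")
```

A value of zero means the quantized model reproduces the baseline's distribution exactly at every position; larger values mean more drift from the baseline.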
LocalLLaMA reacted with genuine wonder because the demo is simple to grasp: a 1.7B Bonsai model, about 290MB, running in a browser through WebGPU. The same thread also did the useful reality check, asking about tokens per second, hallucinations, llama.cpp support, and whether 1-bit models are ready for anything beyond narrow tasks.
HN liked the ambition but went straight for the weak points: marketplace demand, MDM trust, Mac privacy claims, and whether the operator economics are believable. Darkbloom says idle Apple Silicon can serve OpenAI-compatible private inference at lower cost; the thread treated that as an architecture and incentives problem, not just a landing page.
HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether open-model-only benchmark tables are enough context.
OpenAI’s updated Agents SDK adds a model-native harness and native sandbox execution so agents can inspect files, run commands, edit code, and continue across longer tasks. It is generally available in Python, with support for sandbox providers including Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, and Vercel.
Claude Opus 4.7 is now generally available across Claude products, the API, Amazon Bedrock, Vertex AI, and Microsoft Foundry. Anthropic kept pricing at $5/$25 per million tokens while adding higher-resolution image handling, xhigh effort, and stronger coding-agent behavior.
Synthetic-data training has a sharper safety problem than obvious bad examples. A Nature paper co-authored by Anthropic researchers reports that traits such as owl preference or misalignment can move through semantically unrelated number sequences.