A LocalLLaMA community member completed a 16-node DGX Spark cluster with 200 Gbps networking, optimized for unified-memory LLM inference, and plans to test it with DeepSeek and Kimi models.
LLM
OpenAI has released Symphony, an open-source specification that turns issue trackers like Linear into a control plane for autonomous coding agents. The system assigns a Codex agent per task, handles CI, rebasing, and PR management without human oversight.
LocalLLaMA treated this less as a speed chart and more as a question about completion quality under a messy real prompt. On the same MacBook Pro M5 Max, Qwen 3.6 27B wrote more and faster, but Gemma 4 31B finished the game logic with far fewer tokens.
Why it matters: leaderboard gains are more meaningful when they arrive with a cheaper training bill. Baidu says ERNIE 5.1 Preview ranks #13 globally and #1 among Chinese labs on LMArena Text while using about 6% of the pretraining cost of comparable models.
LocalLLaMA cared less about headline speed than about a Qwen3.6 setup on one RTX 3090 that reached 218K context and stopped crashing on long tool outputs.
LocalLLaMA reacted hard because DeepSeek's visual-primitives idea makes points and boxes part of reasoning itself, and the repo going private only made the thread hotter.
Warp is opening more than source code. The terminal company put its client under AGPL, moved product planning into public GitHub issues, and says nearly 1 million active developers can now steer Oz-powered agent builds in the open.
Why it matters: kernel work is what decides whether long-context and edge-side agent systems stay theoretical or become cheap enough to run. Qwen says FlashQLA delivers 2-3x forward speedup and 2x backward speedup over the FLA Triton kernel on NVIDIA Hopper.
Why it matters: faster models stop feeling fast if orchestration overhead eats the gain. OpenAI says WebSocket mode made agent workflows up to 40% faster end to end, while lifting effective inference speed from about 65 to nearly 1,000 tokens per second.
LocalLLaMA reacted to this post because it brought hard numbers, not vendor marketing: a dual RTX 5060 Ti 16GB setup pushing Qwen3.6 27B to roughly 60 tok/s with a 204k context window.
Hacker News liked the joke, but the real draw was OpenAI showing how a playful reward signal inside the Nerdy personality leaked creature metaphors into GPT-5.x behavior.
NVIDIA is targeting the cost bottleneck in multimodal agents, not just the demo factor. Nemotron 3 Nano Omni claims up to 9x higher throughput, a 256K context window, and six leaderboard wins for document, video, and audio understanding.