A March 31, 2026 Hacker News thread drew attention to Ollama’s new MLX-based Apple Silicon runtime. The announcement combines MLX, NVFP4, and upgraded cache behavior to make local coding-agent workloads on macOS more practical.
OpenAI Developers said recent Codex usage data suggests developers are handing off long-running work like refactors and architecture planning at the end of the day. In a follow-up reply, the account said tasks started at 11 pm are 60% more likely than other tasks to run for 3+ hours.
An r/LocalLLaMA thread about CoPaw-9B drew 142 upvotes and 29 comments. The discussion focused on its positioning as a Qwen3.5-based 9B agent model, its 262,144-token context window, and whether local users would get GGUF or other quantized builds quickly.
Anthropic said on March 30, 2026 that computer use is now available in Claude Code in research preview for Pro and Max plans. Claude Code docs say the feature lets Claude open apps, click through UI flows, and see the screen on macOS from the CLI, targeting native app testing, visual debugging, and other GUI-only tasks.
A March 30, 2026 r/LocalLLaMA post pointed to an experimental ggml backend that sends matrix work to Apple’s Neural Engine. The prototype is not upstream, but it is one of the clearest signs yet that developers are treating ANE as a serious local inference target.
An r/LocalLLaMA post highlighted SentrySearch, a project that uses Qwen3-VL-Embedding to compare text queries directly against raw video. The project avoids transcription and frame captioning while still supporting local search on consumer hardware.
Ollama used a March 30, 2026 preview to move its Apple Silicon path onto MLX. The release pairs higher prefill and decode throughput with NVFP4 support and cache changes aimed at coding and agent workflows.
A well-received r/LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory. The goal is to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
A Hacker News thread pushed attention toward Ahmed Nagdy’s interactive Claude Code guide, which packages slash commands, CLAUDE.md patterns, hooks, skills, MCP, and plugins into browser-based lessons and simulators.
OpenAI Developers said on March 30, 2026 that Perplexity has been running voice experiences with the Realtime API in production and published lessons from that work. The post says Perplexity now handles millions of monthly voice sessions and details how the team changed context chunking, standardized audio formats, and tuned turn-taking for noisy real-world environments.
Google DeepMind said on March 26, 2026 that Gemini 3.1 Flash Live is rolling out in Gemini Live and Google Search Live, while developers can access it through Google AI Studio. Google’s announcement positions 3.1 Flash Live as its highest-quality audio model, with lower latency, improved tonal understanding, and benchmark gains including 90.8% on ComplexFuncBench Audio.
A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.