Shared in LocalLLaMA, autoresearch is a minimal framework where an agent edits PyTorch training code, runs fixed five-minute experiments, and keeps changes that improve validation bits-per-byte.
#llm
RSS FeedGoogle AI Developers has released Android Bench, an official leaderboard for LLMs on Android development tasks. In the first results, Gemini 3.1 Pro ranks first, and Google is also publishing the benchmark, dataset, and test harness.
A well-received HN post highlighted Sarvam AI’s decision to open-source Sarvam 30B and 105B, two reasoning-focused MoE models trained in India under the IndiaAI mission. The announcement matters because it pairs open weights with concrete product deployment, inference optimization, and unusually strong Indian-language benchmarks.
Katana Quant's post, which gained traction on Hacker News, turns a familiar complaint about AI code into a measurable engineering failure. The practical message is straightforward: define acceptance criteria before code generation, not after.
A high-traction Hacker News thread highlighted Simon Willison’s "Agentic Engineering Patterns" guide, which organizes practical workflows for coding agents. The focus is operational discipline: testing-first loops, readable change flow, and reusable prompts.
A LocalLLaMA thread spotlights FlashAttention-4, which reports up to 1605 TFLOPs/s on B200 BF16 and introduces pipeline and memory-layout changes tuned for Blackwell constraints.
A post in r/artificial amplified an Ars Technica report on LLM-driven deanonymization research, including results up to 68% recall and 90% precision across multiple social datasets.
A high-ranking Hacker News thread highlighted a two-sided Qwen story: rapid model quality gains and potential organizational instability. As Qwen 3.5 expands across model sizes, reported leadership departures raise questions about roadmap continuity in the open-weight LLM ecosystem.
A high-signal Hacker News thread surfaced Unsloth’s Qwen3.5 guide, which maps model sizes to bf16 LoRA VRAM budgets and clarifies MoE, vision, and export paths for production workflows.
Alibaba Qwen team released the Qwen 3.5 small model series (0.8B to 9B). Models run in-browser via WebGPU and show dramatic benchmark improvements over previous generations.
Developer Nick Tikhonov shares how he built a voice AI agent achieving ~400ms end-to-end latency with a full STT → LLM → TTS pipeline, including clean barge-ins and no precomputed responses.
A counterintuitive study found that programming AI agents with more assertive, 'rude' conversational behaviors — including interrupting and strategic silence — significantly improved their performance on complex reasoning tasks.