r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.
#coding-agents
LocalLLaMA cared about this eval post because it mixed leaderboard data with lived coding-agent pain: Opus 4.7 scored well, but the author says it felt worse in real use.
Factory raised a $150M Series C at a $1.5B valuation. The signal is that coding agents are being sold as enterprise software-factory infrastructure, with model routing, governance, and cost control moving into the product pitch.
Why it matters: enterprise coding agents are moving from experiments to managed infrastructure. Databricks is grouping coding agents, LLM calls, and MCP integrations behind three controls: governance, budgets, and observability.
Why it matters: Anthropic is pushing Opus toward longer autonomous coding work without raising the premium model price. The linked launch page says Opus 4.7 reaches 70% on CursorBench versus 58% for Opus 4.6, while API pricing stays at $5 per million input tokens and $25 per million output tokens.
Cursor is putting usage data behind the claim that better coding models change the shape of developer work. In a 500-team study, high-complexity tasks rose 68%, while documentation grew 62% and UI/styling only 15%.
HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether open-model-only benchmark tables are enough context.
LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.
GitHub has expanded Copilot cloud agent on GitHub Mobile beyond pull request review. Developers can now ask the agent to research a codebase, draft an implementation plan, edit on a branch, review diffs, and open a pull request from a phone when ready.
Google says coding agents often produce stale Gemini API code because model training data has a cutoff date, and is shipping Docs MCP plus Developer Skills as the fix. Used together, Google reports a 96.3% pass rate with 63% fewer tokens per correct answer than vanilla prompting on its eval set.
A Hacker News discussion focused on SkyPilot's argument that coding agents work better when they read papers and competing implementations before editing code. In the reported llama.cpp experiments, that research-first loop produced 5 viable optimizations and improved TinyLlama text generation by 15% on x86 and 5% on ARM for about $29.
A Launch HN post with around 260 points introduced Freestyle as infrastructure for coding agents, highlighting sub-second VM startup, live forking of running sandboxes, pause-and-resume persistence, built-in git hosting, and full Linux VMs intended for agent platforms rather than lightweight demo containers.