r/LocalLLaMA upvoted this post because the “trust me bro” field report came with real operating conditions: 8-bit quantization, a 64k context, OpenCode, and Android debugging.
#coding-agents
r/LocalLLaMA cared about this eval post because it mixed leaderboard data with lived coding-agent pain: Opus 4.7 scored well, but the author says it felt worse in real use.
Why it matters: enterprise AI coding is moving from individual tools to governed fleets. Databricks says Unity AI Gateway now centralizes controls for Codex, Cursor, Gemini CLI, MCP integrations, budgets, rate limits, and observability.
Factory raised a $150M Series C at a $1.5B valuation. The signal is that coding agents are being sold as enterprise software-factory infrastructure, with model routing, governance, and cost control moving into the product pitch.
Why it matters: Anthropic is pushing Opus toward longer autonomous coding work without raising the premium model price. The linked launch page says Opus 4.7 reaches 70% on CursorBench versus 58% for Opus 4.6, while API pricing stays at $5 per million input tokens and $25 per million output tokens.
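The quoted rates make session cost easy to estimate: input and output tokens are billed at different per-million prices. A minimal sketch, using only the $5/$25-per-million figures from the launch page (the token counts in the example are made up for illustration):

```python
# Quoted Opus 4.7 API rates: $5 per million input tokens,
# $25 per million output tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 25.00

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one API call or session at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A hypothetical long agentic run: 400k tokens in, 80k tokens out.
print(round(session_cost(400_000, 80_000), 2))  # → 4.0
```

Note the asymmetry: at these rates, output tokens cost 5x input tokens, so long autonomous runs that read a lot but write little stay comparatively cheap.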
Cursor is putting usage data behind the claim that better coding models change the shape of developer work. In a 500-team study, high-complexity tasks rose 68%, while documentation grew 62% and UI/styling only 15%.
HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether open-model-only benchmark tables are enough context.
LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.
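For context on the metric: Pass@1 with many samples per task is usually reported via the standard unbiased pass@k estimator (how Terminal Bench Pro aggregates is an assumption here; the numbers in the example are illustrative, not LiteCoder's):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per task, c correct,
    probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the plain success rate c/n:
print(round(pass_at_k(10, 3, 1), 4))  # → 0.3
```

At k=1 the estimator is just the fraction of sampled attempts that pass, which is why Pass@1 is the strictest and most agent-relevant setting: the agent gets one shot.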
GitHub has expanded the Copilot cloud agent on GitHub Mobile beyond pull request review. Developers can now ask the agent to research a codebase, draft an implementation plan, edit on a branch, review diffs, and open a pull request from a phone when ready.
Google says coding agents often produce stale Gemini API code because model training data has a cutoff date, and is shipping Docs MCP plus Developer Skills as the fix. Used together, Google reports a 96.3% pass rate with 63% fewer tokens per correct answer than vanilla prompting on its eval set.
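The efficiency claim combines two factors: more answers are correct, and each correct answer costs fewer tokens. A hedged sketch of the implied "tokens per correct answer" metric (figures below are illustrative, not Google's raw data):

```python
# Tokens per correct answer = total tokens spent / number of correct
# answers, so a higher pass rate and leaner prompts both lower it.
def tokens_per_correct_answer(total_tokens: int,
                              n_tasks: int,
                              pass_rate: float) -> float:
    correct = n_tasks * pass_rate
    return total_tokens / correct

# Hypothetical baseline: 1M tokens over 100 tasks at a 70% pass rate.
baseline = tokens_per_correct_answer(1_000_000, 100, 0.70)
with_docs = baseline * (1 - 0.63)  # the reported 63% reduction
print(round(baseline), round(with_docs))  # → 14286 5286
```

Normalizing by correct answers (rather than per attempt) is the fairer comparison, since a method that retries less and fails less saves tokens on both axes.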
A Hacker News discussion focused on SkyPilot's argument that coding agents work better when they read papers and competing implementations before editing code. In the reported llama.cpp experiments, that research-first loop produced 5 viable optimizations and improved TinyLlama text generation by 15% on x86 and 5% on ARM for about $29.