r/LocalLLaMA Highlights YC-Bench for Long-Horizon Agent Performance
Original: We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.
A widely shared r/LocalLLaMA post stood out because it pointed to an evaluation that is much harder to game than a short benchmark prompt. The post links the YC-Bench paper, a public leaderboard, and the open GitHub repository, framing the project as a test of whether agents can stay strategically coherent over long periods rather than simply produce one impressive answer. That framing is exactly why the thread caught attention in a local-model community that increasingly cares about agent reliability, not just one-shot eloquence.
The benchmark asks an agent to run a simulated startup over a one-year horizon spanning hundreds of turns. It has to manage employees, choose task contracts, maintain profitability, and survive a partially observable market where some clients inflate work requirements after a contract is accepted. According to the paper abstract, the authors evaluated twelve models across three seeds each, and only three models consistently finished above the starting capital of $200K. Claude Opus 4.6 posted the highest average final funds at $1.27M, while GLM-5 reached $1.21M at roughly eleven times lower inference cost.
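To put those headline numbers in proportion, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above; the "profit per unit of relative inference cost" metric is an illustrative assumption of this article, not something the paper defines, and it treats "roughly eleven times" as exactly 11.

```python
# Figures as quoted from the paper abstract in the paragraph above.
STARTING_CAPITAL = 200_000
FINAL_FUNDS = {"Claude Opus 4.6": 1_270_000, "GLM-5": 1_210_000}

# Assumption: Opus cost is normalized to 1.0 and GLM-5 to 1/11,
# since the source only says "roughly eleven times lower".
RELATIVE_COST = {"Claude Opus 4.6": 1.0, "GLM-5": 1.0 / 11}


def return_multiple(final_funds: int, start: int = STARTING_CAPITAL) -> float:
    """Final funds expressed as a multiple of the starting capital."""
    return final_funds / start


def profit_per_cost(name: str) -> float:
    """Hypothetical efficiency metric: profit divided by relative inference cost."""
    return (FINAL_FUNDS[name] - STARTING_CAPITAL) / RELATIVE_COST[name]


multiples = {name: return_multiple(funds) for name, funds in FINAL_FUNDS.items()}
glm_share_of_opus = FINAL_FUNDS["GLM-5"] / FINAL_FUNDS["Claude Opus 4.6"]
efficiency_ratio = profit_per_cost("GLM-5") / profit_per_cost("Claude Opus 4.6")

for name, multiple in multiples.items():
    print(f"{name}: {multiple:.2f}x starting capital")
print(f"GLM-5 final funds as a share of Opus: {glm_share_of_opus:.1%}")
print(f"GLM-5 profit per unit cost vs Opus: {efficiency_ratio:.1f}x")
```

On these figures both models end the year at a bit more than six times their starting capital, and GLM-5 lands within about five percent of Opus on final funds. The cost-adjusted comparison is only as good as the "roughly eleven times" estimate it rests on.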
The most useful result is not just the ranking but the failure analysis. The paper says scratchpad usage, which is the main way to preserve information once context is truncated, is the strongest predictor of success. It also identifies failure to detect adversarial clients as the main failure mode, accounting for 47% of bankruptcies. In other words, long-horizon agent quality looks less like a trivia contest and more like a test of memory discipline, strategic consistency, and willingness to update plans under delayed feedback. That is a better match for real operational workloads than most short-form evals.
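The scratchpad finding is easier to see in a toy example. The Python sketch below is not YC-Bench's implementation; the class name `ScratchpadAgent`, its methods, and the client name "Acme" are all hypothetical. It only illustrates the general pattern the paper describes: recent turns live in a bounded context that drops old entries, while notes written to a scratchpad persist and get re-injected into every prompt.

```python
from collections import deque


class ScratchpadAgent:
    """Toy agent state: a bounded context window plus a persistent scratchpad."""

    def __init__(self, context_limit: int) -> None:
        # A deque with maxlen silently discards the oldest turn when full,
        # standing in for context truncation.
        self.context: deque = deque(maxlen=context_limit)
        self.scratchpad: dict[str, str] = {}

    def observe(self, turn: int, event: str) -> None:
        """Record an environment event in the (truncating) context."""
        self.context.append((turn, event))

    def note(self, key: str, value: str) -> None:
        """Persist a fact that should survive truncation."""
        self.scratchpad[key] = value

    def build_prompt(self) -> str:
        """Combine persistent notes with whatever recent turns remain."""
        notes = "\n".join(f"- {k}: {v}" for k, v in sorted(self.scratchpad.items()))
        recent = "\n".join(f"[turn {t}] {e}" for t, e in self.context)
        return f"NOTES:\n{notes}\n\nRECENT:\n{recent}"


agent = ScratchpadAgent(context_limit=3)
agent.observe(1, "Client Acme inflated scope after contract was accepted")
agent.note("client:Acme", "inflates work after acceptance; avoid")

# Later turns push the original observation out of the context window.
for turn in range(2, 7):
    agent.observe(turn, f"routine payroll and task update {turn}")

prompt = agent.build_prompt()
print(prompt)
```

By turn 6 the original observation about the client has fallen out of the three-turn window, but the scratchpad note is still in the prompt. An agent that never wrote the note would have no way to recognize the same client later, which is consistent with the paper linking scratchpad use to survival.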
The Reddit discussion matters because it captures a shift in what advanced users want from benchmarks. Communities like LocalLLaMA are no longer satisfied with static reasoning scores alone; they want evidence that a model can handle an evolving environment, remember what it learned, and avoid self-inflicted drift. YC-Bench is still a simulation, but it pushes evaluation closer to the kinds of long-running workflows that agent builders actually care about.