Hacker News Sees GLM-5.1 Push Further Into Long-Horizon Agentic Engineering
Original: GLM-5.1: Towards Long-Horizon Tasks
A Hacker News thread surfaced GLM-5.1 as Z.ai's new flagship for agentic engineering. The company positions it as a long-horizon model rather than a one-shot benchmark climber, and the numbers it published reflect that framing. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and 68.7 on CyberGym, which the company says puts the model ahead of GLM-5 on all four benchmarks and makes it competitive with current frontier coding models.
The more interesting part of the post is how Z.ai evaluates persistence. On a VectorDBBench setup, GLM-5.1 kept optimizing through 600+ iterations and 6,000+ tool calls, eventually reaching 21.5k QPS. Z.ai says that is roughly 6x the best result it had seen in a single 50-turn session. The blog highlights two structural jumps along the way: a move to IVF cluster probing with f16 compression around iteration 90, and a later two-stage pipeline with u8 prescoring plus f16 reranking around iteration 240.
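To make the second structural jump concrete, here is a minimal sketch of a two-stage search that prescores every vector on coarse u8 codes and then reranks a shortlist at f16 precision. It is an illustration of the general technique only, not Z.ai's or GLM-5.1's actual code: the function names, the [-1, 1] value range, the shortlist size, and the use of squared L2 distance are all assumptions, and the IVF cluster probing step is left out for brevity.

```python
import random
import struct


def to_f16(x):
    """Round a float to IEEE half precision and back, simulating f16 storage."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_u8(vec, lo=-1.0, hi=1.0):
    """Map each component from [lo, hi] onto an integer code in 0..255."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in vec]


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, k=3, shortlist=20):
    """Return the k nearest (index, distance) pairs via prescore + rerank."""
    # Stage 1: cheap prescoring on u8 codes.
    # A real index would precompute the codes once at build time.
    q8 = quantize_u8(query)
    codes = [quantize_u8(v) for v in vectors]
    candidates = sorted(
        range(len(vectors)), key=lambda i: l2_squared(q8, codes[i])
    )[:shortlist]

    # Stage 2: rerank only the shortlist at f16 precision.
    q16 = [to_f16(x) for x in query]
    scored = [
        (i, l2_squared(q16, [to_f16(x) for x in vectors[i]]))
        for i in candidates
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical data: 200 random 16-dimensional vectors in [-1, 1].
random.seed(0)
vectors = [[random.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(200)]
results = two_stage_search(vectors[7], vectors)
print(results)
```

Because the query here is one of the stored vectors, index 7 comes back first with a distance of zero. The design point is that the expensive, more precise comparison runs on only `shortlist` candidates instead of the whole collection, which is the usual reason to split a search into a coarse pass and a rerank.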
Long-horizon behavior, not just first-pass scores
Z.ai also used KernelBench Level 3 to compare how long models keep making useful progress on GPU-kernel work. In that setting, the post says GLM-5.1 reached a 3.6x geometric-mean speedup across 50 problems, staying productive longer than GLM-5, while Claude Opus 4.6 still finished ahead at 4.2x. The company then pushed the model into a much less structured task: building a Linux-style desktop in the browser over an 8-hour self-improvement loop. According to the blog, earlier GLM versions tended to stop after a taskbar and a few placeholder windows, but GLM-5.1 kept adding a file browser, terminal, text editor, system monitor, calculator, and games while refining the UI.
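For readers unfamiliar with the KernelBench metric, the sketch below shows how a geometric-mean speedup is computed from per-problem speedup ratios. The function name and the sample values are hypothetical and are not figures from the Z.ai post; how KernelBench itself aggregates results may differ in detail.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios (each must be > 0)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average in log space, then map back with exp.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-problem speedups, not numbers from the Z.ai post.
example = [1.0, 16.0]
print(geometric_mean_speedup(example))  # close to 4.0
print(sum(example) / len(example))      # arithmetic mean: 8.5
```

The geometric mean is the conventional choice for averaging ratios because one outsized win on a single problem moves it far less than it would move an arithmetic mean.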
That framing fits the HN reaction. The real claim is not that GLM-5.1 wins every benchmark, because it does not. The claim is that Z.ai is trying to optimize for models that stay useful after the obvious fixes run out, where repeated experimentation, self-evaluation, and tool use matter more than a strong first draft. If that holds up outside vendor-authored evaluations, GLM-5.1 looks less like a routine model refresh and more like a bet on where coding agents are headed next.