Hacker News Sees GLM-5.1 Push Further Into Long-Horizon Agentic Engineering
Original: GLM-5.1: Towards Long-Horizon Tasks
A Hacker News thread surfaced GLM-5.1 as Z.ai's new flagship for agentic engineering. The company positions it as a long-horizon model rather than a one-shot benchmark climber, and the numbers it published reflect that framing. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and 68.7 on CyberGym, putting the model ahead of GLM-5 across all four tasks and competitive with current frontier coding models.
The more interesting part of the post is how Z.ai evaluates persistence. On a VectorDBBench setup, GLM-5.1 kept optimizing through 600+ iterations and 6,000+ tool calls, eventually reaching 21.5k QPS. Z.ai says that is roughly 6x the best result it had seen in a single 50-turn session. The blog highlights two structural jumps along the way: a move to IVF cluster probing with f16 compression around iteration 90, and a later two-stage pipeline with u8 prescoring plus f16 reranking around iteration 240.
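The two-stage pattern the post describes, a cheap low-precision prescoring pass followed by a higher-precision rerank over the survivors, is a standard trick in vector search. Below is a minimal numpy sketch of that general pattern; every name and parameter here is illustrative, not Z.ai's or VectorDBBench's actual implementation.

```python
import numpy as np

def quantize_u8(vecs):
    """Affine-quantize f32 vectors to u8 over the dataset's value range."""
    lo, hi = float(vecs.min()), float(vecs.max())
    scale = (hi - lo) / 255.0
    q = np.round((vecs - lo) / scale).astype(np.uint8)
    return q, scale, lo

def two_stage_search(db_f32, query, prescore_k=32, final_k=5):
    """Return indices of the top final_k vectors by inner product.

    Stage 1 scores every vector cheaply via its u8 quantization;
    stage 2 reranks only the prescore_k survivors in f16.
    """
    db_u8, scale, lo = quantize_u8(db_f32)
    # Stage 1: approximate scores from the dequantized u8 vectors.
    approx_db = db_u8.astype(np.float32) * scale + lo
    approx_scores = approx_db @ query
    candidates = np.argpartition(-approx_scores, prescore_k)[:prescore_k]
    # Stage 2: rerank survivors with a higher-precision f16 dot product.
    exact = db_f32[candidates].astype(np.float16) @ query.astype(np.float16)
    order = np.argsort(-exact.astype(np.float32))[:final_k]
    return candidates[order]
```

The win comes from stage 1 touching 4x less memory per vector than f32, while stage 2 restores ranking precision only where it matters.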
Long-horizon behavior, not just first-pass scores
Z.ai also used KernelBench Level 3 to compare how long models keep making useful progress on GPU-kernel work. In that setting, the post says GLM-5.1 reached 3.6x geometric-mean speedup across 50 problems, staying productive longer than GLM-5, while Claude Opus 4.6 still finished ahead at 4.2x. The company then pushed the model into a much less structured task: building a Linux-style desktop in the browser over an 8-hour self-improvement loop. According to the blog, earlier GLM versions tend to stop after a taskbar and a few placeholder windows, but GLM-5.1 kept adding a file browser, terminal, text editor, system monitor, calculator, and games while refining the UI.
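A geometric mean is the conventional way to aggregate per-problem speedups, since speedups are ratios and compose multiplicatively; a single outlier kernel cannot dominate the average the way it would under an arithmetic mean. A quick sketch of the computation (not Z.ai's evaluation code):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# A 4x win plus a 1x no-op averages to 2x, not the arithmetic 2.5x.
print(geomean_speedup([4.0, 1.0]))  # → 2.0
```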
That framing fits the HN reaction. The real claim is not that GLM-5.1 wins every benchmark, because it does not. The claim is that Z.ai is trying to optimize for models that stay useful after the obvious fixes run out, where repeated experimentation, self-evaluation, and tool use matter more than a strong first draft. If that holds up outside vendor-authored evaluations, GLM-5.1 looks less like a routine model refresh and more like a bet on where coding agents are headed next.
Related Articles
Cursor has published a technical report for Composer 2, outlining a two-stage recipe of continued pretraining and large-scale reinforcement learning for agentic software engineering. The company says the model reaches 61.3 on CursorBench, 61.7 on Terminal-Bench, and 73.7 on SWE-bench Multilingual while keeping pricing at $0.50/M input and $2.50/M output tokens.
In a March 29, 2026 X post, OpenAI Developers introduced Codex Security, a research preview aimed at identifying, validating, and remediating software vulnerabilities. The launch extends AI coding assistance into application security workflows.
A popular `r/LocalLLaMA` post highlighted YC-Bench, an evaluation where models run a simulated startup for a year under delayed feedback and adversarial clients. The benchmark's standout result is that only three of twelve tested models consistently beat the starting capital, with GLM-5 coming close to Claude Opus 4.6 at far lower cost.