GLM 5.2 hits 64% on Vibe Code Bench as open weights close in
GLM 5.2 has crossed a notable line for open-weight coding models on a benchmark that tests building web applications from scratch. In a post on X, Vals AI wrote that “GLM 5.2 is the only open-weight model to break 60%” on Vibe Code Bench v1.1, with a reported score of 64%.
The number matters because the gap is not marginal. Vals AI said no other open-weight model on the board reaches 50%, which puts GLM 5.2 more than 14 percentage points ahead of the next open-weight entry. That makes the result less about a single leaderboard win and more about whether open models are becoming viable for real app-building workflows that previously leaned on closed frontier systems.
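The size of that lead follows from simple arithmetic on the two figures Vals AI gave. Here is a minimal Python sketch. The function and variable names are illustrative, and 50% is a ceiling on the runner-up's score, not its actual score, so the result is a lower bound on the lead.

```python
def min_lead(leader_score: float, runner_up_ceiling: float) -> float:
    """Smallest possible lead, in percentage points, when the runner-up
    is only known to score below runner_up_ceiling."""
    return leader_score - runner_up_ceiling


# Figures from the Vals AI post: 64% for GLM 5.2, and no other
# open-weight model reaching 50%.
lead = min_lead(64.0, 50.0)
print(f"GLM 5.2 leads by more than {lead:.0f} percentage points")
```

The true gap depends on the runner-up's exact score, which the post as quoted here does not give.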
Vals AI describes itself as a public LLM evaluation group and typically posts benchmark comparisons rather than general product marketing. The post follows a broader wave of attention around Z.ai’s GLM 5.2, a model positioned around long-context coding and agentic engineering tasks. Vibe Code Bench is especially relevant because it focuses on the end-to-end ability to produce web applications, not only on solving isolated programming questions.
The next thing to watch is repeatability. A 64% score is meaningful only if it holds across different prompts, app types, scaffolds, and evaluation settings. Developers will also care about serving cost, latency, tool compatibility, and whether the model’s advantage translates into fewer manual fixes. If the open-weight field follows GLM 5.2 past the 50% mark, the economics of coding agents could shift quickly.
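One way to make the repeatability question concrete is to re-run the benchmark under several settings and check whether the score stays above a chosen bar. The sketch below assumes that setup. The helper names and the setting labels are hypothetical, and the numbers are placeholders for illustration only, not measured results.

```python
def holds_up(scores_by_setting: dict[str, float], threshold: float) -> bool:
    """True if the score stays at or above threshold in every setting."""
    return all(score >= threshold for score in scores_by_setting.values())


def spread(scores_by_setting: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst setting."""
    values = list(scores_by_setting.values())
    return max(values) - min(values)


# Placeholder values, NOT real benchmark results.
example_runs = {
    "default scaffold": 64.0,
    "alternate scaffold": 61.0,
    "different prompt set": 58.0,
}

print("Stays above 60% everywhere:", holds_up(example_runs, 60.0))
print("Spread across settings:", spread(example_runs), "points")
```

A wide spread, or a drop below the bar in any one setting, would suggest the headline figure depends on the evaluation setup more than on the model.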