GLM 5.2 tops Claude Code in Semgrep security benchmark
Original: GLM 5.2 beats Claude in our benchmarks
Semgrep’s latest security benchmark puts Zhipu AI’s GLM 5.2 ahead of Claude Code at detecting insecure direct object reference (IDOR) vulnerabilities. On the same dataset and with the same prompt-only setup, GLM 5.2 reached 39% F1, while Claude Code scored 32%. Semgrep also estimated the cost of the GLM 5.2 run at roughly $0.17 per vulnerability found.
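For readers less familiar with the two headline metrics, here is a minimal sketch of how an F1 score and a cost-per-finding figure are typically computed. The function names and every count below are made up for illustration; they are not Semgrep's data, and the sketch assumes "vulnerability found" means a true positive, which the post may define differently.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cost_per_finding(total_cost_usd: float, tp: int) -> float:
    """Total run cost divided by true positives (an assumed definition)."""
    if tp == 0:
        raise ValueError("no true positives, so cost per finding is undefined")
    return total_cost_usd / tp


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
tp, fp, fn = 30, 40, 50
print(f"F1: {f1_score(tp, fp, fn):.2f}")
print(f"Cost per finding: ${cost_per_finding(6.00, tp):.2f}")
```

Because F1 penalizes both missed bugs (false negatives) and noisy reports (false positives), a model cannot raise its score simply by flagging more code.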
The result is not a claim that open models have solved application security. Semgrep’s own multimodal pipeline still scored higher at 53-61% F1. That gap matters because the pipeline is not just a raw model call; it combines model reasoning with static-analysis signals, rules, and a security-specific workflow.
What makes the post interesting is where the frontier moved. Security bug discovery has been a difficult area for smaller or open-weight models because it needs repository context, control-flow reasoning, and enough restraint to avoid false positives. GLM 5.2 doing well in a prompt-only setting gives teams a reason to test open models for internal code review and triage work, especially where data control and inference cost matter.
The Hacker News discussion quickly shifted from the leaderboard to deployment reality. Some commenters described GLM 5.2 as a useful daily coding model; others asked what hardware can realistically serve a model of this size. That tension is the story: GLM 5.2 did not replace a purpose-built security system, but it did make the open-weight option harder to dismiss.