Snyk’s 300-run test exposes unstable LLM security-review queues

The same code and prompt can produce a materially different security queue. That is the practical result from Snyk VulnBench JS 1.0, published June 29, 2026. The benchmark does not try to crown a universal winner. It asks a narrower question that matters for coding agents and CI: if an LLM reviews the same vulnerable JavaScript project five times, does it report the same bugs?

The setup used 10 JavaScript and Express fixture projects with 44 Snyk Code reference findings. Six configurations were evaluated: Snyk Code SAST and five Claude model setups through a Claude Code harness. Each configuration ran each task five times, producing 10 tasks × 6 configurations × 5 repetitions, or 300 total runs. The models could inspect the project files, but they could not read the reference findings file.

The best LLM configuration was Claude Opus 4.6 Medium, at 75.4% Snyk-reference F1, 68.0% recall, and 91.5% precision. Snyk Code SAST reproduced its own reference set at 100.0% F1 with 0.0 percentage-point standard deviation. Snyk is careful about what that means: the reference set is not an independent claim of perfect vulnerability coverage. It is a deterministic baseline for measuring agreement, variance, and where model behavior diverges.
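To make the metrics concrete, here is a minimal sketch of scoring one run against a fixed reference set. It assumes findings can be reduced to comparable string signatures and that a match means exact equality. Both are illustrative assumptions, as are the names `Score` and `score_against_reference`; the benchmark's actual matching rule is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def score_against_reference(reported: set[str], reference: set[str]) -> Score:
    """Score one run's findings against a fixed reference set.

    Assumption: a finding is a hashable signature and matching is exact.
    """
    matched = reported & reference
    # Precision: share of reported findings that are in the reference set.
    precision = len(matched) / len(reported) if reported else 0.0
    # Recall: share of reference findings that the run reported.
    recall = len(matched) / len(reference) if reference else 0.0
    denom = precision + recall
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(precision, recall, f1)


# Hypothetical signatures, for illustration only.
reference = {"sqli:routes/users.js", "ssrf:routes/fetch.js",
             "redos:lib/validate.js", "path-traversal:routes/files.js"}
reported = {"sqli:routes/users.js", "ssrf:routes/fetch.js",
            "xss:views/profile.js"}
example = score_against_reference(reported, reference)
```

Under this scoring, a tool that generated the reference set will reproduce it exactly on every run, which is why a 100.0% score for Snyk Code SAST measures agreement rather than coverage.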

The repeatability gap showed up most clearly in unmatched model findings. Across all model configurations, 80 of 161 unique unmatched finding signatures (49.7%) appeared in only one of five repeated runs. By contrast, when models matched a Snyk Code reference finding, the behavior was much steadier: 134 of 158 unique reference-matched findings (84.8%) appeared in all five repetitions. Known vulnerability shapes tended to recur; extra model-only reports were much less stable.
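A repeatability measurement of this kind can be sketched as follows. The code counts how many of the repeated runs each finding signature appeared in, then reports the share seen only once and the share seen every time. The signature strings and function names are hypothetical, and the sketch assumes each run is already reduced to a set of signatures.

```python
from collections import Counter


def run_frequency(runs: list[set[str]]) -> Counter:
    """Count how many repeated runs each finding signature appeared in."""
    counts: Counter = Counter()
    for findings in runs:
        # Each run is a set, so a signature counts at most once per run.
        counts.update(findings)
    return counts


def stability_summary(runs: list[set[str]]) -> dict[str, float]:
    """Share of unique signatures seen in exactly one run and in all runs."""
    counts = run_frequency(runs)
    total = len(counts)
    if total == 0:
        return {"unique": 0, "single_run_share": 0.0, "all_runs_share": 0.0}
    single = sum(1 for n in counts.values() if n == 1)
    every = sum(1 for n in counts.values() if n == len(runs))
    return {
        "unique": total,
        "single_run_share": single / total,
        "all_runs_share": every / total,
    }


# Five hypothetical repetitions of the same review task.
runs = [
    {"sqli:app.js", "xss:views.js"},
    {"sqli:app.js"},
    {"sqli:app.js", "redos:util.js"},
    {"sqli:app.js"},
    {"sqli:app.js"},
]
summary = stability_summary(runs)

# The article's reported shares, recomputed from the published counts.
unmatched_single_run_pct = round(80 / 161 * 100, 1)
matched_all_runs_pct = round(134 / 158 * 100, 1)
```

In the toy data, one signature recurs in all five runs and two appear once each, which mirrors the pattern reported above on a much smaller scale.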

Cost did not map cleanly to coverage. Claude Opus 4.7 Max averaged 95,969 tokens and $0.3559 per session, but scored 68.8% Snyk-reference F1. Claude Opus 4.6 Medium averaged 51,574 tokens and $0.0628 per session while scoring 75.4%. On small fixtures, those dollar amounts are modest. At pull-request and CI scale, repeated review cost and triage churn become product constraints.
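The arithmetic behind that comparison can be sketched as below. The per-session figures come from the numbers above; the monthly review volume and the `monthly_review_cost` helper are hypothetical, chosen only to show how per-session cost scales with volume and repeated runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigStats:
    name: str
    avg_tokens: int
    avg_cost_usd: float
    f1_pct: float


# Per-session averages and Snyk-reference F1 as reported in the article.
OPUS_47_MAX = ConfigStats("Claude Opus 4.7 Max", 95_969, 0.3559, 68.8)
OPUS_46_MEDIUM = ConfigStats("Claude Opus 4.6 Medium", 51_574, 0.0628, 75.4)


def monthly_review_cost(stats: ConfigStats, reviews_per_month: int,
                        repetitions: int = 1) -> float:
    """Projected spend, assuming cost scales linearly with review count.

    `reviews_per_month` and `repetitions` are illustrative inputs, not
    figures from the benchmark.
    """
    return stats.avg_cost_usd * reviews_per_month * repetitions


# How much more the costlier configuration spends per session.
cost_ratio = OPUS_47_MAX.avg_cost_usd / OPUS_46_MEDIUM.avg_cost_usd
token_ratio = OPUS_47_MAX.avg_tokens / OPUS_46_MEDIUM.avg_tokens

# Hypothetical volume: 10,000 reviews per month, one run each.
medium_monthly = monthly_review_cost(OPUS_46_MEDIUM, 10_000)
max_monthly = monthly_review_cost(OPUS_47_MAX, 10_000)
```

On these figures the costlier configuration uses roughly 1.9 times the tokens and about 5.7 times the dollars per session while scoring lower. Repeating each review to smooth out run-to-run variance would multiply either bill by the number of repetitions.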

The useful takeaway is not a verdict on LLM versus SAST. Snyk’s data shows that the two approaches fail in different ways. Models were strong on familiar exploit shapes such as command injection, hardcoded credentials, SQL injection, SSRF, open redirect, prototype pollution, and ReDoS. They were weaker on systematic classes such as repeated path traversal flows, resource-limit findings, improper sanitization, type validation, insecure transport, and framework information exposure. The next steps for the benchmark are larger fixtures, independent ground truth, and combined LLM+SAST workflows.
