Snyk’s 300-run test exposes unstable LLM security-review queues
Original: Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?
The same code and prompt can produce a materially different security queue. That is the practical result from Snyk VulnBench JS 1.0, published June 29, 2026. The benchmark does not try to crown a universal winner. It asks a narrower question that matters for coding agents and CI: if an LLM reviews the same vulnerable JavaScript project five times, does it report the same bugs?
The setup used 10 JavaScript and Express fixture projects with 44 Snyk Code reference findings. Six configurations were evaluated: Snyk Code SAST and five Claude model setups through a Claude Code harness. Each configuration ran each task five times, producing 10 tasks x 6 configurations x 5 repetitions, or 300 total runs. The models could inspect the project files, but they could not read the reference findings file.
The best LLM configuration was Claude Opus 4.6 Medium, at 75.4% Snyk-reference F1, 68.0% recall, and 91.5% precision. Snyk Code SAST reproduced its own reference set at 100.0% F1 with 0.0 percentage-point standard deviation. Snyk is careful about what that means: the reference set is not an independent claim of perfect vulnerability coverage. It is a deterministic baseline for measuring agreement, variance, and where model behavior diverges.
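For readers unfamiliar with the metrics, here is a minimal sketch of how precision, recall, and F1 can be computed for one run against a reference set. The signature format and the example findings are hypothetical, and the source does not describe how Snyk matched findings or aggregated scores across runs, so this illustrates the definitions rather than the benchmark's scoring code.

```python
def score(reported: set[str], reference: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of one run against a reference set."""
    matched = reported & reference
    precision = len(matched) / len(reported) if reported else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical finding signatures, not taken from the benchmark.
reference = {
    "app.js:12:sqli",
    "app.js:40:cmdi",
    "routes.js:7:path-traversal",
    "util.js:3:redos",
}
reported = {"app.js:12:sqli", "app.js:40:cmdi", "auth.js:19:weak-hash"}

result = score(reported, reference)
baseline = score(reference, reference)  # a tool scored against its own output
print(result)
print(baseline)
```

A deterministic tool scored against its own output gets 1.0 on every metric, which is why the 100.0% figure for Snyk Code reads as an agreement baseline rather than a coverage claim.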
The repeatability gap showed up most clearly in unmatched model findings. Across all model configurations, 80 of 161 unique unmatched finding signatures (49.7%) appeared in only one of the five repeated runs. By contrast, when models matched a Snyk Code reference finding, the behavior was much steadier: 134 of 158 unique reference-matched findings (84.8%) appeared in all five repetitions. Known vulnerability shapes tended to recur; extra model-only reports were much less stable.
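A repeatability check of this kind can be sketched as counting how many of the repeated runs report each finding signature. The run data and signature strings below are hypothetical, and how the benchmark actually builds and matches signatures is not described in the source; only the two percentage calculations at the end use the article's numbers.

```python
from collections import Counter


def stability(runs: list[set[str]]) -> dict[str, int]:
    """Summarize how consistently each finding signature recurs across runs."""
    counts: Counter[str] = Counter()
    for findings in runs:
        counts.update(findings)  # each run is a set, so it adds at most 1 per signature
    return {
        "unique": len(counts),
        "once": sum(1 for n in counts.values() if n == 1),
        "always": sum(1 for n in counts.values() if n == len(runs)),
    }


# Five hypothetical runs of the same review on the same project.
runs = [
    {"sqli", "cmdi", "weak-hash"},
    {"sqli", "cmdi"},
    {"sqli", "cmdi", "missing-rate-limit"},
    {"sqli", "cmdi"},
    {"sqli", "cmdi", "weak-hash"},
]
summary = stability(runs)
print(summary)

# The article's figures, recomputed as percentages.
unmatched_once_pct = round(100 * 80 / 161, 1)
matched_always_pct = round(100 * 134 / 158, 1)
print(unmatched_once_pct, matched_always_pct)
```

In a CI setting, the "once" bucket is the part of the queue that would change between otherwise identical reviews.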
Cost did not map cleanly to coverage. Claude Opus 4.7 Max averaged 95,969 tokens and $0.3559 per session, but scored 68.8% Snyk-reference F1. Claude Opus 4.6 Medium averaged 51,574 tokens and $0.0628 per session while scoring 75.4%. On small fixtures, those dollar amounts are modest. At pull-request and CI scale, repeated review cost and triage churn become product constraints.
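To make the scale argument concrete, the following sketch compares the two configurations using the per-session figures above. The review volume and repetition count are hypothetical, and it assumes per-session cost on small fixtures carries over unchanged to real pull requests, which is unlikely to hold exactly.

```python
# Per-session averages and F1 scores as reported in the article.
sessions = {
    "Claude Opus 4.7 Max": {"tokens": 95_969, "usd": 0.3559, "f1": 68.8},
    "Claude Opus 4.6 Medium": {"tokens": 51_574, "usd": 0.0628, "f1": 75.4},
}

# Hypothetical workload: not from the benchmark.
REVIEWS_PER_MONTH = 10_000
REPETITIONS = 5  # repeated runs per review, e.g. to filter unstable findings


def monthly_cost(usd_per_session: float) -> float:
    """Projected monthly spend, assuming fixture-sized sessions."""
    return usd_per_session * REVIEWS_PER_MONTH * REPETITIONS


projected = {name: monthly_cost(s["usd"]) for name, s in sessions.items()}
cost_ratio = (
    sessions["Claude Opus 4.7 Max"]["usd"]
    / sessions["Claude Opus 4.6 Medium"]["usd"]
)

for name, usd in projected.items():
    print(f"{name}: ${usd:,.2f} per month at F1 {sessions[name]['f1']}%")
print(f"cost ratio: {cost_ratio:.2f}x")
```

Under these assumptions the more expensive configuration costs several times more per month while scoring lower against the Snyk reference set, which is the sense in which cost did not track coverage.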
The useful takeaway is not LLM versus SAST. Snyk’s data shows that the two approaches fail in different ways. Models were strong on familiar exploit shapes such as command injection, hardcoded credentials, SQL injection, SSRF, open redirect, prototype pollution, and ReDoS. They were weaker on systematic classes such as repeated path traversal flows, resource-limit findings, improper sanitization, type validation, insecure transport, and framework information exposure. The next benchmark steps are larger fixtures, independent ground truth, and combined LLM+SAST workflows.