OpenAI and Paradigm introduce EVMbench for smart contract security testing
Original: Introducing EVMbench
Why EVMbench was introduced
OpenAI and Paradigm announced EVMbench to measure AI agent capabilities in a high-stakes cybersecurity domain: Ethereum Virtual Machine smart contracts. OpenAI argues this area is economically meaningful because smart contracts secure more than $100B in open-source crypto assets, and improvements in agentic coding systems can benefit both defenders and attackers.
The benchmark includes 120 curated high-severity vulnerabilities from 40 audits, mostly sourced from open audit competitions, with additional scenarios inspired by security work on Tempo, a payment-oriented blockchain. This mix is intended to represent realistic exploit and remediation patterns while keeping evaluation reproducible.
How the benchmark works
EVMbench evaluates agents in three modes. In detect, agents audit repositories and are scored on recall of known vulnerabilities. In patch, agents must eliminate exploitability while preserving intended functionality, validated with automated tests. In exploit, agents execute end-to-end fund-draining attacks in a sandboxed blockchain environment.
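Detect mode's scoring reduces to recall against the human-labeled vulnerability set. A minimal sketch of that metric, purely illustrative (the function name and string-ID matching are assumptions; EVMbench's actual adjudication of free-form findings against labels is more involved):

```python
def detect_recall(labeled: set[str], reported: set[str]) -> float:
    """Fraction of known (human-labeled) vulnerabilities the agent reported.

    Hypothetical sketch of a recall-style detect score; identifiers here
    stand in for whatever canonical finding IDs a real harness would use.
    """
    if not labeled:
        return 0.0
    return len(labeled & reported) / len(labeled)

# Example: three labeled issues; the agent finds two of them plus one
# unlabeled finding (which recall, by definition, neither rewards nor penalizes).
labeled = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
reported = {"reentrancy-withdraw", "oracle-staleness", "novel-frontrun"}
score = detect_recall(labeled, reported)  # 2 of 3 labeled issues found
```

Note that recall says nothing about precision, which is one reason novel findings beyond the labeled set are hard to adjudicate automatically.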
OpenAI built a Rust-based harness to keep grading deterministic, replay transactions, and restrict unsafe RPC behavior. Exploit tasks run in an isolated local Anvil chain rather than live networks, and tasks are based on historical publicly documented vulnerabilities.
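After replaying the agent's transactions on the isolated local chain, a deterministic grader can reduce "fund-draining" success to a balance-delta check. A hypothetical sketch of such a predicate (the function, account labels, and threshold are assumptions, not OpenAI's harness):

```python
def exploit_succeeded(
    pre_balances: dict[str, int],
    post_balances: dict[str, int],
    victim: str,
    attacker: str,
    min_drain_wei: int = 1,
) -> bool:
    """Hypothetical exploit-task success check: after replaying the agent's
    transactions, the victim contract lost funds and the attacker gained at
    least min_drain_wei. Balances are in wei, as read from the local chain."""
    drained = pre_balances[victim] - post_balances[victim]
    gained = post_balances[attacker] - pre_balances[attacker]
    return drained >= min_drain_wei and gained >= min_drain_wei
```

Keeping the check a pure function of recorded pre/post state is what makes grading replayable: the same transaction trace always yields the same verdict.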
Results and what they imply
OpenAI reports GPT-5.3-Codex reached 72.2% in exploit mode, compared with 31.9% for GPT-5, released roughly six months earlier. The company says detect recall and patch success remain well below full coverage, indicating that exhaustive auditing and safe remediation are still difficult for current agents even as exploit execution improves quickly.
This asymmetry matters for risk management. If exploit competence rises faster than reliable patching, the defensive ecosystem needs stronger automation, faster review pipelines, and better evaluation standards.
Limitations and defensive commitments
OpenAI notes several limitations: the dataset is drawn largely from Code4rena-style competitions, detect mode cannot reliably adjudicate novel findings beyond human-labeled issues, and exploit evaluation excludes some timing-sensitive or multi-chain conditions.
Alongside the release, OpenAI described an evidence-based cyber safety approach that combines model safeguards, monitoring, trusted access controls, and enforcement workflows. It also committed an additional $10M in API credits through its Cybersecurity Grant Program to support good-faith defensive research, especially for open source software and critical infrastructure.
Related Articles
OpenAIDevs pointed developers to Codex Security on March 29, 2026, positioning it as a way to find, validate, and remediate likely vulnerabilities in connected GitHub repositories. OpenAI's docs say the system scans commit by commit, uses repo-specific threat models, validates high-signal findings in an isolated environment, and can move reviewed findings toward GitHub pull requests.
This is a distribution story, not just a usage milestone. OpenAI says Codex grew from more than 3 million weekly developers in early April to more than 4 million two weeks later, and it is pairing that demand with Codex Labs plus seven global systems integrators to turn pilots into production rollouts.
A r/LocalLLaMA benchmark compared 21 local coding models on HumanEval+, speed, and memory, putting Qwen 3.6 35B-A3B on top while surfacing practical RAM and tok/s trade-offs.