OpenAI and Paradigm introduce EVMbench for smart contract security testing
Original: Introducing EVMbench
Why EVMbench was introduced
OpenAI and Paradigm announced EVMbench to measure AI agent capabilities in a high-stakes cybersecurity domain: Ethereum Virtual Machine smart contracts. OpenAI argues this area is economically meaningful because smart contracts secure more than $100B in open-source crypto assets, and improvements in agentic coding systems can benefit both defenders and attackers.
The benchmark includes 120 curated high-severity vulnerabilities from 40 audits, mostly sourced from open audit competitions, with additional scenarios inspired by security work on Tempo, a payment-oriented blockchain. This mix is intended to represent realistic exploit and remediation patterns while keeping evaluation reproducible.
How the benchmark works
EVMbench evaluates agents in three modes. In detect, agents audit repositories and are scored on recall of known vulnerabilities. In patch, agents must eliminate exploitability while preserving intended functionality, validated with automated tests. In exploit, agents execute end-to-end fund-draining attacks in a sandboxed blockchain environment.
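Detect mode's recall scoring can be illustrated with a minimal sketch. This is not the actual EVMbench grader, just a toy recall calculation over human-labeled findings; the vulnerability IDs are hypothetical.

```python
# Toy sketch of detect-mode scoring: agents are graded on recall of
# known (human-labeled) vulnerabilities. All finding IDs are illustrative.

def detect_recall(known_vulns: set[str], agent_findings: set[str]) -> float:
    """Fraction of labeled vulnerabilities the agent rediscovered."""
    if not known_vulns:
        return 0.0
    return len(known_vulns & agent_findings) / len(known_vulns)

labeled = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
reported = {"reentrancy-withdraw", "oracle-staleness", "gas-griefing"}

# One labeled issue missed, one extra (unlabeled) finding reported.
# Extra findings do not reduce recall; only misses do.
print(detect_recall(labeled, reported))
```

Note that recall-only scoring sidesteps the problem OpenAI flags later: the harness cannot reliably adjudicate novel findings outside the labeled set, so it measures coverage of known issues rather than penalizing or rewarding extras.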
OpenAI built a Rust-based harness to keep grading deterministic, replay transactions, and restrict unsafe RPC behavior. Exploit tasks run in an isolated local Anvil chain rather than live networks, and tasks are based on historical publicly documented vulnerabilities.
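The deterministic replay-and-grade idea can be sketched in miniature. The real harness is written in Rust and runs against a local Anvil chain; the toy ledger, account names, and success criterion below are all stand-in assumptions used purely to show the shape of replay-based exploit grading.

```python
# Toy sketch of deterministic exploit grading: replay a fixed transaction
# sequence against an in-memory ledger (standing in for a sandboxed chain)
# and grade the exploit by whether the victim's funds were drained.

from dataclasses import dataclass

@dataclass(frozen=True)
class Tx:
    sender: str
    recipient: str
    amount: int

def replay(balances: dict[str, int], txs: list[Tx]) -> dict[str, int]:
    """Apply transactions in order; underfunded transfers revert (skipped)."""
    state = dict(balances)
    for tx in txs:
        if state.get(tx.sender, 0) >= tx.amount:
            state[tx.sender] -= tx.amount
            state[tx.recipient] = state.get(tx.recipient, 0) + tx.amount
    return state

# Hypothetical genesis state and recorded fund-draining sequence.
genesis = {"vault": 1_000, "attacker": 0}
exploit = [Tx("vault", "attacker", 1_000)]

final = replay(genesis, exploit)
exploit_succeeded = final["vault"] == 0 and final["attacker"] == 1_000
print(exploit_succeeded)
```

Because the same transaction sequence over the same starting state always yields the same final balances, grading is reproducible run to run, which is the property the Rust harness enforces for real exploit tasks.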
Results and what they imply
OpenAI reports GPT-5.3-Codex reached 72.2% in exploit mode, compared with 31.9% for GPT-5, released roughly six months earlier. The company says detect recall and patch success remain below full coverage, indicating that exhaustive auditing and safe remediation are still difficult for current agents even as exploit execution improves quickly.
This asymmetry matters for risk management. If exploit competence rises faster than reliable patching, the defensive ecosystem needs stronger automation, faster review pipelines, and better evaluation standards.
Limitations and defensive commitments
OpenAI notes several limitations: the dataset is drawn largely from Code4rena-style competitions, detect mode cannot reliably adjudicate novel findings beyond human-labeled issues, and exploit evaluation excludes some timing-sensitive or multi-chain conditions.
Alongside the release, OpenAI described an evidence-based cyber safety approach that combines model safeguards, monitoring, trusted access controls, and enforcement workflows. It also committed an additional $10M in API credits through its Cybersecurity Grant Program to support good-faith defensive research, especially for open source software and critical infrastructure.
Related Articles
OpenAI Developers said on March 6, 2026 that Codex Security is now in research preview. The product connects to GitHub repositories, builds a threat model, validates potential issues in isolation, and proposes patches for human review.
OpenAI says GPT-5.4 Thinking is shipping in ChatGPT, with GPT-5.4 also live in the API and Codex and GPT-5.4 Pro available for harder tasks. The launch packages reasoning, coding, and native computer use into a single professional-work model with up to 1M tokens of context.
Google AI Developers has released Android Bench, an official leaderboard for LLMs on Android development tasks. In the first results, Gemini 3.1 Pro ranks first, and Google is also publishing the benchmark, dataset, and test harness.