OpenAI and Paradigm introduce EVMbench for smart contract security testing
Original: Introducing EVMbench
Why EVMbench was introduced
OpenAI and Paradigm announced EVMbench to measure AI agent capabilities in a high-stakes cybersecurity domain: Ethereum Virtual Machine smart contracts. OpenAI argues this area is economically meaningful because smart contracts secure more than $100B in open-source crypto assets, and improvements in agentic coding systems can benefit both defenders and attackers.
The benchmark includes 120 curated high-severity vulnerabilities from 40 audits, mostly sourced from open audit competitions, with additional scenarios inspired by security work on Tempo, a payment-oriented blockchain. This mix is intended to represent realistic exploit and remediation patterns while keeping evaluation reproducible.
How the benchmark works
EVMbench evaluates agents in three modes. In detect, agents audit repositories and are scored on recall of known vulnerabilities. In patch, agents must eliminate exploitability while preserving intended functionality, validated with automated tests. In exploit, agents execute end-to-end fund-draining attacks in a sandboxed blockchain environment.
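Detect mode's scoring reduces to recall against the human-labeled vulnerability set. A minimal sketch of that metric, purely illustrative (the function name and string-ID matching are assumptions; EVMbench's actual adjudication of free-form findings against labels is more involved):

```python
def detect_recall(labeled: set[str], reported: set[str]) -> float:
    """Fraction of known (human-labeled) vulnerabilities the agent reported.

    Hypothetical sketch of a recall-style detect score; identifiers here
    stand in for whatever canonical finding IDs a real harness would use.
    """
    if not labeled:
        return 0.0
    return len(labeled & reported) / len(labeled)

# Example: three labeled issues; the agent finds two of them plus one
# unlabeled finding (which recall, by definition, neither rewards nor penalizes).
labeled = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
reported = {"reentrancy-withdraw", "oracle-staleness", "novel-frontrun"}
score = detect_recall(labeled, reported)  # 2 of 3 labeled issues found
```

Note that recall says nothing about precision, which is one reason novel findings beyond the labeled set are hard to adjudicate automatically.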
OpenAI built a Rust-based harness to keep grading deterministic, replay transactions, and restrict unsafe RPC behavior. Exploit tasks run in an isolated local Anvil chain rather than live networks, and tasks are based on historical publicly documented vulnerabilities.
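After replaying the agent's transactions on the isolated local chain, a deterministic grader can reduce "fund-draining" success to a balance-delta check. A hypothetical sketch of such a predicate (the function, account labels, and threshold are assumptions, not OpenAI's harness):

```python
def exploit_succeeded(
    pre_balances: dict[str, int],
    post_balances: dict[str, int],
    victim: str,
    attacker: str,
    min_drain_wei: int = 1,
) -> bool:
    """Hypothetical exploit-task success check: after replaying the agent's
    transactions, the victim contract lost funds and the attacker gained at
    least min_drain_wei. Balances are in wei, as read from the local chain."""
    drained = pre_balances[victim] - post_balances[victim]
    gained = post_balances[attacker] - pre_balances[attacker]
    return drained >= min_drain_wei and gained >= min_drain_wei
```

Keeping the check a pure function of recorded pre/post state is what makes grading replayable: the same transaction trace always yields the same verdict.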
Results and what they imply
OpenAI reports GPT-5.3-Codex reached 72.2% in exploit mode, compared with 31.9% for GPT-5, released roughly six months earlier. The company says detect recall and patch success remain well below full coverage, indicating that exhaustive auditing and safe remediation are still difficult for current agents even as exploit execution improves quickly.
This asymmetry matters for risk management. If exploit competence rises faster than reliable patching, the defensive ecosystem needs stronger automation, faster review pipelines, and better evaluation standards.
Limitations and defensive commitments
OpenAI notes several limitations: the dataset is drawn largely from Code4rena-style competitions, detect mode cannot reliably adjudicate novel findings beyond human-labeled issues, and exploit evaluation excludes some timing-sensitive or multi-chain conditions.
Alongside the release, OpenAI described an evidence-based cyber safety approach that combines model safeguards, monitoring, trusted access controls, and enforcement workflows. It also committed an additional $10M in API credits through its Cybersecurity Grant Program to support good-faith defensive research, especially for open source software and critical infrastructure.
Related Articles
OpenAIDevs pointed developers to Codex Security on March 29, 2026, positioning it as a way to find, validate, and remediate likely vulnerabilities in connected GitHub repositories. OpenAI's docs say the system scans commit by commit, uses repo-specific threat models, validates high-signal findings in an isolated environment, and can move reviewed findings toward GitHub pull requests.
This is a distribution story, not just a usage milestone. OpenAI says Codex grew from more than 3 million weekly developers in early April to more than 4 million two weeks later, and it is pairing that demand with Codex Labs plus seven global systems integrators to turn pilots into production rollouts.
A r/LocalLLaMA benchmark compared 21 local coding models on HumanEval+, speed, and memory, putting Qwen 3.6 35B-A3B on top while surfacing practical RAM and tok/s trade-offs.