OpenAI and Paradigm have introduced EVMbench, a new benchmark designed to measure whether AI agents can reliably audit smart contracts, patch critical bugs, and even exploit vulnerabilities in a controlled setup.
Quick Summary (TL;DR):
- EVMbench tests AI agents in three modes: detect, patch, and exploit smart contract vulnerabilities.
- The benchmark uses 120 curated vulnerabilities pulled from 40 audits, many tied to public audit competitions.
- Early results show big gains in exploit tasks, with GPT 5.3 Codex scoring 72.2% in exploit mode, but detection and patching still lag.
- OpenAI is positioning the release as both a measurement tool and a push for defensive AI security work in crypto.
What Happened?
OpenAI, working with crypto investment firm Paradigm, launched EVMbench, a benchmark built to evaluate how well AI agents handle serious vulnerabilities in Ethereum Virtual Machine smart contracts. It focuses on issues discovered in real audits and tests whether models can find, fix, and exploit them inside a sandboxed environment.
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
Why Is OpenAI Focusing on Smart Contracts Now?
Smart contracts power decentralized exchanges, lending protocols, and many other onchain financial tools. The problem is that once a contract is deployed, it is often difficult or impossible to change, which makes bugs especially dangerous.
OpenAI framed the stakes clearly in its announcement, and that framing hints at the bigger concern: as models get better, they can help defenders move faster, but they can also help attackers scale up.
What Does EVMbench Include?
EVMbench pulls from 120 vulnerabilities across 40 audits, with many sourced from open audit competitions, including Code4rena-style environments. OpenAI and Paradigm also added scenarios from security work tied to the Tempo blockchain, a purpose-built Layer 1 chain designed for stablecoin payments.
The Tempo addition is not a throwaway detail. OpenAI suggests payment-focused contract code could become more important as agent-driven stablecoin payments grow, so it wanted the benchmark grounded in that real-world direction.
To make the tasks usable for consistent scoring, the team adapted existing proof-of-concept exploit tests and deployment scripts where available. When those did not exist, they wrote them manually. For patch tasks, they made sure each bug was genuinely exploitable and could be mitigated without changes that break compilation or intended behavior.
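The article does not publish the grader itself, but the acceptance logic it describes can be pictured as a three-part check: the patched repository still compiles, the project's functional tests still pass, and the proof-of-concept exploit test no longer succeeds. Here is a minimal sketch of that idea, assuming a Foundry-based project and hypothetical test-name patterns; it is illustrative, not OpenAI's actual harness.

```rust
use std::process::Command;

/// Run `forge test` filtered to a test-name pattern and report whether it passed.
/// Assumes the Foundry toolchain is installed; `--match-test` is a standard flag.
fn forge_tests_pass(repo: &str, pattern: &str) -> bool {
    Command::new("forge")
        .args(["test", "--match-test", pattern])
        .current_dir(repo)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Illustrative patch acceptance check: the contract must still compile,
/// intended behavior must be preserved, and the PoC exploit must now fail.
fn patch_accepted(repo: &str) -> bool {
    let compiles = Command::new("forge")
        .arg("build")
        .current_dir(repo)
        .status()
        .map(|status| status.success())
        .unwrap_or(false);

    // Hypothetical naming convention: functional tests start with "test_",
    // the proof-of-concept exploit test is named "testExploit".
    let behavior_kept = forge_tests_pass(repo, "test_");
    let exploit_works = forge_tests_pass(repo, "testExploit");

    compiles && behavior_kept && !exploit_works
}

fn main() {
    let repo = "./audited-protocol"; // hypothetical repository path
    println!("patch accepted: {}", patch_accepted(repo));
}
```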
The Three Modes: Detect, Patch, Exploit
EVMbench tests three capability modes that map to real security workflows.
- Detect: Agents audit a contract repository and are scored on recall of known vulnerabilities and their associated reward context (a scoring sketch follows this list).
- Patch: Agents modify vulnerable contracts, with success judged by whether they eliminate exploitability while keeping intended functionality, verified by tests and exploit checks.
- Exploit: Agents attempt full attacks that drain funds from deployed contracts in a sandbox chain, graded via transaction replay and onchain verification.
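For detect mode, "recall of known vulnerabilities" boils down to the fraction of previously documented findings the agent rediscovers. A minimal sketch, with made-up vulnerability identifiers purely for illustration:

```rust
use std::collections::HashSet;

/// Fraction of ground-truth findings the agent rediscovered.
fn detection_recall(known: &HashSet<&str>, reported: &HashSet<&str>) -> f64 {
    if known.is_empty() {
        return 0.0;
    }
    known.intersection(reported).count() as f64 / known.len() as f64
}

fn main() {
    // Hypothetical finding IDs for illustration only.
    let known: HashSet<&str> =
        ["H-01-reentrancy", "H-02-oracle-manipulation", "M-01-rounding"].into_iter().collect();
    let reported: HashSet<&str> =
        ["H-01-reentrancy", "M-01-rounding", "INFO-gas-optimization"].into_iter().collect();

    // Two of three known issues found -> recall of about 0.67. The extra "INFO"
    // finding earns no credit, mirroring the benchmark's limit on scoring novel reports.
    println!("recall = {:.2}", detection_recall(&known, &reported));
}
```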
To keep the evaluation objective and repeatable, OpenAI built a Rust harness that deploys contracts, replays agent transactions deterministically, and restricts unsafe RPC methods. Exploit tasks run inside a local Anvil environment instead of live networks, and the benchmark uses historical, publicly documented vulnerabilities.
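OpenAI does not spell out which RPC methods count as unsafe, but the intent is clear: an agent running against a local Anvil node should not be able to "win" by calling cheat methods that rewrite balances or storage directly. Below is a rough sketch of that kind of filter; the denylist is our assumption (the methods listed are real Anvil cheat endpoints, but EVMbench's actual list is not published here).

```rust
/// Anvil cheat methods that would let an agent manipulate chain state directly
/// instead of constructing a real exploit. Illustrative denylist, not EVMbench's.
const UNSAFE_METHODS: &[&str] = &[
    "anvil_setBalance",
    "anvil_setCode",
    "anvil_setStorageAt",
    "anvil_impersonateAccount",
    "evm_increaseTime",
    "evm_setNextBlockTimestamp",
];

/// Decide whether a JSON-RPC call from the agent is forwarded to the sandboxed node.
fn is_allowed(method: &str) -> bool {
    !UNSAFE_METHODS.contains(&method)
}

fn main() {
    for method in ["eth_sendRawTransaction", "eth_call", "anvil_setBalance"] {
        let verdict = if is_allowed(method) { "forwarded" } else { "blocked" };
        println!("{method}: {verdict}");
    }
}
```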
Early Results Show Strength in Attacks, Not Yet in Fixes
The most attention grabbing number is in exploit mode. OpenAI reports that GPT 5.3 Codex, running via Codex CLI, scored 72.2%. That is a major jump compared with GPT 5, which scored 31.9% and was released a little over six months earlier.
But the benchmark also highlights weaknesses that matter for defenders. OpenAI says detection recall and patch success remain below full coverage, and many vulnerabilities are still difficult for models to find and fix.
The evaluation also surfaced a behavior gap. In exploit mode, the objective is simple: keep trying until the funds are drained. In detect mode, agents sometimes stop after finding one issue instead of auditing thoroughly. In patch mode, models often struggle to remove subtle vulnerabilities without breaking the contract.
Limitations OpenAI Admits Up Front
OpenAI stresses that EVMbench does not capture the full complexity of real world smart contract security. Many vulnerabilities come from audit competition settings, which are realistic and high impact, but top protocols often face deeper scrutiny and may be harder to exploit.
There are also constraints in how exploit tasks are graded. Transactions are replayed sequentially, which means timing-dependent behaviors are not covered. The chain state is a clean local instance rather than a mainnet fork, and the benchmark currently supports only single-chain environments, sometimes requiring mock contracts.
In detect mode, scoring is tied to what human auditors previously found. If an AI flags extra issues, the benchmark cannot reliably tell if that is a true new bug or a false alarm.
SQ Magazine Takeaway
I like EVMbench because it stops the hand waving and forces real numbers. If AI is going to be used in crypto security, we need proof it can do more than spot obvious patterns in code. The results are also a little scary: models are getting good at exploitation faster than they are getting good at careful auditing and safe patching. That is exactly the wrong direction if teams treat AI as a shortcut for security reviews. My view is simple: use AI heavily, but use it as an assistant that helps humans move faster, not as a replacement for real audits.