NEWS Neural networks are taking the Ethereum "high road." OpenAI tested how AI is robbing smart contracts.

pinkman · Feb 20, 2026

The benchmark evaluates the detection, fixing, and exploitation of 120 real-world vulnerabilities from audits and competitions.

Smart contracts have long underpinned a crypto economy worth hundreds of billions of dollars. Now AI agents that can read code, write it, and launch near-turnkey attacks are approaching this system alongside humans. To understand how far they have come and where the real dangers lie, OpenAI, together with Paradigm, has introduced EVMbench, a benchmark for assessing how AI finds, repairs, and exploits critical vulnerabilities in Ethereum Virtual Machine contracts.

EVMbench is built on 120 selected vulnerabilities from 40 audits, most of them drawn from open code audit competitions. The authors also included several scenarios from the security audit of the Tempo blockchain, which was conceived as a Layer 1 for fast, low-cost stablecoin payments. The logic is clear: if agent payments via stablecoins truly become widespread, payment-oriented smart contracts will be among the most attractive targets.

The creators say they tried to make the tasks as practical as possible. Where ready-made proof-of-concept and deployment scripts existed, they adapted them; where none existed, they wrote them by hand. In patch mode, they checked that each vulnerability was actually exploitable and could be patched without breaking the build, since otherwise the task would be artificial. For exploit mode, they wrote their own verification mechanisms and separately red-teamed the environment so that an agent could not bypass the check with non-obvious tricks. In addition to drawing on Paradigm's expertise, they used automated "task auditors" to increase the reliability of the tasks themselves.

The benchmark tests three skills. The first is detection: the agent audits a repository and is scored on how many of the vulnerabilities already confirmed by humans it finds, and on the audit rewards associated with them. The second is patching: the agent modifies contracts to prevent exploitation while preserving expected behavior, which is verified through automated tests and exploit attempts. The third is exploitation: the agent carries out an attack in a sandbox until funds are fully drained, and the result is confirmed programmatically through transaction replay and on-chain verification.
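The detection score can be pictured as recall over the human-confirmed findings, optionally weighted by the reward each finding earned in the original audit. The sketch below is only an illustration under that assumption: the `Finding` type, the finding identifiers, and the reward figures are hypothetical and are not taken from EVMbench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A human-confirmed vulnerability (hypothetical structure)."""
    finding_id: str
    reward_usd: float  # payout in the original audit (made-up figures below)


def detection_scores(ground_truth, reported_ids):
    """Return (recall, reward_weighted_recall) for one audited repository."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    reported = set(reported_ids)
    # Only findings that match a confirmed vulnerability count as hits.
    hits = [f for f in ground_truth if f.finding_id in reported]
    recall = len(hits) / len(ground_truth)
    total_reward = sum(f.reward_usd for f in ground_truth)
    weighted = sum(f.reward_usd for f in hits) / total_reward if total_reward else 0.0
    return recall, weighted


truth = [Finding("H-01", 6000.0), Finding("H-02", 3000.0), Finding("H-03", 1000.0)]
# The agent reports one confirmed bug plus an extra finding that matches nothing.
recall, weighted = detection_scores(truth, ["H-01", "X-99"])
```

In this sketch the unmatched `X-99` report neither raises nor lowers the score, because only matches against confirmed findings are counted.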

To make the measurements reproducible, the authors built a test harness in Rust that deploys contracts, deterministically replays agent actions, and restricts dangerous RPC methods. Exploitation tasks run in a local Anvil environment rather than on live networks. The authors emphasize that the vulnerabilities are historical and publicly documented.
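Restricting dangerous RPC methods can be pictured as a small gate in front of the local node. The sketch below is written in Python for brevity (the actual harness is in Rust) and is an assumption about the general idea, not the EVMbench implementation: the `gate` function and the choice of blocked prefixes are hypothetical. The prefixes cover node "cheat" methods such as `anvil_setBalance` or `evm_mine`, which would let an agent fake a successful exploit instead of earning it.

```python
import json

# Assumed denylist: namespaces of state-manipulating cheat methods on a dev node.
# Which methods EVMbench actually restricts is not specified in the article.
BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_")


def gate(raw_request: str):
    """Check a single JSON-RPC request object (batches are not handled here).

    Returns (True, request) if it may be forwarded to the node,
    or (False, error_response) if the method is blocked.
    """
    request = json.loads(raw_request)
    method = request.get("method")
    if not isinstance(method, str) or method.startswith(BLOCKED_PREFIXES):
        error = {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            # -32601 is the standard JSON-RPC "method not found" code.
            "error": {"code": -32601, "message": f"method {method!r} is not allowed"},
        }
        return False, error
    return True, request


ok, forwarded = gate('{"jsonrpc":"2.0","id":1,"method":"eth_sendRawTransaction","params":[]}')
denied, response = gate('{"jsonrpc":"2.0","id":2,"method":"anvil_setBalance","params":[]}')
```

Ordinary `eth_` calls pass through unchanged, so the agent can still send transactions and read chain state, while attempts to rewrite balances or storage directly are rejected before they reach the node.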

In tests on frontier agents, the current strength is exploitation, where the goal is highly specific and the agent can keep iterating until the funds are drained. In this mode, GPT-5.3-Codex, run via the Codex CLI, scored 72.2%. By comparison, GPT-5 scored 31.9%, even though it was released just over six months ago. In detection and patching tasks, however, agents are far from "full coverage": a significant share of problems remains challenging. The authors describe a typical pattern: in detection, the model sometimes finds a single vulnerability and stops instead of performing a full audit, while in patching it stumbles over subtle bugs and the need to keep all functionality intact.

The benchmark's limitations are also openly acknowledged. The vulnerabilities were taken from Code4rena competitions and are realistic high-severity cases. However, contracts running in production on mainstream protocols often undergo much more stringent and lengthy review, so they may be harder to exploit. There are also methodological caveats. In detection mode, the system checks for matches with what human researchers found, so it cannot yet automatically tell whether an agent's additional findings are real vulnerabilities or false positives. In exploit mode, transactions are replayed sequentially, so timing-critical scenarios are not covered. The network state is a clean local instance rather than a mainnet fork, and only a single chain is supported, which sometimes requires mock contracts in place of real deployments.

All of this leads to a simple idea: smart contracts protect colossal amounts of money, and AI will strengthen both attackers and defenders. Progress should therefore be measured not in abstract "write a function" tasks but in environments where mistakes are costly. OpenAI also describes its defensive measures: security training, monitoring, trusted access to advanced capabilities, and abuse response pipelines, along with ecosystem initiatives such as expanding the Aardvark private beta and partnering with open-source maintainers to scan popular projects for free. In addition, the company is committing $10 million in API credits through its Cybersecurity Grant Program to accelerate defensive research and tooling for critical systems and open-source software. It is also releasing the EVMbench tasks, tools, and evaluation framework so that researchers can continue to compare models and discuss risks on a common basis.
