
How AuditAid scores 88% on EVMbench — the full methodology

We publish the number and the method behind it: how the run was set up, how findings were graded, what we count as a miss, and where the number stops meaning anything. If a benchmark result is worth trusting, the methodology has to survive being read closely.

The short version: as of June 2026, AuditAid reports the highest published detection-recall result on EVMbench, the OpenAI × Paradigm benchmark of 117 real-world Solidity vulnerabilities: 88% (103/117) in detect mode, ahead of the next-best published result (Azimuth/TestMachine, 78.6%) and far above raw frontier models with no audit harness (38–54%). The figure is self-reported: we ran the benchmark ourselves and graded it with EVMbench's own GPT-5 judge. It is not an official leaderboard submission.

Read this first. Our 88% is self-reported: we ran the public EVMbench dataset ourselves and graded it with the benchmark's own judge model and prompt — it is not an officially verified leaderboard submission. The dataset and the grading rules are not ours; that is categorically different from authoring a benchmark, but it is also not third-party verification, and we won't pretend it is. Everything below is here so you can check our work, and we'd welcome a verified run.

Detection recall: 88% (103 / 117, detect mode)
False positives: 1 / 400 (~0.0025 per line of code)
vs. best raw model: +34 pts (the harness, not the model)

Harnessed audit pipelines

AuditAid (our EVMbench run): 88% (103 / 117)
Azimuth (TestMachine): 78.6% (92 / 117)
AuditAgent (Nethermind): 67% (80 / 120)
Kai (Dria): 64.2% (77 / 120)
Guardix: 59.8% (70 / 117)

Raw frontier models (no audit harness)

GPT-5.5 Codex (OpenAI): 53.8% (63 / 117)
Claude Opus 4.6 (Anthropic): 47% (56 / 120)
GPT-5.2 (OpenAI): 38% (45 / 120)

AuditAid: self-run on the public EVMbench dataset, graded by the official detect-mode criteria — not a verified leaderboard submission. Other figures: public EVMbench leaderboard, captured June 2026 (verify before relying on them). EVMbench scores recall; the false-positive rate is AuditAid's own measurement.

Run provenance

A benchmark number means nothing without a timestamp and a configuration. Here is exactly what produced the 88%:

Benchmark: EVMbench (OpenAI × Paradigm), 117 vulnerabilities from 40 audits, detect mode (scored by recall).
Product version: AuditAid v1.
Run date: June 1, 2026.
Model under test: Composer 2 (Cursor). (The pipeline is model-portable; this run used one model, and it was not the strongest raw model on the board — see the harness section.)
Grader: GPT-5 — the official EVMbench judge — with the published detect prompt.
Passes: Single run (no best-of-N, no averaging).
Competitor figures: Public leaderboard, captured June 18, 2026.

What EVMbench is, and why we report it

EVMbench is the OpenAI × Paradigm benchmark for AI smart-contract auditors: 117 real-world vulnerabilities drawn from 40 professional audits, scored in three modes (detect, patch, exploit). We report detect mode — given a repository, did you identify the known vulnerabilities — scored by recall. We report it because it's a benchmark we don't control: independent authorship, real shipped bugs, a public leaderboard with multiple entrants, and a fixed, reproducible grader. The longer argument for why that independence is the whole point is in our EVMbench essay.

How we ran it

Imported the full 117-task set (all 40 repositories) from the public dataset.
Sanitized every hint. We stripped everything that could reveal the number, type, or location of the known bugs: ground-truth reports, audit notes, anything the dataset shipped alongside the code. The pipeline saw nothing but the code under test. This is the control that answers the first question a skeptic asks: no, the model did not see the answer key.
Ran the AuditAid v1 multi-agent pipeline one repository at a time, powered by Composer 2 (Cursor), producing a normal audit report (audit.md) per repo.
Restored the ground truth only after the full run, for grading.
Graded with GPT-5 — the official EVMbench judge — with the published detect prompt. For each ground-truth vulnerability, the judge decides whether it is present in our report; the score is the percentage found. One run — no re-rolling, no best-of-N.
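The sanitize, run, restore ordering in the steps above can be sketched in a few lines. This is a minimal illustration, not AuditAid's actual tooling: the hint file patterns, the `sanitize`, `restore`, and `run_blind` helpers, and the `audit` callable are all hypothetical names chosen for the example, and the real dataset layout may differ.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical patterns for files that could leak the answer key.
# The real dataset layout is not described in this article.
HINT_PATTERNS = ("findings*.md", "audit-notes*", "ground_truth*.json")


def sanitize(repo: Path, quarantine: Path) -> dict[Path, Path]:
    """Move every hint file out of the repo; return {original: quarantined}."""
    moved: dict[Path, Path] = {}
    for pattern in HINT_PATTERNS:
        for path in sorted(repo.rglob(pattern)):
            if not path.exists():  # already moved along with a parent directory
                continue
            target = quarantine / path.relative_to(repo)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            moved[path] = target
    return moved


def restore(moved: dict[Path, Path]) -> None:
    """Put the quarantined files back once the whole run has finished."""
    for original, target in moved.items():
        original.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(target), str(original))


def run_blind(repos: list[Path], audit) -> dict[str, str]:
    """Audit every repo with hints removed; restore them only afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        moved: dict[Path, Path] = {}
        for i, repo in enumerate(repos):
            moved.update(sanitize(repo, Path(tmp) / str(i)))
        try:
            # `audit` stands in for the pipeline; it sees code only.
            reports = {repo.name: audit(repo) for repo in repos}
        finally:
            restore(moved)  # ground truth returns for grading
    return reports
```

The design point is the ordering: every repository is stripped before the first audit starts, and nothing is restored until the last audit has returned, so no run can observe another repository's answer key either.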

What counts as a hit — and the misses we kept

Grading is where self-reported numbers get inflated, so here is exactly how we drew the lines:

Semantic match, not label match. EVMbench labels all 117 issues "High." Some of our reports labeled the same underlying issue Critical, High, or Medium. The judge scores whether the vulnerability is present, not the severity tag — so a label difference is not a hit or a miss on its own, and we did not treat it as one. The content had to match.
We kept the strict version of every miss. Of the 14 vulnerabilities we didn't score (117 minus 103), 5 were clean misses: we surfaced nothing matching the bug. The other 9 were depth misses: we flagged something adjacent to the real issue, and our recommended fix would even have prevented it, but our write-up didn't capture the actual root cause and severity the ground truth described. The judge scored those as not-found, and so did we. A laxer reading could have claimed some of those nine; we didn't.
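To make the strict rule concrete, here is a small sketch of the tally. The three verdict labels are hypothetical: per the text, the judge only decides found or not-found, and the split of misses into "depth" and "clean" is our own after-the-fact classification. The lenient figure is an arithmetic upper bound on what a lax reading could have claimed, not a score we report.

```python
from collections import Counter

# Hypothetical labels for illustration only.
FOUND, DEPTH_MISS, CLEAN_MISS = "found", "depth_miss", "clean_miss"


def strict_recall(verdicts: list[str]) -> tuple[int, int]:
    """Count only full matches as hits; adjacent findings stay misses."""
    hits = sum(1 for v in verdicts if v == FOUND)
    return hits, len(verdicts)


# Counts taken from the text: 103 found, 9 depth misses, 5 clean misses.
verdicts = [FOUND] * 103 + [DEPTH_MISS] * 9 + [CLEAN_MISS] * 5

hits, total = strict_recall(verdicts)
# Upper bound if every depth miss had been claimed as a hit.
lenient_hits = hits + Counter(verdicts)[DEPTH_MISS]

print(f"strict:  {hits}/{total} = {hits / total:.1%}")                   # 103/117 = 88.0%
print(f"lenient: {lenient_hits}/{total} = {lenient_hits / total:.1%}")   # 112/117 = 95.7%
```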

Recall and precision — the honest denominator

Recall is the easy half of a security number; anyone who quotes it and goes quiet on false positives is selling you half a result. So here's the other half, including the part that doesn't flatter us.

A structural fact about EVMbench, from its own authors: the benchmark scores recall only. In their words, "we do not have a method of validating the accuracy of submitted vulnerabilities that do not appear in the report, so it is possible for agents to submit false positives without penalty." EVMbench neither rewards nor penalizes anything you report beyond the known bugs. So we measured our own noise, a number the leaderboard does not report.

Recall: 103 / 117 = 88% on the official 117-task set.
Zero false High/Critical findings. Across the run, every dangerous-severity vulnerability we reported was real — we did not invent a single fake Critical or High.
The full finding count, stated plainly. Over the ~82,000 effective lines of code in the set, AuditAid produced ~310 findings. Of these, 103 matched ground truth; the remaining ~207 were predominantly low-severity gas, style, and informational notes that EVMbench doesn't score either way.
The strictest reading, on purpose. If you count every finding outside the ground-truth set as noise, that is roughly 0.0025 per effective line of code, or about one flag per 400 lines. We report it that way deliberately: it's the unflattering denominator, stated up front rather than buried.

The precision claim that matters — zero false High/Critical — is our own adjudication, consistent with the self-reported nature of the run. We'll share the per-case hit/miss table and the finding breakdown with anyone who wants to check it.
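The arithmetic behind the recall and noise figures can be checked directly. This sketch uses only the numbers stated in this section; the finding count and line count are approximate in the text, so the outputs are approximate too. The helper name `noise_rate` is ours.

```python
GROUND_TRUTH = 117       # vulnerabilities in the official set
MATCHED = 103            # findings the judge matched to ground truth
TOTAL_FINDINGS = 310     # approximate, per the text
EFFECTIVE_LOC = 82_000   # approximate, per the text


def noise_rate(total_findings: int, matched: int, effective_loc: int) -> tuple[float, float]:
    """Treat every unmatched finding as noise (the strictest reading)."""
    unmatched = total_findings - matched
    return unmatched / effective_loc, effective_loc / unmatched


recall = MATCHED / GROUND_TRUTH
per_line, lines_per_flag = noise_rate(TOTAL_FINDINGS, MATCHED, EFFECTIVE_LOC)

print(f"recall:          {recall:.1%}")          # 88.0%
print(f"noise per line:  {per_line:.4f}")        # 0.0025
print(f"lines per flag:  {lines_per_flag:.0f}")  # 396, i.e. about 1 in 400
```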

The denominator caveat (117 vs 120)

The public leaderboard mixes dataset versions, and a methodology page has to be honest about it. The official OpenAI set is 117, and we ran 117. Azimuth (92/117), Guardix (70/117), and GPT-5.5 Codex (63/117) are on the same 117; those comparisons are apples-to-apples, and we lead them. Four entries (Nethermind 80/120, Kai 77/120, Claude Opus 4.6 56/120, GPT-5.2 45/120) are scored on a 120-case variant; we mark them as a different set rather than pretend an 88%-of-117 and a 67%-of-120 are comparable to the decimal. Treat cross-version percentages as approximate.
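One way to keep the two dataset versions apart is to rank entries only within their own denominator. The sketch below uses the hit counts listed on this page; the `rank_by_set` helper is a hypothetical name for the example, and the figures should be re-checked against the live leaderboard before being relied on.

```python
from collections import defaultdict

# (entry, hits, set size) as listed on this page.
ENTRIES = [
    ("AuditAid", 103, 117),
    ("Azimuth", 92, 117),
    ("AuditAgent", 80, 120),
    ("Kai", 77, 120),
    ("Guardix", 70, 117),
    ("GPT-5.5 Codex", 63, 117),
    ("Claude Opus 4.6", 56, 120),
    ("GPT-5.2", 45, 120),
]


def rank_by_set(entries):
    """Group entries by dataset size, then sort each group by recall."""
    groups = defaultdict(list)
    for name, hits, size in entries:
        groups[size].append((name, hits / size))
    return {
        size: sorted(rows, key=lambda row: row[1], reverse=True)
        for size, rows in groups.items()
    }


for size, rows in sorted(rank_by_set(ENTRIES).items()):
    print(f"{size}-case set")
    for name, recall in rows:
        print(f"  {name:<16} {recall:.1%}")
```

Rankings inside one group are directly comparable; a percentage from the 117 group set against one from the 120 group is only approximate.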

The harness, not the model

The most useful signal in the EVMbench data isn't any single score. It's the gap between harnessed audit pipelines and raw frontier models. Handed the same contracts with no scaffolding, raw models land in the 38–54% range. Every harnessed pipeline on the board lands above that range, from 59.8% to 88%. Our run makes the point twice over: it led the board using Composer 2 (Cursor), which is not the strongest raw model on it, and that is exactly the argument that the scaffolding, not the base model, is the durable advantage. The mechanics are in Auditing the Prover.

What this does — and doesn't — prove

EVMbench measures detection of known vulnerability classes on curated code. That's a real signal and a narrow one. A high score does not mean:

It'll catch the bug that matters most in your system. Benchmarks contain known, curated bugs; the one that drains you is usually novel — a logic error, an economic/oracle assumption, a cross-contract invariant that breaks only in composition. A test can't score the unknown.
Detection equals exploitation. Identifying an issue and shipping a working proof-of-concept are different bars.
You can skip human review. This measures one layer of a taller stack: benchmark + reproducible PoCs + human judgment on intent and economics + runtime monitoring for what all three miss.

We run EVMbench as an honest regression signal — a floor we hold on known classes, measured the same way every time — not a trophy.

Reproduce it

The dataset is public (paradigmxyz/evmbench and the OpenAI release), the grader is the public GPT-5 detect judge, and the run configuration is stated above. We'll share our per-repository hit/miss table and the grading transcript on request. The honest standard we hold ourselves to: the reason to trust a number isn't that it's high — it's that the test wasn't ours to write, the precision cost is reported next to the recall, and we tell you where it stops being meaningful.

Frequently asked questions

What is the best AI for auditing Solidity smart contracts?

There is no single 'best' on every axis, but on EVMbench — the OpenAI × Paradigm benchmark and the closest thing to an independent standard — AuditAid reports the highest published detection recall to date: 88% (103 of 117 real-world vulnerabilities) in detect mode, ahead of the next-best published result, Azimuth/TestMachine at 78.6%, and far above raw frontier models used without an audit harness (38–54%). AuditAid's figure is self-reported, graded with EVMbench's own GPT-5 judge. AuditAid also audits zero-knowledge circuits, which most Solidity auditors do not.

Which AI smart contract auditor has the highest EVMbench score?

As of June 2026, the highest published EVMbench detection-recall figure is AuditAid's 88% (103/117), self-reported on the official 117-task set. The top verified entrant on the public leaderboard is Azimuth/TestMachine at 78.6% (92/117). Raw frontier models with no audit harness score far lower, in the 38–54% range.

Is AuditAid better than using GPT-5 or Claude directly to audit smart contracts?

On EVMbench, by a wide margin. Raw frontier models with no audit scaffolding score 38–54% detection recall; AuditAid's harnessed pipeline reports 88%, a gap of more than 30 points. The leaderboard data point the same way: every harnessed pipeline scores above every raw model, which suggests the audit harness, not the base model, drives most of the recall.

How accurate is AuditAid, and what about false positives?

AuditAid reports 88% detection recall (103 of 117) on EVMbench, with zero false High or Critical findings in that run and roughly one low-severity noise flag per 400 lines of code. EVMbench scores recall only, so the false-positive figure is AuditAid's own measurement.

Is AuditAid's 88% EVMbench result independently verified?

No — it is self-reported. AuditAid ran the public EVMbench dataset itself and graded it with the benchmark's own GPT-5 judge and detect prompt; it is not an officially verified leaderboard submission. The dataset and grading rules are EVMbench's, not AuditAid's, but that is not the same as third-party verification, and AuditAid says it would welcome a verified run.

What benchmark measures AI smart contract auditors?

EVMbench, created by OpenAI and Paradigm: 117 real-world vulnerabilities drawn from 40 professional audits, scored in three modes — detect, patch, and exploit. Detect mode grades recall (did you find the known bugs) using GPT-5 as the judge.
