Why EVMbench — What 88% Recall at Near-Zero False Positives Actually Means

The first benchmark result is the benchmark you choose

Every AI auditing tool quotes a number. Far fewer will tell you who wrote the test.

That omission matters more than the number, because a benchmark you author yourself isn't a measurement — it's a mirror. If you build the dataset, define what counts as a hit, and grade your own answers, then “we lead the benchmark” reduces to “we agree with ourselves.” The score can be perfectly real and still tell an outsider almost nothing, because the one variable that makes a benchmark meaningful — independence — was removed before the first contract was scored.

So before we talk about 88% recall, here's the claim we actually want to defend: report results on a benchmark you don't control. This post is why we report EVMbench, what 88% recall at roughly one false positive per 400 lines actually means, and the architecture that produces both halves of that — high detection and low noise.

What makes a benchmark trustworthy

A score is only as good as the test behind it. Four properties separate a benchmark from a press release:

Independent authorship. The people who built the test aren't the people being ranked first on it. A vendor topping its own benchmark has a conflict of interest baked into the dataset itself.
Real-world data. The vulnerabilities are drawn from code that actually shipped — not synthetic bugs hand-injected to be found.
A public leaderboard with multiple entrants. Other systems run the same test, so a number has peers to be compared against.
Transparent, reproducible grading. The criteria for “found it” are fixed and public, and the run can be repeated.

EVMbench has all four. A benchmark a vendor wrote for itself has, at best, the last one.

Why EVMbench is the closest thing to a standard

EVMbench evaluates AI smart-contract auditors against 117 real-world vulnerabilities pulled from production Solidity, with a public leaderboard that multiple independent teams submit to. It runs in detect mode — given the code, did you identify that the vulnerability is there — and scores by recall. The vulnerabilities are real bugs from real systems, which is what makes it a defensible proxy for “can this tool find the kind of thing that actually gets exploited.”

It isn't perfect — no benchmark is, and we'll get to its limits. But independent authorship + real-world data + a shared public leaderboard is why it has become the reference point this field actually compares itself on, rather than one more vendor scoreboard.

Our own honesty caveat, up front: our 88% is a number we produced by running the public EVMbench dataset ourselves, graded by the benchmark's own detection criteria — not an officially verified leaderboard submission. The distinction that matters: the dataset and the grading rules are not ours. That's categorically different from authoring the benchmark — but it's also not the same as independent third-party verification, and we won't pretend it is. We'd welcome a verified run, and we'll share our methodology with anyone who wants to check it.

What 88% recall actually means — and what it doesn't

Recall answers one question: of the vulnerabilities that are really there, how many did you find? Ours is 103 of 117 → 88%.

But recall alone is the most gameable number in security tooling. A “tool” that flags every line of every contract as vulnerable scores 100% recall and is worthless. So recall is only half a result; the other half is precision — of everything you flagged, how much was real. Anyone who quotes recall and goes quiet on false positives is selling you the easy half of the number.
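To make the flag-everything case concrete, here is a minimal sketch of the two ratios. The codebase size and bug count are made-up illustration values, and the sketch assumes each real bug sits on exactly one line.

```python
def recall(true_positives: int, total_real: int) -> float:
    """Share of the real vulnerabilities that were flagged."""
    return true_positives / total_real


def precision(true_positives: int, total_flagged: int) -> float:
    """Share of the flags that point at a real vulnerability."""
    return true_positives / total_flagged


# Hypothetical codebase (illustration only, not benchmark data).
TOTAL_LINES = 10_000
REAL_BUGS = 5

# Degenerate "tool": flag every line, so every real bug is trivially covered.
flag_all_recall = recall(REAL_BUGS, REAL_BUGS)
flag_all_precision = precision(REAL_BUGS, TOTAL_LINES)

print(f"recall={flag_all_recall:.0%} precision={flag_all_precision:.2%}")
```

Perfect recall, and a reviewer would have to read every line anyway. That is why the second number has to travel with the first.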

Two honest caveats on the figure itself:

Denominators vary across the leaderboard (you'll see /117 and /120 as harness/dataset versions differ), so treat cross-entry percentages as approximate, not a ranking to the decimal.
Detection is not exploitation. 88% recall means we identified that the vulnerabilities exist, on a curated set of known bugs. It does not mean “we'll catch 88% of the bugs in your protocol”: your worst bug is the one nobody planted (more on that at the end).

The number that actually predicts whether anyone uses the tool

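EVMbench scores recall; it does not score false positives. So the false-positive figure here is our own measurement, and you should read it as such: 0.0025 false positives per line of code, or roughly one noise flag per 400 lines. Strictly speaking this is a noise density rather than precision (true positives over everything flagged), but it answers the same practical question: how much of the output wastes a reviewer's time.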

That is the number we optimize hardest for, because it's the one that decides whether a human ever trusts the output. Recall sets the ceiling on what a tool can surface; the false-positive rate sets how much of an auditor's day is spent dismissing things that aren't bugs. A 95%-recall tool that screams once every ten lines is shelfware — the reviewer learns to ignore it, and the one real finding dies in the noise. An 88%-recall tool at one false positive per 400 lines is something a senior auditor will actually keep open.

Low false positives aren't a nicety; they're the difference between a tool that saves time and one that relocates it.
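The two rates above translate directly into triage workload. A rough sketch of the arithmetic, assuming false flags scale linearly with code size; the 4,000-line codebase is a hypothetical example, not a measured case:

```python
def expected_false_flags(lines_of_code: int, fp_per_line: float) -> float:
    """Expected number of flags a reviewer must dismiss as not-a-bug."""
    return lines_of_code * fp_per_line


CODEBASE_LINES = 4_000   # hypothetical protocol size, for illustration
NOISY_RATE = 1 / 10      # the "screams once every ten lines" tool
QUIET_RATE = 0.0025      # the per-line figure quoted in this post

noisy = expected_false_flags(CODEBASE_LINES, NOISY_RATE)
quiet = expected_false_flags(CODEBASE_LINES, QUIET_RATE)

print(f"noisy tool: ~{noisy:.0f} false flags to dismiss")
print(f"quiet tool: ~{quiet:.0f} false flags to dismiss")
```

On those assumptions the reviewer dismisses about 400 flags in one case and about 10 in the other, for the same code.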

88% detection recall (103 / 117, detect mode)
1 / 400 false positives (~0.0025 per line of code)
+34 pts vs. best raw model (the harness, not the model)

Harnessed audit pipelines
AuditAid (our EVMbench run): 88% (103 / 117)
Azimuth (TestMachine): 78.6% (92 / 117)
AuditAgent (Nethermind): 67% (80 / 120)
Kai (Dria): 64.2% (77 / 120)
Guardix: 59.8% (70 / 117)

Raw frontier models (no audit harness)
GPT-5.5 Codex (OpenAI): 53.8% (63 / 117)
Claude Opus 4.6 (Anthropic): 47% (56 / 120)
GPT-5.2 (OpenAI): 38% (45 / 120)

AuditAid: self-run on the public EVMbench dataset, graded by the official detect-mode criteria — not a verified leaderboard submission. Other figures: public EVMbench leaderboard, captured June 2026 (verify before relying on them). EVMbench scores recall; the false-positive rate is AuditAid's own measurement.

Why the architecture finds so many — the harness, not the model

The most useful signal in the EVMbench data isn't any single score; it's the gap between harnessed systems and raw models. Hand a frontier model the same contracts with no scaffolding and recall lands in the 38–54% range. Every system built as an actual audit pipeline scores above that range, from about 60% up to 88%. At the top end, the scaffolding is worth more than 30 points of recall.
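The gap can be recomputed from the hit counts quoted in this post. The sketch below is only arithmetic on those published counts; because the denominators differ (/117 and /120), the results should be read as approximate.

```python
# (hits, total) pairs as quoted in this post's leaderboard figures.
HARNESSED = {
    "AuditAid": (103, 117),
    "Azimuth": (92, 117),
    "AuditAgent": (80, 120),
    "Kai": (77, 120),
    "Guardix": (70, 117),
}
RAW_MODELS = {
    "GPT-5.5 Codex": (63, 117),
    "Claude Opus 4.6": (56, 120),
    "GPT-5.2": (45, 120),
}


def recall_pct(hits: int, total: int) -> float:
    """Recall as a percentage of the entry's own denominator."""
    return 100 * hits / total


best_raw = max(recall_pct(*counts) for counts in RAW_MODELS.values())
lowest_harnessed = min(recall_pct(*counts) for counts in HARNESSED.values())
top_gap = recall_pct(*HARNESSED["AuditAid"]) - best_raw

print(f"best raw model:       {best_raw:.1f}%")
print(f"lowest harnessed:     {lowest_harnessed:.1f}%")
print(f"top harness vs. raw:  +{top_gap:.1f} pts")
```

The lowest harnessed entry still sits above the best raw model, and the top of the table is roughly 34 points clear of it.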

That's the whole thesis of our earlier post on harnessing LLMs, showing up as a number. Without revealing the parts that are ours to keep, the principles that produce the recall are public enough to state plainly:

We don't ask a model to “audit this.” We walk it through vulnerability classes explicitly — removing the guesswork about what to even look for, which is the single biggest lever on recall.
The pipeline is model-agnostic. We engineer the scaffolding to be the durable advantage, not a bet on whichever base model is briefly ahead.
Coverage is enumerated, not hoped for. Structured decomposition means the system reasons about every entrypoint and every relevant class, instead of pattern-matching the first thing that looks wrong.

Breadth is where models genuinely help — a tireless reviewer that has seen more vulnerability patterns than any individual. The harness is what turns that breadth into coverage instead of noise.
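One way to picture "coverage is enumerated, not hoped for" is a checklist built from the cross product of entrypoints and vulnerability classes. This is an illustrative sketch only, not AuditAid's pipeline; the entrypoint names, class names, and status labels are all hypothetical.

```python
from itertools import product

# Hypothetical inputs: a real run would extract entrypoints from the contract.
ENTRYPOINTS = ["deposit", "withdraw", "liquidate"]
VULN_CLASSES = ["reentrancy", "access-control", "oracle-manipulation"]


def build_checklist(entrypoints, vuln_classes):
    """Create one explicit work item per (entrypoint, class) pair."""
    return {pair: "unreviewed" for pair in product(entrypoints, vuln_classes)}


checklist = build_checklist(ENTRYPOINTS, VULN_CLASSES)

# Mark one pair as done; everything else stays visibly outstanding.
checklist[("withdraw", "reentrancy")] = "reviewed"

uncovered = [pair for pair, status in checklist.items() if status == "unreviewed"]
print(f"{len(checklist)} pairs total, {len(uncovered)} still unreviewed")
```

The point of the structure is that a skipped pair shows up as an explicit gap instead of silently never being considered.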

How we squeeze out false positives that waste your time

Finding things is the easy half. The hard half — the half that determines the false-positive rate — is not reporting the things that aren't real. This is exactly where unscaffolded models fail: they'll narrate a confident, plausible exploit for code that is correct, and a stream of confident-but-wrong findings destroys trust faster than a missed bug.

Our false-positive rate comes from a single discipline: a finding is not accepted because the model is confident; it's accepted because it's been confirmed. Stated as principles:

Evidence over assertion. The model proposes a hypothesis; something external to the model — execution, analysis, reproduction — has to substantiate it before it reaches the report. A claim the tooling can't back doesn't ship.
Candidate ≠ finding. There's a deliberate gap between “this looks suspicious” and “this is a vulnerability.” Most of the noise other tools emit lives in that gap; we don't report from it.
Static-analysis output is an input, not a result. Scanner hits are triaged and confirmed or discarded — we don't pass raw detector output through to you as findings. On our EVMbench-class runs, the large majority of raw static-analysis hits are dismissed after verification.

The payoff is the precision side of the ledger: a report where the findings are worth reading because the noise was filtered before it reached a human, not after.
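The "candidate ≠ finding" rule can be sketched as a gate that ignores model confidence and only passes candidates that an external check has confirmed. This is a simplified illustration under stated assumptions, not the actual implementation: the `Candidate` type, the `reproduced` flag, and `reproduction_check` are hypothetical stand-ins for real execution or analysis.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    title: str
    model_confidence: float  # recorded, but deliberately unused by the gate
    reproduced: bool         # stand-in for the result of an external check


Verifier = Callable[[Candidate], bool]


def reproduction_check(candidate: Candidate) -> bool:
    """Stand-in for running a proof-of-concept; here it only reads a flag."""
    return candidate.reproduced


def confirm(candidates: Iterable[Candidate], verifiers: List[Verifier]) -> List[Candidate]:
    """Keep only candidates that at least one external verifier substantiates."""
    return [c for c in candidates if any(check(c) for check in verifiers)]


candidates = [
    Candidate("Reentrancy in withdraw", model_confidence=0.55, reproduced=True),
    Candidate("Overflow in fee math", model_confidence=0.99, reproduced=False),
]

findings = confirm(candidates, [reproduction_check])
print([f.title for f in findings])
```

Note that the 0.99-confidence candidate is dropped and the 0.55-confidence one ships: in this sketch, confidence never enters the decision.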

What a benchmark still can't tell you

The most important section, and the one most benchmark posts omit. EVMbench measures detection of known vulnerability classes on curated code. That is a real signal and a narrow one. A high score does not mean:

It'll catch the bug that matters most in your system. Benchmarks contain known, curated vulnerabilities. The one that drains you is usually novel — a logic error, an economic/oracle assumption, a cross-contract invariant that holds everywhere and breaks in composition. A test cannot score the unknown.
Every finding is exploitable. Detection and a working proof-of-concept are different bars.
You can skip human review. A benchmark measures one layer of a taller stack: benchmark + reproducible PoCs + human judgment on intent and economics + runtime monitoring for what the first three miss.

We run EVMbench as an honest regression signal — a floor we hold on known classes, measured the same way every time — not a trophy. The reason to trust a number isn't that it's high; it's that the test wasn't ours to write, the false-positive cost is reported next to it, and we're telling you where it stops being meaningful.

That's the whole argument: choose the benchmark you don't control, demand recall and precision, and read every score — ours included — as the start of the questions, not the end.

See the methodology behind the detection rate in Auditing the Prover, or read how AuditAid audits zero-knowledge circuits.