Auditing the Prover: What LLMs Can and Can't Do for Zero-Knowledge Security

Why zero-knowledge auditing is a different problem

Most smart-contract bugs are bugs of commission: a missing access check, a reentrancy, an unchecked return. You can usually point at the offending line. Zero-knowledge circuit bugs are overwhelmingly bugs of omission — a constraint that was never written. Nothing on the screen is wrong. Something is simply absent, and a malicious prover gets to choose what fills the gap.

That single fact reshapes everything about auditing these systems, including where AI helps and where it quietly fails.

A circuit defines a relation between signals. The prover supplies a witness — concrete values for every signal — and a proof attests that those values satisfy every constraint. The verifier never sees the witness; it checks only the proof. So the security question is never “does the happy path work?” It is: what is the full set of witnesses the constraints accept, and is any member of that set a lie?

Two properties matter, and they fail in opposite directions:

Completeness — honest proofs are accepted. Break it and you get a liveness bug: legitimate users can’t transact.
Soundness — dishonest proofs are rejected. Break it and you get a fund-loss bug: an attacker mints, double-spends, or forges.

Soundness is the property that pays for exploits, and it is precisely the one you cannot test your way to. Test vectors run the honest witness generator, so every output comes back correct and every test is green — while an under-constrained signal sits there, exploitable, invisible to the suite.

The canonical example

template BadSquare() {    signal input in;    signal output out;     out  in * in;     // assigns, constrains nothing    // out === in * in;   // the missing constraint}

Hover or tap the <-- operator.

The <-- operator assigns a value during witness generation — it adds no constraint to the system. A prover running their own modified generator can put anything in out and still produce a valid proof. The fix is the === constraint (or <==, which does both). The tests pass either way.

This generalizes into a small set of bug classes that dominate real ZK incidents:

Under-constrained signals — values the prover can set freely.
The off-chain gap — checks enforced in the witness generator (JS/Rust/Go) but never re-asserted in-circuit; the attacker skips the generator entirely.
Proof replay and forgery — missing public-input binding, domain separation, or chain/contract/recipient binding; nullifier reuse.
Verifier ⇄ circuit drift — the deployed verifier, the verification key, and the circuit no longer describe the same statement.
zkVM program-ID substitution — a verifier that checks a proof is valid but not which program produced it (SP1/RISC0 image IDs).

Finding these requires reasoning about absence and about global structure: a single signal’s safety can depend on a constraint several files away. That is exactly where language models are weakest.

Where LLMs genuinely help

It’s easy to be either breathless or dismissive about LLMs in security. The accurate picture is narrower. Models are strong at:

Breadth and recall. A model has seen more vulnerability patterns than any individual auditor and never tires on hour six of a large codebase. For triage and first-pass enumeration, that breadth is real.
Intent reconstruction. Given a circuit and its surrounding code, a model is good at articulating what the author meant to prove — the necessary baseline for spotting what they failed to enforce.
Hypothesis generation. “Which signals look like they could be unconstrained?” is a generative question, and models are good generators. They surface candidates a human skims past.
Cross-domain translation. Mapping a witness-generator assumption to the in-circuit constraint that should enforce it — or noticing one is missing — is pattern work models do well when pointed at it.

Every one of these is a candidate-producing strength. None is a truth-producing strength. That distinction is the whole game.

Where LLMs fail in security work

Deploy a raw model against circuits and you meet these failure modes within an afternoon:

Reasoning about absence is hard. Models pattern-match what is present. “There is no constraint binding this signal” is a claim about the entire constraint set, not about any token in the file — and models react to text far better than they prove a negative over a global structure.
Field and modular arithmetic errors. Soundness often hinges on behavior near the field modulus, on overflow, or on non-unique representations. Models routinely reason as if values were ordinary integers.
Confident false positives. A model will narrate a plausible exploit for a constraint that is present and correct. A stream of confident-but-wrong findings is not a minor annoyance — it destroys reviewer trust and buries the real bug.
Lost global state. Large circuits exceed what a model reliably tracks at once. Which signals are constrained, by which templates, after n files? That bookkeeping degrades precisely where the hard bugs live.
Happy-path anchoring. Ask “does this work?” and a model reasons about the honest prover. Soundness requires reasoning about the adversarial one, and models drift back to the cooperative reading unless actively held there.
Non-determinism. Run the same prompt twice, get two different finding sets. For an audit deliverable, non-reproducibility is disqualifying on its own.

An unscaffolded model is a fast, well-read, distractible reviewer who hands you ten findings, three of which are imaginary, and cannot tell you which three.

Why the harness matters more than the model

The interesting engineering question is therefore not “which model?” It is: what do you build around the model so its candidate-producing strengths are preserved and its truth-producing weaknesses are neutralized? That scaffolding — the harness — matters more than the choice of base model, which is one reason a serious pipeline is deliberately model-agnostic.

A harness is the structure, tools, and control flow surrounding the model. The principles below are general; they apply to any rigorous AI-for-security system.

1. Decompose open-ended judgment into bounded, checkable subtasks. “Audit this circuit” is unanswerable. “For signal x, enumerate every constraint in which it appears and decide whether those constraints uniquely determine it given the public inputs” has a checkable answer. Walking the model through vulnerability classes explicitly — rather than hoping it reasons its way to a methodology — is the single biggest lever on both recall and false-positive rate, because it removes the guesswork about what to even examine.

2. Ground every claim in a tool, not in prose. This separates evidence from assertion. The model proposes; an external tool disposes. Compilers, witness-generator execution, static analyzers, fuzzers, and — for ZK especially — constraint solvers / SMT backends provide ground truth the model cannot fake. The model forms the hypothesis and the query; the tool returns an answer that is true regardless of how confidently the model argued.

3. Make confirmation the bar for acceptance. A candidate under-constraint is not a finding. The bar is non-uniqueness: construct two distinct witnesses that satisfy every constraint for the same public inputs. If both are accepted, the circuit provably fails to determine the output — a soundness break demonstrated rather than asserted, with concrete field values attached. (This determinism/uniqueness framing is well established; Veridise’s Picus and related work formalize it.) Where the structure permits, a formal proof of non-uniqueness is stronger still. The model’s opinion never reaches the report — only a confirmed counterexample or proof does. That verification gate is what collapses the false-positive rate.

4. Externalize state the model can’t hold. Constraint coverage, signal-to-constraint maps, which assumptions are enforced where — track these in explicit structures the harness maintains, not in the context window. The model queries and updates that state; it doesn’t have to remember it.

5. Engineer for reproducibility. Determinism in the harness — fixed task order, the same checks every run, confirmation required for every finding — buys back the reproducibility the model alone can’t provide. Two runs over the same code converge on the same confirmed findings, because findings are gated by tools, not by sampling.

The throughline: the model supplies breadth and hypotheses; tools supply ground truth; structure keeps both honest. A finding is only as trustworthy as the non-model evidence behind it.

A reliable pass, end to end

The shape of a dependable analysis, stripped to its logic:

The model reads a template and flags out as a candidate: it’s assigned but no constraint appears to pin it.
The harness doesn’t take the model’s word for it. It extracts the actual constraint set for out and hands the solver a query: does a second satisfying assignment exist with the same public I/O?
The solver returns two concrete witnesses — out = 4 and a second, field-dependent value — both accepted by the circuit for the same input.
The harness records a confirmed soundness finding with those values and a minimized reproduction. Had the solver instead proven uniqueness, the candidate is dropped silently — no false positive ever surfaces.

Each step the model is good at (noticing, hypothesizing) is delegated to the model; each step that requires truth (does a second witness exist?) is delegated to something that cannot hallucinate.

The limits that remain

Scaffolding is not magic, and the failure modes it doesn’t solve matter as much as the ones it does:

Tool coverage is finite. Solver-backed confirmation is tractable for many circuit patterns and intractable for others; some proof systems and gadgets are far harder to reason about automatically than Circom R1CS. Where the tools can’t reach, you are back to model judgment plus human review.
The model still has to propose the right hypothesis. If it never flags a signal, no confirmation machinery ever examines it. Harnessing raises precision sharply and improves recall, but it does not make recall total.
Formal methods don’t scale uniformly. Non-uniqueness queries can blow up; minimization and timeouts are real engineering constraints, not footnotes.
Specification is the hard part. “Is this signal constrained?” is checkable. “Is this the right thing to constrain for this protocol’s threat model?” is a question about intent that neither the model nor the solver can answer alone.
Cost and latency. Decomposition, tool calls, and confirmation loops cost far more than a single model pass. For audits that trade is worth making — but it is a trade.

Why humans stay in the loop

None of this points to “AI replaces auditors.” It points to a sharper division of labor than either the hype or the skepticism allows:

The harnessed model covers breadth, triages noise, enumerates constraint coverage, and ships confirmed findings with reproductions — quickly and reproducibly.
Tools supply ground truth, so the deliverable contains evidence rather than opinion.
Humans spend scarce attention where judgment is irreducible: the threat model, the protocol’s intent, economic and cross-system soundness, and the novel bug classes for which neither the model nor the tools have a prior. The hardest ZK bugs are not “this signal is unconstrained” — they are “the circuit is internally consistent and still proves the wrong statement.” Catching that requires understanding what the system is for, and that remains human work.

For zero-knowledge systems specifically — where the dangerous bug is an absence, the adversary controls the witness, and you cannot test your way to soundness — this configuration is the only one we’ve found that is both fast and trustworthy. The model is not the auditor. The harness makes the model useful; the human decides what “correct” means.

AuditAid audits circuits, provers, and verifiers end-to-end across Circom, Noir, Halo2, and zkVMs. More on the approach: zero-knowledge circuit audits.