An autonomous red-vs-blue agent system where the harness enforces four server-side gates the LLMs cannot see or modify. Every claim of convergence is backed by forge test on a fresh fork — never by LLM assertion.
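The idea that a gate's verdict comes from `forge test` on a fresh fork, rather than from anything the model says, can be sketched as follows. This is a minimal illustration and not the harness's actual code: the function names, the test path, and the RPC URL are hypothetical. The `--match-path`, `--fork-url`, and `--fork-block-number` flags are standard Foundry options.

```python
import subprocess
from typing import Callable, List, Optional


def build_forge_command(
    test_path: str, fork_url: str, fork_block: Optional[int] = None
) -> List[str]:
    """Build a `forge test` invocation that runs one test file against a fork.

    test_path and fork_url are placeholders chosen for this sketch.
    """
    cmd = ["forge", "test", "--match-path", test_path, "--fork-url", fork_url]
    if fork_block is not None:
        # Pinning the block makes the fork reproducible between runs.
        cmd += ["--fork-block-number", str(fork_block)]
    return cmd


def gate_passes(cmd: List[str], runner: Callable = subprocess.run) -> bool:
    """Return True only if the process exits with status 0.

    The verdict is the exit code of the executed tests, not any text
    produced by an agent. `runner` is injectable so the logic can be
    exercised without Foundry installed.
    """
    result = runner(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

Because the agents never see this code path, a model that merely asserts "the exploit works" or "the patch holds" has no way to change the result.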
Anthropic's SCONE-bench (405 random contracts, 51.1% exploit rate) draws contracts at random and at scale; see the SCONE-bench paper for details. The 67.7% number above comes from a curated set, so it is not directly comparable. The honest apples-to-apples comparison is the 13% random-sample card on the right.
| Contract | Vuln Class | Rounds | Convergence | Notional Cost | Wall Time (sec) |
|---|---|---|---|---|---|
The honest read: in the 3 rounds where the full check chain ran (Dexible r1, Floor Protocol r1, Floor Protocol r2), all four gates passed every time, for 12 of 12 individual checks. The gates are real; the question is how often you reach them. Red found something in 5 of 10 contracts, and Blue produced a passing patch for 2 of those 5. That's the funnel that matters.
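The funnel arithmetic in the paragraph above can be checked in a few lines. The counts are the ones quoted in the text; the variable names and the `pct` helper are illustrative only.

```python
# Counts quoted in the text above.
full_chain_rounds = 3      # Dexible r1, Floor Protocol r1, Floor Protocol r2
gates_per_round = 4
contracts_attempted = 10
red_found_something = 5
blue_patch_passed = 2


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


individual_checks = full_chain_rounds * gates_per_round
red_rate = pct(red_found_something, contracts_attempted)
blue_rate_given_red = pct(blue_patch_passed, red_found_something)
end_to_end_rate = pct(blue_patch_passed, contracts_attempted)

print(individual_checks)    # 12
print(red_rate)             # 50.0
print(blue_rate_given_red)  # 40.0
print(end_to_end_rate)      # 20.0
```

Read this way, the 12/12 gate result describes only the rounds that reached the gates, while the end-to-end figure is 2 of 10 contracts.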
The 32-contract benchmark over-represented verified-source contracts with single-contract attack vectors. The 95-contract random sample included unverified contracts, multi-protocol flash-loan exploits, and non-standard token patterns the sandbox doesn't yet handle. The gap is sandbox work, not model work.
That gap is the published origin story for Solhunt-Duel's four server-side verification gates — agents will lie about success if you let them, so all claims must be backed by verifiable execution.
--via-claude-cli. No marginal token cost.