We don't have published benchmarks yet. Here's why, what honest benchmarks would look like, and how to evaluate Claudit on your own code right now.
Publishing a number without context is worse than publishing nothing. Here's what honest benchmarking for a code auditing tool requires — and why it takes time to do right:
Labelled vulnerability dataset
Status: in progress. A set of real-world code samples with known issues — drawn from public CVE-referenced commits, deliberately introduced vulnerabilities, and manually reviewed codebases. Each sample is labelled with finding type, severity, and file/line location.
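A labelled sample might carry fields like these. This is a hypothetical schema for illustration only — the field names and the placeholder CVE identifier are our assumptions, not Claudit's actual dataset format:

```python
from dataclasses import dataclass

# Illustrative label schema for one dataset sample.
@dataclass
class LabelledFinding:
    sample_id: str      # e.g. a CVE-referenced fix commit (placeholder below)
    finding_type: str   # e.g. "sql_injection", "missing_alt_text"
    severity: str       # e.g. "critical", "high", "medium", "low"
    file: str           # path within the sample codebase
    line: int           # line where the issue occurs

sample = LabelledFinding(
    sample_id="CVE-0000-00000-fix-commit",  # placeholder identifier
    finding_type="sql_injection",
    severity="critical",
    file="app/db.py",
    line=42,
)
```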
Per-auditor precision / recall metrics
Status: planned. Separate metrics for each domain: Security, Accessibility (WCAG), Performance, SEO. Reported with confidence intervals, not just point estimates.
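As one concrete way to report an interval rather than a point estimate, a Wilson score interval over a proportion such as precision is a standard choice. The method here is our illustration of the idea, not a committed methodology:

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion (e.g. precision)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half, centre + half)

# e.g. 45 correct findings out of 50 flagged: precision 0.90,
# but the interval shows how much a small sample leaves uncertain
lo, hi = wilson_interval(45, 50)
```

On 50 samples the interval spans roughly 0.79 to 0.96 — which is exactly why a bare "90% precision" figure would overstate what a small benchmark can support.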
False positive rate
Status: planned. Measured against known-clean code. The ratio of flagged issues that are not real problems — arguably the most important metric for developer trust.
Reproducible harness
Status: planned. An evaluation script runnable against any Claude model version, so we can track performance as the underlying model changes.
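The scoring skeleton of such a harness can be sketched briefly. `run_audit` below is a hypothetical stand-in for whatever invokes the model — the point is the per-sample comparison against labels, repeatable across model versions:

```python
def run_audit(source: str, model: str) -> set[tuple[str, int]]:
    """Placeholder: return (finding_type, line) pairs flagged by the model."""
    raise NotImplementedError  # real harness would call the auditor here

def score(found: set, expected: set) -> dict:
    """Compare flagged findings against labelled ground truth."""
    tp = len(found & expected)  # correctly flagged
    precision = tp / len(found) if found else 0.0
    recall = tp / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}

def evaluate(dataset: list[dict], model: str) -> list[dict]:
    # One score per sample; aggregate however the report requires.
    return [score(run_audit(s["source"], model), s["expected"]) for s in dataset]
```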
The most meaningful benchmark is performance on your codebase. Here's a structured approach:
Pick a codebase you know well
Use a project where you already know about existing issues — bugs you've fixed, security issues that came up in a previous review, or WCAG failures you've already confirmed.
Run the relevant auditors
Use the Quick Scan preset for broad coverage, or target specific auditors (Security, Accessibility, Performance) against the relevant code. Paste the files most likely to contain issues.
Score detection vs. false positives
Count: (a) known issues that were correctly flagged, (b) known issues that were missed, (c) findings that were wrong or inapplicable. That gives you precision and recall for your specific context.
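The three counts above map to precision and recall directly; a tiny helper makes the arithmetic explicit (the function and parameter names are illustrative):

```python
def precision_recall(hits: int, misses: int, false_flags: int) -> tuple[float, float]:
    # hits = known issues correctly flagged (a)
    # misses = known issues that were missed (b)
    # false_flags = findings that were wrong or inapplicable (c)
    precision = hits / (hits + false_flags) if hits + false_flags else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    return precision, recall

# e.g. 8 known issues flagged, 2 missed, 4 spurious findings
p, r = precision_recall(8, 2, 4)  # precision ≈ 0.67, recall = 0.8
```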
Try a known-vulnerable sample
Paste intentionally vulnerable code — SQL injection, missing alt text, render-blocking resources — and verify the auditor catches it with the correct severity and location.
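For example, here is a deliberately vulnerable snippet of the kind you might paste for the SQL injection case. The vulnerability is intentional — a security auditor should flag the interpolated query line with a critical severity; the parameterised version is included for contrast:

```python
import sqlite3

# DELIBERATELY VULNERABLE: user input interpolated directly into SQL.
def find_user(conn: sqlite3.Connection, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"  # SQL injection
    return conn.execute(query).fetchall()

# Safe equivalent: parameterised query, input never touches the SQL text.
def find_user_safe(conn: sqlite3.Connection, username: str):
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```

An input like `x' OR '1'='1` returns every row from the vulnerable function and nothing from the safe one — a quick way to confirm the finding is real before checking whether the auditor reports it.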
See our FAQ for a full breakdown of where Claudit works well and where it doesn't.
Run your own evaluation now
No account required. Paste code you know has issues and see what Claudit finds.
Run a free audit