We don't have published benchmarks yet. Here's why, what honest benchmarks would look like, and how to evaluate Claudit on your own code right now.
Publishing a number without context is worse than publishing nothing. Here's what honest benchmarking for a code auditing tool requires — and why it takes time to do right:
Labelled vulnerability dataset
Status: in progress. A set of real-world code samples with known issues — drawn from public CVE-referenced commits, deliberately introduced vulnerabilities, and manually reviewed codebases. Each sample is labelled with finding type, severity, and file/line location.
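A labelled sample might carry fields like these. This is a hypothetical schema for illustration only — the field names and the placeholder CVE identifier are our assumptions, not Claudit's actual dataset format:

```python
from dataclasses import dataclass

# Illustrative label schema for one dataset sample.
@dataclass
class LabelledFinding:
    sample_id: str      # e.g. a CVE-referenced fix commit (placeholder below)
    finding_type: str   # e.g. "sql_injection", "missing_alt_text"
    severity: str       # e.g. "critical", "high", "medium", "low"
    file: str           # path within the sample codebase
    line: int           # line where the issue occurs

sample = LabelledFinding(
    sample_id="CVE-0000-00000-fix-commit",  # placeholder identifier
    finding_type="sql_injection",
    severity="critical",
    file="app/db.py",
    line=42,
)
```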
Per-auditor precision / recall metrics
Status: planned. Separate metrics for each domain: Security, Accessibility (WCAG), Performance, SEO. Reported with confidence intervals, not just point estimates.
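As one concrete way to report an interval rather than a point estimate, a Wilson score interval over a proportion such as precision is a standard choice. The method here is our illustration of the idea, not a committed methodology:

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion (e.g. precision)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half, centre + half)

# e.g. 45 correct findings out of 50 flagged: precision 0.90,
# but the interval shows how much a small sample leaves uncertain
lo, hi = wilson_interval(45, 50)
```

On 50 samples the interval spans roughly 0.79 to 0.96 — which is exactly why a bare "90% precision" figure would overstate what a small benchmark can support.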
False positive rate
Status: planned. Measured against known-clean code. The ratio of flagged issues that are not real problems — arguably the most important metric for developer trust.
Reproducible harness
Status: planned. An evaluation script runnable against any Claude model version, so we can track performance as the underlying model changes.
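The scoring skeleton of such a harness can be sketched briefly. `run_audit` below is a hypothetical stand-in for whatever invokes the model — the point is the per-sample comparison against labels, repeatable across model versions:

```python
def run_audit(source: str, model: str) -> set[tuple[str, int]]:
    """Placeholder: return (finding_type, line) pairs flagged by the model."""
    raise NotImplementedError  # real harness would call the auditor here

def score(found: set, expected: set) -> dict:
    """Compare flagged findings against labelled ground truth."""
    tp = len(found & expected)  # correctly flagged
    precision = tp / len(found) if found else 0.0
    recall = tp / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}

def evaluate(dataset: list[dict], model: str) -> list[dict]:
    # One score per sample; aggregate however the report requires.
    return [score(run_audit(s["source"], model), s["expected"]) for s in dataset]
```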
The most meaningful benchmark is performance on your codebase. Here's a structured approach:
Pick a codebase you know well
Use a project where you already know about existing issues — bugs you've fixed, security issues that came up in a previous review, or WCAG failures you've already confirmed.
Run the relevant auditors
Use the Quick Scan preset for broad coverage, or target specific auditors (Security, Accessibility, Performance) against the relevant code. Paste the files most likely to contain issues.
Score detection vs. false positives
Count: (a) known issues that were correctly flagged, (b) known issues that were missed, (c) findings that were wrong or inapplicable. That gives you precision and recall for your specific context.
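The three counts above map to precision and recall directly; a tiny helper makes the arithmetic explicit (the function and parameter names are illustrative):

```python
def precision_recall(hits: int, misses: int, false_flags: int) -> tuple[float, float]:
    # hits = known issues correctly flagged (a)
    # misses = known issues that were missed (b)
    # false_flags = findings that were wrong or inapplicable (c)
    precision = hits / (hits + false_flags) if hits + false_flags else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    return precision, recall

# e.g. 8 known issues flagged, 2 missed, 4 spurious findings
p, r = precision_recall(8, 2, 4)  # precision ≈ 0.67, recall = 0.8
```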
Try a known-vulnerable sample
Paste intentionally vulnerable code — SQL injection, missing alt text, render-blocking resources — and verify the auditor catches it with the correct severity and location.
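For example, here is a deliberately vulnerable snippet of the kind you might paste for the SQL injection case. The vulnerability is intentional — a security auditor should flag the interpolated query line with a critical severity; the parameterised version is included for contrast:

```python
import sqlite3

# DELIBERATELY VULNERABLE: user input interpolated directly into SQL.
def find_user(conn: sqlite3.Connection, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"  # SQL injection
    return conn.execute(query).fetchall()

# Safe equivalent: parameterised query, input never touches the SQL text.
def find_user_safe(conn: sqlite3.Connection, username: str):
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```

An input like `x' OR '1'='1` returns every row from the vulnerable function and nothing from the safe one — a quick way to confirm the finding is real before checking whether the auditor reports it.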
See our FAQ for a full breakdown of where Claudit works well and where it doesn't.
Run your own evaluation now
No account required. Paste code you know has issues and see what Claudit finds.
Run a free audit