The Accuracy Numbers Legal AI Vendors Cite Don't Mean What You Think

Legal AI vendors cite 90%+ accuracy rates, but what do those numbers actually measure? Discover why these claims are often unverifiable and what to ask.

Every legal AI vendor has a number. Ninety-three percent accuracy. Ninety-seven percent. Some claim higher. The figures appear in sales decks, on landing pages, in pitches to law firm managing partners. They are almost always unverifiable, and the industry has quietly decided that's fine.

This is worth examining now, not because accuracy claims are new to software marketing, but because the stakes in legal AI are different and rising. Enterprise adoption is accelerating. Firms are embedding these tools in research workflows, document review, contract analysis, due diligence. Liability questions that were once theoretical are becoming concrete. And still, the number on the slide remains untested by anyone with an incentive to test it honestly.

The verification problem is not technically difficult. It is commercially inconvenient.

A standardised benchmark for legal AI accuracy would need to do several things: define what "accurate" means across different task types (retrieval, summarisation, reasoning, citation), establish test sets that are representative of real legal work rather than curated showcase queries, and be administered by someone other than the vendor whose product is being measured. None of this is beyond reach. The medical device industry manages it. Financial modelling tools face external audit requirements. There is no principled reason legal AI should be exempt.

The reason it remains exempt is that harder accuracy standards threaten the market in three ways simultaneously.

First, they slow deployment. If a vendor had to demonstrate verified accuracy on an independently constructed benchmark before selling to a large firm, development cycles would lengthen and go-to-market timelines would extend. In a market where first-mover advantage is real, that's a serious commercial problem.

Second, they flatten differentiation. Right now, "ninety-seven percent accurate" is a meaningful-sounding claim. If every vendor's product were tested against the same benchmark, the differences between them might turn out to be smaller than the marketing suggests, or larger, but in uncomfortable directions. Either outcome is bad for vendors who have built their pitch around the number.

Third, they force honest conversations about hallucination in high-stakes work. Legal AI systems still hallucinate. Not rarely: routinely, in ways that are not always obvious. A system that confidently cites a case that doesn't exist, or faithfully states a legal proposition from a decision that was overruled two years ago, is not ninety-seven percent accurate in any sense that matters to a practitioner preparing advice. But vendors test for what makes their systems look good, not for the failure modes that matter most to the people using them.

The result is a market where accuracy claims function more like brand messaging than technical specifications. Buyers have no reliable way to compare products, and most don't have the internal capacity to run their own evaluations rigorously. They are making procurement decisions on trust, vendor reputation, and demo performance, which is fine for choosing a project management tool and much less fine for tools that touch legal advice.

This will eventually resolve under pressure from one of two directions. The first is regulatory. Australian courts and regulators have been cautious but attentive, and the duty of competence that practitioners owe clients doesn't disappear because the work was partly done by software. If a practitioner cannot assess the accuracy of a tool they're using, reliance on vendor claims is a thin defence when something goes wrong. The second is litigation. It will take one or two well-publicised cases where legal AI error caused demonstrable client harm for the conversation about verification to become urgent rather than abstract.

The profession shouldn't wait for either of those to arrive before asking harder questions. When a vendor tells you their system is ninety-seven percent accurate, ask: accurate at what? Tested against what corpus? By whom? Over what time window, given that legal databases change? What's the error rate on the task types you actually need it for?

Most vendors will struggle to answer those questions clearly. That discomfort is informative.
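For a firm that does run its own checks, the "accurate at what?" question can be made concrete with a small tally. The sketch below is illustrative only: the task names, the counts, and the helper functions are all invented for the example and describe no real product. It shows how a headline figure can sit comfortably above ninety percent while one task type performs far worse, and how wide the uncertainty is when a task type has only a handful of test queries.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical evaluation results as (task_type, was_correct) pairs.
# These counts are made up for illustration; they are not measurements.
RESULTS = (
    [("retrieval", True)] * 58 + [("retrieval", False)] * 2
    + [("summarisation", True)] * 28 + [("summarisation", False)] * 2
    + [("citation", True)] * 7 + [("citation", False)] * 3
)


def overall_accuracy(results):
    """Single headline number: share of all queries marked correct."""
    return sum(ok for _, ok in results) / len(results)


def accuracy_by_task(results):
    """Return {task_type: (correct, total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])
    for task, ok in results:
        counts[task][0] += int(ok)
        counts[task][1] += 1
    return {task: (c, n, c / n) for task, (c, n) in counts.items()}


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


if __name__ == "__main__":
    print(f"headline accuracy: {overall_accuracy(RESULTS):.0%}")
    for task, (c, n, acc) in accuracy_by_task(RESULTS).items():
        lo, hi = wilson_interval(c, n)
        print(f"{task:14s} {c}/{n} = {acc:.0%}  (95% CI {lo:.0%} to {hi:.0%})")
```

With these invented counts the headline figure is 93 percent, yet the citation task is correct only 7 times in 10, and with ten test queries the plausible range for that rate is very wide. A vendor's single number says nothing about either fact.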

The accuracy problem in legal AI is not unsolvable. It is, for now, being avoided. The profession has more leverage than it realises to demand better.

Habeas publishes its methodology for Australian legal research tasks, and practitioners running their own accuracy checks can do so directly on the platform.

Hero image: Brusk Dede on Unsplash
