Results Are In: 5 Lessons from an Independent Study of aiR for Review

Generative AI has sparked plenty of conversation in legal circles. Some of it is optimistic. Some of it is skeptical. Much of it is theoretical.

So we decided to test it at Redgrave LLP.

We conducted a head-to-head study comparing Relativity aiR for Review, a generative AI review workflow, against a traditional first-pass managed review workflow using active learning. We chose a deliberately difficult document population: approximately 45,000 documents from a real-world public data set, reviewed under a nuanced responsiveness standard tied to pharmaceutical marketing, controlled substances, and federal compliance obligations.

In other words, documents were not responsive simply because they mentioned opioids, sales activity, or drug promotion. They had to contain evidence related to compliance with, violation of, or reckless disregard of federal requirements. That made the exercise less about finding documents “about” a topic and more about applying judgment to the contents of the document.

A subject-matter expert then conducted a blind review of a random sample to establish the ground truth against which both workflows were measured.

The full report gets into the methodology, statistical analysis, and validation process in detail. But for legal teams thinking about how generative AI may fit into real review workflows, five practical lessons stood out.

1. Judge speed by the quality of the result.

The efficiency difference in the study was striking. The aiR for Review workflow required approximately 18 hours of attorney time, while the active learning managed review workflow required approximately 1,123 hours from a 24-person team over seven business days.

That represents a roughly 98 percent reduction in cumulative human hours.
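The arithmetic behind that figure is easy to check. The short sketch below uses the two approximate hour totals reported above; the function name is illustrative and not taken from the study.

```python
def percent_reduction(baseline_hours: float, new_hours: float) -> float:
    """Return the percentage reduction from baseline_hours to new_hours."""
    if baseline_hours <= 0:
        raise ValueError("baseline_hours must be positive")
    return (baseline_hours - new_hours) / baseline_hours * 100


# Approximate totals reported in the article.
managed_review_hours = 1123  # active learning managed review
air_hours = 18               # aiR for Review attorney time

reduction = percent_reduction(managed_review_hours, air_hours)
print(f"{reduction:.1f}% fewer cumulative human hours")  # prints 98.4%
```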

Time savings at that scale are hard to ignore. But speed alone is not the full story. The more important point is that aiR for Review achieved those efficiency gains while also delivering higher recall and lower elusion in this study.

In other words, this was not simply a faster way to get through documents. It was a faster workflow that found more responsive material and left fewer important documents behind. That combination is what makes the result meaningful. Faster review is valuable, but faster review that also improves completeness is far more valuable.

The business case for generative AI in review is not just about doing the same work faster. It is about helping legal teams move faster without accepting more risk, while freeing attorneys to spend more time on analysis, strategy, and advocacy.

2. Measure what the workflow misses, not just what it finds.

In the study, aiR for Review achieved 88 percent recall, compared with 64 percent for the active learning workflow. Put plainly, aiR for Review found more of the documents the expert ultimately determined to be responsive.

That recall difference is important because missed documents can carry real consequences. In discovery, the document that is never reviewed may be the one that changes case strategy, informs settlement posture, or creates risk if it surfaces later.

Elusion tells a similar story from another angle. Elusion is the share of documents treated as not responsive that are in fact responsive. aiR for Review’s elusion rate was 1 percent, compared with 3 percent for the active learning workflow. That means the rate at which responsive documents were left behind in the discard pile was about one-third as high.
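For readers less familiar with these metrics, here is a minimal sketch of how recall and elusion are computed from a validation sample. The counts are hypothetical and chosen only for round numbers. They are not the study's data, and the function names are illustrative.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of truly responsive documents that the workflow found."""
    return true_pos / (true_pos + false_neg)


def elusion(false_neg: int, true_neg: int) -> float:
    """Share of the discard pile (coded not responsive) that is actually responsive."""
    return false_neg / (false_neg + true_neg)


# Hypothetical validation sample, NOT figures from the study:
true_pos = 80    # responsive documents the workflow flagged
false_neg = 20   # responsive documents the workflow missed
true_neg = 780   # non-responsive documents correctly left in the discard pile

print(f"recall  = {recall(true_pos, false_neg):.1%}")   # 80.0%
print(f"elusion = {elusion(false_neg, true_neg):.1%}")  # 2.5%
```

Both metrics count the same missed documents but divide by different totals. Recall compares them with everything that was responsive, while elusion compares them with everything that was set aside.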

No review process is perfect, and perfection is not the legal standard. But completeness still matters. A process that leaves fewer responsive documents in the discard pile gives legal teams a stronger foundation for understanding the facts and defending their approach.

3. The most interesting story may be human plus AI.

Finding more responsive documents was only part of the story. The study also showed that aiR for Review could help a human expert reassess documents he had initially coded not responsive.

The subject-matter expert first reviewed the validation sample blind, without seeing aiR for Review’s scores, rationales, or citations. After that review was complete, he revisited a set of documents where he and aiR for Review disagreed: documents he had coded not responsive, but aiR for Review had predicted responsive.

This time, he could see aiR for Review’s rationale and citations. Of the 151 documents he re-reviewed, the expert changed 10 calls to responsive. Put another way, in roughly one out of every 15 of those disagreement documents, aiR for Review’s analysis helped surface something the expert agreed was responsive after further review.

That does not mean the expert deferred to AI. In most instances, he stood by his original decision. But that is exactly what makes the result useful. aiR for Review was not replacing expert judgment; it was focusing attention on specific documents where another look could change the answer.
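The proportions are worth spelling out. A quick sketch using only the two counts reported above (the variable names are illustrative):

```python
re_reviewed = 151           # disagreement documents the expert revisited
changed_to_responsive = 10  # calls the expert changed after seeing the rationale

unchanged = re_reviewed - changed_to_responsive
change_rate = changed_to_responsive / re_reviewed

print(f"changed:   {changed_to_responsive} ({change_rate:.1%})")           # 10 (6.6%)
print(f"about 1 in {re_reviewed / changed_to_responsive:.0f}")             # about 1 in 15
print(f"unchanged: {unchanged} ({unchanged / re_reviewed:.1%})")           # 141 (93.4%)
```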

And in discovery, 10 documents is not necessarily a small number. One document can alter a deposition outline, reshape a timeline, or expose a risk the team had not fully appreciated. The practical value is that aiR for Review uncovered these documents and provided reasoning detailed enough to help an expert reassess close calls.

The value of generative AI, then, is not limited to speed or automation. Used well, it can act as a second lens on the evidence: surfacing documents that deserve closer attention, supporting quality control, and helping legal teams make more confident calls on the documents that matter most.

4. A wider net can be worth it.

aiR for Review also flagged more documents for potential review and had lower precision than the active learning workflow: 29 percent compared with 39 percent. That is a noteworthy tradeoff, and it should not be glossed over.

In this study, aiR for Review cast a wider net. That meant more documents would be routed for potential human review. But that wider net also caught more responsive documents and missed fewer important ones.

The precision numbers also need context. The document population had low richness, meaning responsive documents were relatively rare. In low-richness populations, precision tends to be lower across review workflows because there are simply many more non-responsive documents than responsive ones.
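To see why richness drags precision down, consider a hypothetical workflow whose behavior never changes: it always finds the same share of responsive documents and always mis-flags the same share of non-responsive ones. The rates below are assumptions made for illustration, not measurements from the study.

```python
def precision_at_richness(richness: float, recall: float, false_positive_rate: float) -> float:
    """Precision of a workflow with fixed recall and false-positive rate
    when applied to a population with the given richness (prevalence)."""
    true_pos = recall * richness                         # responsive and flagged
    false_pos = false_positive_rate * (1 - richness)     # non-responsive but flagged
    return true_pos / (true_pos + false_pos)


# Hypothetical workflow: finds 80% of responsive documents and
# wrongly flags 10% of non-responsive ones, regardless of population.
ASSUMED_RECALL = 0.80
ASSUMED_FPR = 0.10

for richness in (0.50, 0.20, 0.05, 0.01):
    p = precision_at_richness(richness, ASSUMED_RECALL, ASSUMED_FPR)
    print(f"richness {richness:>4.0%} -> precision {p:.1%}")
```

The workflow is identical in every row, yet its precision falls steadily as responsive documents become rarer. That is why precision figures from populations with different richness are hard to compare directly.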

So the question is less, “Did aiR for Review flag more documents?” It did. The better question is, “Was the added review burden justified by the reduction in missed responsive material?” On many high-stakes matters, the answer is yes.
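One rough way to size that burden is to convert precision into the number of flagged documents a reviewer would look at for each responsive one, which is simply 1 divided by precision. The sketch below applies that to the two precision figures reported above; the function name is illustrative.

```python
def flagged_per_responsive(precision: float) -> float:
    """Average number of flagged documents reviewed per responsive document found."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return 1 / precision


# Precision figures reported in the article.
reported_precision = {"aiR for Review": 0.29, "active learning": 0.39}

for workflow, precision in reported_precision.items():
    ratio = flagged_per_responsive(precision)
    print(f"{workflow}: about {ratio:.1f} flagged documents per responsive document")
```

On these figures, that is about 3.4 flagged documents per responsive one for aiR for Review and about 2.6 for active learning. This ratio says nothing about the total volume flagged or about how many responsive documents each workflow missed, so it is one input to the tradeoff rather than the answer.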

Legal teams should think about generative AI review as a strategic tool, not a magic filter. In some matters, especially those where missing key documents is the greater risk, a broader review set may be a very reasonable tradeoff.

5. Testing AI on nuanced legal issues is crucial.

These results are meaningful in large part because the review was difficult by design.

Many benchmarks are built around topic identification: Can the system find documents discussing a certain subject? That is useful, but it is not always how real discovery works. Legal review often requires more than recognizing a topic. Reviewers must apply a protocol, understand context, distinguish between documents that are merely related and those that are responsive, and make calls in the gray areas.

That is where this study becomes especially interesting. The review standard required both human reviewers and aiR for Review to evaluate documents against a complex legal rule set. A document could mention sales activity and still be non-responsive. It could mention compliance only in passing and still require closer analysis.

Legal teams do not need AI that only performs well on easy questions. They need tools that can support real matters, where the issues are nuanced, the stakes are high, and the “right” answer may depend on more than a keyword hit.

What This Means for Legal Teams

This study is not the final word on generative AI review, and it should not be read as a universal benchmark for every matter. Review quality will always depend on the data, protocol, workflow, validation design, and the judgment of the people overseeing the process.

But the results are meaningful.

In a difficult, low-richness review requiring application of a nuanced legal standard, aiR for Review found more responsive documents, left fewer behind, and helped an expert identify responsive material he had initially missed. It did so while requiring dramatically less human review time than the managed review workflow tested in the study.

The study does not suggest that generative AI replaces traditional review strategy or human judgment. It shows something more practical: when properly prompted, validated, and integrated into a defensible workflow, it can help teams improve recall, reduce risk, and strengthen the way attorneys apply legal judgment.

For the full methodology, statistical analysis, and detailed findings, read the complete report here.

Graphics for this article were created by Caroline Patterson.

Study: Putting Gen AI to the Test: A Document Review Accuracy Study

Robert Keeling is a co-managing partner of Redgrave LLP and a nationally recognized authority on e-discovery. He serves as discovery counsel in complex, data-intensive matters, including litigation, investigations, and regulatory reviews.