Does causing harm require intentionality? Recklessness? Sentience? Explore the concept of algorithmic bias and how sentiment analysis in RelativityOne was engineered with responsible AI in mind.
AI offers great promise to applications across industries and communities. But it is far from perfect. At the root of that imperfection are the human intelligence and human data that develop and train AI models. Flaws in thinking or asymmetries in information can have a compounding effect. At risk is the real harm that can be inflicted on real people—without, perhaps, a really responsible party—when flawed AI applications are activated.
“The more complicated an AI algorithm is, the more likely we are to see behaviors that are unexpected, particularly when it fails. When it fails in such a way that marginalized groups of people are treated differently by a model, we call this ‘algorithmic bias’ or ‘model bias,’” Brittany Roush, a senior product manager at Relativity, told guests at Relativity Fest in 2022.
“We often point to the model as the problem, but the algorithm isn’t itself racist, or sexist, or any other ‘ist’ you can think of,” she continued. “It is just built on billions of words stemming from humans, and without proper context and management of language, the model can make mistakes.”
Bias in AI-enabled tools can affect the real-life outcomes influenced by technology. In circumstances where those outcomes carry relatively low risk—you’ll probably survive if Netflix insists you rewatch "Emily in Paris"—that bias may cause minimal harm. But in other applications, including those in the legal realm, algorithmic bias could have devastating consequences.
Brittany and her colleagues had those consequences in mind during development of sentiment analysis for RelativityOne. The feature, a top request from users of Relativity, enables legal teams to “look for emotional indicators—negative, positive, anger, grief—in a pile of data to find what matters quickly and see how it can inform an investigation or case strategy,” Brittany explained to us.
It sounds like a straightforward use case for AI; like a satisfying game of whack-a-mole featuring sassy Slacks and irascible emails.
But emotions are fickle things. When they’re spread across mountains of data with little context, they’re difficult to discern—and leaning on improperly tuned AI to do it can have severe consequences.
For case teams confronting thousands—even millions—of pages of data during the discovery phase of litigation or an investigation, getting to the crux of the matter as quickly as possible is essential. Deadlines and budgets are tight. A lot of money and a lot of stress are at stake.
Unfortunately, the solution isn’t as simple as plugging “bad guys discussing fraud” into a search bar and bringing the spiciest results to court to shock a judge and jury. e-Discovery is much more complicated than that, requiring the comprehensive review and disclosure of all documents deemed relevant to each matter. The goal is to build an intimate knowledge of the information those documents contain, craft a robust case strategy, and exchange evidence with opposing counsel to ensure both adversaries have equal access to relevant information and can craft arguments accordingly.
Getting all this done in a manner that enables the “just, speedy, and inexpensive determination of every action and proceeding” (that’s Rule 1 of the Federal Rules of Civil Procedure), using defensible workflows that will withstand judicial scrutiny, is ethically and legally essential.
In the Before Times, this entailed lugging 40-pound bankers’ boxes out from company storage and sorting through paper documents, one memo at a time.
But today, those moldering crags of bankers’ boxes have been subsumed by the orogeny of Mount Data. And the summit keeps growing farther out of reach.
How big is this digital Alp? Consider that a report from the University of California, Berkeley entitled “How Much Information? 2003” estimated that the sum total of spoken language from across human history, up to 2002, amounted to about five exabytes. For scale, one exabyte is about one billion gigabytes, and one gigabyte holds roughly one standard-definition movie.
Ten years later, the internet’s favorite techno-cartoon series, xkcd, estimated that Google alone already had 10 exabytes of active storage on its servers. If we were to pretend that was all 100-minute movies, it would take two million years to watch them all. A tall order for a species that has existed for only about 300,000 years.
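The movie math above checks out with simple arithmetic. A quick sketch, using the article’s own rough conversions (one exabyte is about one billion gigabytes, and one gigabyte is roughly one standard-definition film):

```python
# Back-of-the-envelope check of the xkcd figure: 10 exabytes of
# 100-minute movies, assuming ~1 GB per standard-definition film.
EXABYTES = 10
GB_PER_EXABYTE = 1_000_000_000
MINUTES_PER_MOVIE = 100

total_movies = EXABYTES * GB_PER_EXABYTE          # ~10 billion films
total_minutes = total_movies * MINUTES_PER_MOVIE  # ~1 trillion minutes
years_to_watch = total_minutes / (60 * 24 * 365.25)

print(f"{years_to_watch:,.0f} years")  # roughly 1.9 million years
```

Round that up and you land on the two-million-year figure cited above.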
And now it’s been another ten years, and the exabytes are piling up.
So, yes; data volumes are flabbergasting. But that’s not the only problem mucking up today’s discovery projects. Each of Oracle’s three Vs of big data—volume, variety, and velocity—complicates a review team’s search for the truth. Huge swaths of data can slow them down. But so do new, unexpected data types and the ever-accelerating pace of change in employees’ data-generating habits.
The rise of short message data—those real-time digital conversations conducted via text, Slack, Microsoft Teams, and other chat platforms—poses particularly acute challenges for legal teams and investigators conducting document reviews.
For Ivan Alfaro, product manager in AI services at Relativity, the short summary of those challenges is: “e-Discovery professionals need to collect and organize communications from the same custodian that took place across different platforms and were stored in multiple sources and formats.”
Considering that the goal of a document review for e-discovery is to take a holistic look at a corpus of data to understand its relevance to the matter at hand—what happened, who was involved, whether wrongdoing exists—this proliferation and disparity of data types and sources presents downright oppressive complexities. And there is a growing sense of urgency in unraveling them. Short message data is exploding in prevalence; RelativityOne has seen a 300 percent increase in the amount of short message data coming into the platform year over year.
“We also forecast that around 2024, the amount of short message data stored in RelativityOne will surpass the volume we currently have in emails,” Ivan told me.
Mountainous (and growing) volumes of data, spread across proliferating formats and sources, with distinct differences in tone and conversational styles, on a very tight deadline … Not to mention you then have to interpret it all.
Take a moment to imagine your own communication habits. What mental calculations do you make as you prepare to contact a colleague or client with a question or request? What equation do you use to determine which platform or tone is best—and how you can escalate or deescalate conversations already in progress?
Do you speak to your boss the same way you would a peer?
Is your writing style the same in an email to a client as in a comment on Instagram?
Probably (hopefully?) not.
Do you choose which communication platform you’ll use to contact a colleague depending on your subject matter—say, whether it’s a serious topic to discuss, a complaint to file, or praise to share? Will you switch platforms if a conversation takes an unexpected turn, or you need to get to a resolution more quickly? Have you (be honest) ever sent a very finely written email and then immediately chatted the recipient to confirm receipt of said Very Important Email? (Guilty over here.)
Case teams know that short message data can host the more candid, more emotive conversations between coworkers, so it seems logical to give those messages attention sooner, right? People are more likely to reveal their less-than-ideal behavior in an informal Slack conversation than they would be to send an email about it.
Contemporary e-discovery software helps review teams rapidly flag that type of conversation for closer examination. In platforms like RelativityOne, features like technology-assisted review (TAR) have been on the scene for more than a decade and have become essential in cutting through the noise to flag the records most likely to reveal the story in question. These early steps of data analysis might take review projects from a million documents down to a few thousand.
But that’s still a large number. Then what?
Additional artificial intelligence tools promise to create further efficiencies for case teams whittling down those thousands into the lucky few exhibits that matter. This is where sentiment analysis comes into play.
Did you know there are more than 200 recognized breeds of horse around the world? Three hundred breeds of dog? One thousand breeds of cattle? Many of these creatures were bred for agricultural purposes, others for companionship; some are hyper-specialized for jobs like aquatic or avalanche rescue, pulling heavy loads through the dank corridors of a coal mine, or waging war.
Though it’s tough to pin down a precise number of AI models in use today, it’s safe to say the number is much bigger than any of the above. And while any new model is in development, responsible data scientists will continuously evaluate its “fit for purpose.” This guides their selection of the data and inputs they use to train and refine it. Much like a professional breeder selects only the healthiest dogs with the most desirable characteristics to produce the next generation of pups, these development teams carefully engineer their algorithms with their intended use in mind—making data selections and training decisions accordingly.
During this fine-tuning, developers must ask: Is the model being trained specifically for the discrete application the target user intends to implement? Do its strengths, weaknesses, and data inputs reflect that use case?
Aron Ahmadia, Relativity’s senior director of applied science, considers this thought process an essential step for ethical model building in the world of AI.
When an AI model is not fit for the purpose to which a user intends to apply it, that’s known as an alignment problem.
In such cases, a model may have been tested and evaluated as good or even great for one use case, but when applied to another, it falls woefully short of minimum viability.
“Alignment problems come down to users having a view of what makes a model ‘good’—and the person who built it having a different view,” Aron explained.
This is a high-risk problem in the context of sentiment analysis for legal use cases because algorithmic bias introduced by an alignment problem means a model could surface different pieces of digital evidence than a user may need or expect.
In the earliest iteration of building sentiment analysis for RelativityOne, the team leveraged a ready-made, commercial AI model that was built and trained by a third-party vendor. This is common practice; many software providers integrate such models into their platforms. That’s what the commercial models are for; why reinvent the wheel every time?
Models like these are often trained on large data sets from online sources such as Twitter or product review archives. This is, in part, because they represent a huge variety of sentiments, subject matter, and speakers; the hope is that all that variation helps “educate” the AI by the sheer scale of information.
When Aron joined the team on the tail end of development for that first sentiment analysis prototype in RelativityOne’s sandbox, he opted to test the model for bias in the specific context of sentiment analysis for e-discovery and investigations. And what he discovered gave him pause.
“Initially, I thought it would be okay to use a commercial model, but was adamant that we should measure the model for bias before releasing,” he recounted. “As it turned out, we got uncomfortable with how much we found in that testing. And ultimately, that realization started more conversations—many of them uncomfortable—around what it would mean to tell customers we weren’t quite ready to release this feature yet after all.”
And thus what should have been the eve of its launch into the real world became more of a “dark night of the soul” for sentiment analysis in RelativityOne.
On the plus side, Aron said, customers quickly got on board with the change in plans upon learning what was at stake.
“It was easy for us to see why we had to say ‘no’ to that first model, but then we had to share that decision with other people—including customers,” Aron said. “Fortunately, when we explained what drove the decision, our customers affirmed us quickly: ‘Please don’t give us a model that isn’t ready.’”
Unwilling to end up with the same results by simply using a different commercial model—and knowing that those models were typically built for much different use cases (think marketing applications or social media monitoring)—the team opted to build their own model.
An entirely new, fit-for-purpose AI model. For the legal discovery process. From scratch.
Crafting an AI algorithm is a mixed-media artform—a mash of computer science, mathematics, linguistics, statistics, engineering, coding, data science, and all manner of nuance.
“I thought it would take years, reinventing AI to do what we wanted to do,” Brittany, the product manager, told me.
To prevent the next iteration from recreating the same problem, the team needed to uncover why bias existed in the first prototype of RelativityOne’s sentiment analysis engine. With that diagnosis in hand, they could dissect the problem and compensate for it in building the two-point-oh solution.
So the Relativity team reached out to beta users—as well as their own in-house data scientists—to examine the prototype’s biased outputs. This exercise revealed the heart of the issue: biased inputs.
In short, building a new algorithm to minimize bias required digging deeply into a key library of terms to educate the new model on their unbiased meanings—meanings that weren’t loaded by the way the words were used by trolls and Twitterers.
“The approach we took was a lot more manual than most people would think of data science work. A lot of people think we pour a lot of data in, swirl it around a little bit, and after a few weeks, we have a model,” Aron explained when asked what this process looked like. “But we got very aggressive, looking through the model’s vocabulary word by word to see where it rated negative and positive sentiment.”
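The word-by-word audit Aron describes can be sketched in miniature. Everything in this snippet—the lexicon, the scores, and the identity-term list—is an illustrative placeholder, not Relativity’s actual data or tooling:

```python
# Illustrative sketch of a sentiment-lexicon audit: surface vocabulary
# entries where an identity-related term carries a non-neutral score.
# All scores and term lists here are invented for demonstration.
sentiment_lexicon = {
    "fraudulent": -0.9,
    "excellent": 0.8,
    "nigerian": -0.4,   # red flag: a nationality learned as "negative"
    "meeting": 0.0,
}

# Hypothetical list of protected-class terms to check against the lexicon.
identity_terms = {"nigerian", "muslim", "gay", "female"}

def audit(lexicon, identity_terms, tolerance=0.05):
    """Return identity terms whose learned score strays from neutral."""
    return {
        term: score
        for term, score in lexicon.items()
        if term in identity_terms and abs(score) > tolerance
    }

flagged = audit(sentiment_lexicon, identity_terms)
print(flagged)  # {'nigerian': -0.4} — a candidate for neutralization
```

Scaled up to thousands of terms, each flagged entry becomes a judgment call for human reviewers, which is exactly where the team’s manual, word-by-word effort came in.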
Although the team on this project consisted of data scientists, product managers, user experience experts, and many others, they determined that they needed more voices in the process to ensure bias was eliminated as holistically as possible.
“Even after we carefully curated inputs, we still found examples of the model trying to learn behaviors we didn’t think were appropriate,” Aron said. “So we took one more pass, engaging with as many voices as we could—diverse members of [our] customer community and Relativity—to help us identify which terms we wanted in and which we didn’t.”
Using “the data scientist’s best friend”—an Excel spreadsheet—Aron, Brittany, and the rest of the team collected 6,000 terms of note for further examination. They then recruited members of Relativity’s Community Resource Groups to help evaluate them for neutral, positive, or negative sentiment.
These volunteers brought diverse viewpoints to help ensure terms related to protected classes like race, nationality, sexual orientation, and religion were tagged and included—or excluded—appropriately for training the algorithm.
“This group had to go through and look at each word, imagine how it could be used in context and what that sentiment would sound like, and evaluate accordingly. It wasn’t easy,” Aron explained. “You end up with difficult words, by nature, and we left many of those in because they truly could indicate negative sentiment. But others are less clear, and some involve protected groups like race, nationality, sexual identity—and those were important for the model not to judge to determine sentiment.”
Roshanak Omrani, a senior data scientist at Relativity, explained that many use cases for sentiment analysis are intended to “quantify the overall polarity of text with respect to a product or service—similar to the mainstream application it has found in review and ranking platforms.”
Think of when you’re shopping on Amazon for, say, a new waffle maker. If you check the reviews from other consumers, you’ll be given a neatly organized list of positive, neutral, and negative reviews; the page may also display common themes from those reviews, such as “temperature control.”
In legal applications, however, “rather than aiming to infer the overall direction of valence in a set of documents, our model is intended to pinpoint polarity on individual documents,” Roshanak continued.
So whereas Amazon uses sentiment analysis to pull overarching themes from reviews and determine how they collectively speak to the quality of a known product, an e-discovery team uses sentiment analysis to identify and prioritize discrete documents as they pertain to the search for truth.
If an AI model is biased in some of its outputs, many non-legal use cases for sentiment analysis are somewhat protected from that bias by the law of averages. This is because the insights gleaned from the AI’s determinations, in those circumstances, are interpreted as indicative of overall trends in a large amount of data.
But in legal applications requiring per-document, or even per-sentence, decisions on sentiment, mistakes can have significant impacts on the final results of a review.
“A sentiment analysis model may up-rank or down-rank certain pieces of data because of mentions of certain characteristics—personal characteristics—in a document,” Roshanak explained. “And this is mainly because of the societal bias that the model observes in training data, where the same negative sentiment can be suppressed or amplified with respect to one individual or a certain group of individuals compared to others.”
That matters not just because of the inefficiencies introduced when the system surfaces unhelpful documents during a review, but because the lives of real people can be impacted by what’s uncovered—or left hidden—during the discovery process. Whether the wrong data is captured, the smoking gun is missed, or the documents served to reviewers first set a tone that introduces bias into their case strategy and decision-making, the downstream effects can be devastating.
Believing that commercially available models had not been trained with this nuance in mind, the Relativity team did the work of parsing through those 6,000 potentially biased terms piece by piece.
“After we went through that process, we weren’t done. We still had to measure the model,” Aron said. “So we took a data set released by researchers for this purpose, and ran it through our model using a technique called input verification—essentially, testing whether the model changed its mind based on key identifiers.”
The answer, he was proud to share, “was basically no. We’d seen models with changes of up to 80 percent based on protected classes; ours had almost no changes whatsoever. This is the model that’s powering sentiment analysis in RelativityOne today.”
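The “input verification” technique Aron describes—checking whether a model changes its mind when only a protected-class identifier changes—can be sketched as a counterfactual test. The keyword scorer below is a toy stand-in for a real sentiment model, not Relativity’s implementation:

```python
# Counterfactual check: a fair model's sentiment score should not move
# when only a protected-class identifier is swapped in otherwise
# identical text. The scorer is a toy keyword model for illustration.
NEGATIVE_WORDS = {"furious", "unacceptable", "terrible"}

def score(text):
    """Toy sentiment score: negated fraction of negative keywords."""
    words = [w.strip(".,").lower() for w in text.split()]
    return -sum(w in NEGATIVE_WORDS for w in words) / len(words)

def counterfactual_gap(template, identifiers):
    """Max score spread across identifier substitutions."""
    scores = [score(template.format(who=who)) for who in identifiers]
    return max(scores) - min(scores)

template = "The {who} manager said the delay was unacceptable."
gap = counterfactual_gap(template, ["American", "Nigerian", "French"])
print(gap)  # 0.0 — this scorer ignores the identifier entirely
```

A biased model would show a large gap on such swaps; a gap near zero, across many templates and identifiers, is the “basically no changes” result the team was after.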
Artificial intelligence, like human intelligence, is continuously evolving. And while the inevitability of change can inspire a certain uneasiness, it’s also a critical advantage of the technology. Static thinking, after all, will never be conducive to learning.
“The process of building a model is often framed as model lifecycle, which is not a linear but an iterative effort,” Roshanak told me. “An already completed element may need to be revisited based on its role in raising issues or producing particular outcomes in subsequent stages.”
Over time, data changes. Use cases evolve. AI itself develops to keep up. A good model must be fine-tuned accordingly to remain fit for purpose.
The team continues to work on innovating and improving sentiment analysis in RelativityOne, with hopes of adding support for more languages, data types, and emoji translation in the future.
Perhaps most importantly, the model undergoes twice-annual bias checks to ensure outputs maintain quality as the engine gets more use, and as customer expectations, inputs, experiences, and cultures change.
At Relativity Fest 2022, Dr. Timnit Gebru—a highly regarded voice on ethical AI, prominent researcher of algorithmic bias, and the founder and executive director of the Distributed AI Research Institute—gave attendees a reality check during her closing keynote.
“AI is not the Terminator. It’s not a sentient thing that you’re talking to—it’s an artifact, a tool, built by people and impacting people,” she said. “It can be controlled by us as people. So we must develop and deploy it responsibly.”
How do you leverage AI in your everyday work? If you’re using it in the context of document review, do you take its outputs as Unimpeachable Truth? Or do you seek to understand how it arrived at those outputs and why that matters to your work and the people impacted by it?
“My advice to non-technology experts is to identify the scenarios that may cause harm to people. They know their use case best; they know who will be affected by their work. So they are the ones who can identify such scenarios best,” Roshanak advises our readers. “Then, they should approach a technology provider and put the question forward: Is there a potential that the technology they are providing can create such scenarios? If so, keep the technology provider accountable for providing or helping to implement preventative and corrective measures.”
In vetting and deploying technology, the pursuit of efficiency and accuracy is key—but it isn’t enough.
“It’s important that technology providers and users be on the same page regarding the use case of a product. A product might not be harmful by itself if it’s used for its intended use case,” Roshanak reminded us. “But if it’s adopted for a use case which has not been foreseen throughout the development phase, it can cause harm and introduce risk.”
Responsibility is a conversation. Accountability requires partnership. Technology providers and customers should pursue, together, a thorough understanding of a tool’s fit for purpose before, during, and after its deployment.
And until the roads toward clearer standards on model transparency, the elimination of bias, and more discerning definitions for the purposes driving AI development are fully paved, the cautious application of emerging technology—especially at a systemic level—is crucial for minimizing the potential for harm caused by algorithmic bias. So is holding developers accountable for their work upstream of each application.
When complex, human systems use technology to power the pursuit of justice, as we do in the legal community, we are all responsible for ensuring equitable, fair outcomes. Justice, equity, fairness—these are the motivations for every process that software developers, lawyers, admins, and the courts undertake to support investigators, monitor legal matters, and bring the truth to light.
There should be no harm in that.
Many individuals and teams at Relativity contributed to the development of this story. Special thanks to Adam Childs, Aron Ahmadia, Brittany Roush, Dilan Dubey, Ivan Alfaro, Nathan Reff, Roshanak Omrani, Sarah Green, Blair Cohen, JC Steinbrunner, Josh McCausland, Kael Rose, Kasia Lewandowska, Max Barlow, Michael Tomasino, Tim Musho, and many others who played a part in building the technology and weaving this article together. Shout out, also, to DALL-E 2, for generating the comedy/tragedy masks used throughout.