On January 1, 2023, requirements under the California Privacy Rights Act (CPRA) took effect. The CPRA was an expansion of the California Consumer Privacy Act (CCPA) of 2018, the first comprehensive consumer privacy law passed by a U.S. state.
Even though the CPRA was an amendment to the CCPA, its rules and requirements were materially more significant than those of its predecessor. Key differences included a greater focus on data minimization and a new sub-category of regulated information: sensitive personal information (SPI), which covered additional types of information such as a consumer’s religious beliefs, genetic data, and, in certain circumstances, the contents of a consumer’s email and text communications.
A Profusion of Privacy Laws: Regulatory Zeal and Higher Expectations from Consumers
Despite being limited to residents of California, CPRA was received as the standard-bearer of privacy rights—the American equivalent to the EU’s General Data Protection Regulation (GDPR). Moreover, the landmark privacy law came on the heels of Apple’s 2021 privacy update—another milestone development which sparked a new cultural conversation on the importance of consumers’ data privacy.
Since then, close to a dozen other states, including Tennessee, Indiana, Texas, and Iowa, have passed similar comprehensive privacy acts. Internationally, recent years have brought increased regulatory momentum, with privacy laws being enacted around the world: Brazil’s LGPD, China’s Personal Information Protection Law (PIPL), and India’s Digital Personal Data Protection Act 2023.
Indeed, as consumers demand more commitments from organizations in safeguarding their data—and as the profusion of new privacy laws prompts greater regulatory zeal over missteps by businesses—legal, compliance, and privacy professionals are now under increasing pressure to stay on top of privacy risks in their organization's data pool.
Data Explosion: Challenges Presented by the Velocity, Volume, and Variety of Data
Against the backdrop of more stringent and exacting privacy laws, organizations are also dealing with more data than ever before. Confronted with the growing velocity, variety, and volume of data, organizations today may have more data than they can keep track of.
The data explosion shows no sign of slowing: according to Statista’s projections, the amount of new data created in 2025 will reach 180 zettabytes (up from 97 zettabytes in 2022). The surge in the adoption of online communication tools in the era of remote work is driving data volumes even higher. To complicate matters further, the complex, chaotic, and commingled nature of data is making it increasingly difficult for organizations to manage sensitive data deliberately.
For instance, personally identifiable information (PII), SPI, and personal health information (PHI) are often buried in large volumes of unstructured data, including communications, and can be extremely hard to identify and de-risk within the morass of general enterprise data.
Under these conditions and constraints, locating sensitive information such as PII, SPI, and PHI lying hidden in voluminous data presents obvious challenges. Indeed, whether it’s for document production in a legal matter, retrieving an individual’s information in response to a Data Subject Access Request (DSAR), or anonymizing PII for transferring into a different regulatory jurisdiction, finding and securing sensitive personal information has never been more difficult—even as the stakes have never been higher.
Traditional Approaches Lack Scalability & Accuracy
Prior to the digital age, redaction meant inking over or cutting out information from a document, and then scanning the documents back into computer systems—a process which, although woefully inefficient, is still practiced by some even today.
More modern approaches involve the use of redaction software that provides a user interface for searching through documents and making redactions. While reviewers can look for keywords or even black out every instance of specific words or phrases, this approach has some challenging drawbacks: not only does it take a lot of time to uncover and redact sensitive information hiding in tens of thousands of emails, but it is also easy to overlook instances of sensitive information that need redaction. For example, a search for the keyword “address” may miss typos like “adress” or “addres” made during data entry, or it could miss the occasional phone number stored in a hyphenated format among a sea of numbers stored in the standard parenthetical format: for example, 123-456-7890 instead of (123) 456-7890.
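To make the drawback concrete, here is a minimal Python sketch of the kind of rule-based search described above. It is an illustration only, not how any particular redaction product works: the sample records, the function names, and the regular expression are all hypothetical and chosen to mirror the examples in this article.

```python
import re

# Hypothetical sample records, invented for illustration.
records = [
    "Mailing address: 12 Elm St",
    "Home adress: 98 Oak Ave",        # typo made during data entry
    "Call (123) 456-7890 after 5pm",  # parenthetical phone format
    "Alt phone 123-456-7890",         # hyphenated phone format
]

def keyword_hits(lines, keyword):
    """Return the lines that contain the exact keyword, ignoring case."""
    return [line for line in lines if keyword.lower() in line.lower()]

# A narrow rule that only recognizes the parenthetical phone format.
PAREN_PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def phone_hits(lines, pattern=PAREN_PHONE):
    """Return the lines in which the phone pattern matches somewhere."""
    return [line for line in lines if pattern.search(line)]

address_found = keyword_hits(records, "address")
phones_found = phone_hits(records)

# Records that neither rule flagged for redaction.
missed = [r for r in records if r not in address_found and r not in phones_found]
print(missed)
```

In this toy data set, the misspelled “adress” record and the hyphenated phone number both slip past the rules. Each gap can be patched with another rule, but real communications data tends to produce new variations faster than rules can be written for them.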
The simple fact is that human-generated data can be far too unstructured and noisy for rule-based programs to handle alone. Organizations and law firms may therefore feel left with a choice of one of two available options: using helpful but sometimes imprecise software to complete the job, or bearing the massive costs—not to mention human errors—associated with hiring armies of contract reviewers who can perform manual redactions.
But it doesn’t have to be that way. Adding AI to the toolbox can help in a big way.
Using AI to Identify and Redact PI with Precision and at Scale
Unstructured data, such as emails, text documents, instant messages, photos, audio files, and other types of unlabeled data, is difficult to analyze accurately using rule-based applications alone.
Recent advances in artificial intelligence have demonstrated the technology’s potential in analyzing massive volumes of data, particularly unstructured information. For computationally intensive tasks involving large amounts of data, AI can outperform humans on accuracy and reduce the risks associated with human error.
To learn more about how AI can be used to automate the identification and redaction of PI, check out our white paper: "AI for PI: Find and Redact Personal Information with AI-Powered Workflows."
Graphics for this article were created by Kael Rose.