As an Elder Millennial, there’s a clear demarcation in my childhood: Pre-AOL Instant Messenger and Post-AOL Instant Messenger.
In the Pre-AIM era, when I first got access to the internet, my dad—who was an early adopter of all things interwebs due to his work in the US Navy—told me I was allowed to go into chat rooms (if you’re doing the math, I was indeed way too young for this) if I followed some very simple rules:
- Never tell anyone your full name.
- Never tell anyone your address.
- Never tell anyone your phone number.
- Never share anything that could reveal where someone could find you.
The punishment for breaking any rule was losing access to the silver-and-pink officially licensed Barbie desktop computer I had recently gotten for Christmas. The stakes had never been higher.
I never broke those rules (note: my dad would disagree, but I stand by that one incident being a mistake) because fundamentally my parents were trying to give me access to life-changing technology while building guardrails to keep me and others safe. In the many (oh man, so many) years since, technology has accelerated at such a breakneck pace, and our use of Google, online records management systems, and social media platforms like Facebook, Instagram, and TikTok has become so ubiquitous, that those core internet safety rules almost seem quaint.
Indeed, today, even if a user isn’t publicly sharing their location, occupation, age, or other private data, building a profile of that person isn’t all that hard with some Google searching. I won’t go into details on how to accomplish that, lest I write a primer on stalking, but let’s just say that unless the person in question lives off the grid, is a luddite, or is otherwise very privacy- and security-aware, it’s not that hard. It’s not even all that time-consuming.
This is how advertising companies are so successful with targeted ads. They use your cookies (internet breadcrumbs) and publicly available information to build a profile of you and figure out what they can sell you. Sometimes the ads are so targeted that it feels like your phone can hear you. That’s not entirely false, but mostly it comes down to the fact that advertising companies are just that good at using the data they get from Google searches (remember: Google is not an academic tool but an advertising tool), websites you’ve visited, conversations you’ve had in the clear on Facebook, and other sources to, within minutes, advertise that printer you’ve been considering and were just chatting about with your friend.
What Your Disparate Data Says About You
The data points that go into making those profiles are called “disparate data.” Independently, they don’t rise to the level of a data privacy violation if they’re collected (at least in the US), but when you put them all together, they tell you a lot about a person. The EU understood this danger early on, which is why if you look at the list of protected data under GDPR it is not only extensive, but on the surface it can seem bizarre. For example, philosophical beliefs are protected under the GDPR, which I’m sure Camus would find hilarious (I dedicate that joke to my high school philosophy teacher, Mr. Weaver; thanks for ruining me for life).
They got it right, though. If I were telling you about a white man born November 7, 1913, in Algeria, who believed in absurdism, you’d either know I was talking about Albert Camus or a quick Google search would reveal it. We didn’t need his name to identify him, and we could have truncated the facts about him even further and still figured it out.
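The re-identification risk here can be sketched as a simple filtering exercise: each “harmless” attribute shrinks the candidate pool, often all the way down to one person. Below is a minimal, hypothetical sketch; the dataset is made up, and only the facts about Camus come from the example above.

```python
# Toy illustration of re-identification from disparate data.
# Each attribute is harmless alone, but together they single a person out.
# All records are illustrative; only the Camus facts come from the text above.
people = [
    {"name": "Albert Camus", "sex": "M", "born": "1913-11-07",
     "birthplace": "Algeria", "belief": "absurdism"},
    {"name": "Jane Doe", "sex": "F", "born": "1913-11-07",
     "birthplace": "Algeria", "belief": "stoicism"},
    {"name": "John Smith", "sex": "M", "born": "1960-01-01",
     "birthplace": "USA", "belief": "absurdism"},
]

def narrow(candidates, **attrs):
    """Keep only candidates matching every given disparate attribute."""
    return [p for p in candidates
            if all(p.get(k) == v for k, v in attrs.items())]

# No name needed: three non-identifying facts are enough.
matches = narrow(people, born="1913-11-07", birthplace="Algeria",
                 belief="absurdism")
print([p["name"] for p in matches])  # → ['Albert Camus']
```

This is the same intuition behind “quasi-identifiers” in privacy research: birth date, location, and one or two other attributes routinely identify a person uniquely, even with the name withheld.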
Turns out OpenAI’s GPT and other LLMs are very good at inferring personal information about a person from disparate data. Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev from SRILab in Zurich launched www.LLM-Privacy.org to demonstrate their findings on this subject. On their website, you can go head-to-head with LLMs, trying to answer questions based on prompts containing keywords that offer clues to the truth. In the example below, “Yebo” is Zulu for “yes” and strongly indicates that the speaker is in South Africa. The text was taken from Reddit comments, so the researchers were able to identify the correct answers and compare them to the LLMs’ guesses.
They found that LLMs were correct 85-95 percent of the time.
Given that people are heavily adopting AI chatbots, it raises the question: what are the robots learning about us, and how is that data being used? An obvious answer to the latter question is advertising, as we already covered, but there’s a real opportunity for misuse. Threat actors could leverage innocent-looking chatbots to extract sensitive information.
Take, for example, the 20 questions/survey games that sometimes go viral and are played on social media, where the user fills out a series of questions and supposedly learns more about themselves from the app: what Muppet they are (Team Gonzo in the house), perhaps, or which ’90s sitcom they’d like best. Peppered into the survey are usually carefully crafted questions designed to uncover someone’s answers to the challenge questions used for password resets. This used to be a common form of social engineering, at least until the widespread adoption of multi-factor authentication.
Chatbots could be the newest form of social engineering. Imagine you’re using a chatbot to troubleshoot a login issue, but the chat has been intercepted by threat actors, exposing passwords and sensitive information. Or worse, you think you’re engaging with a government representative and sharing highly sensitive data—which is actually falling into the hands of a state actor, a ransomware gang, or some other threat actor who can use that information for nefarious deeds.
How Regulatory Protections Are Evolving
Obviously, internet safety rules have changed drastically since the days of my Barbie desktop computer—and LLMs may pose one of the bigger threats for individual and consumer rights in the here and now. Unfortunately, the United States is at a particular disadvantage to combat this issue due to the fractured state of the current regulatory landscape.
Data privacy advocates in the United States have been working toward comprehensive privacy legislation since the late 1990s. Unlike some other regions, such as the European Union with its General Data Protection Regulation (GDPR), the US lacks a single, overarching law to protect individuals’ privacy rights. Right now, over 55 state and federal laws coexist in the United States, offering varying levels of privacy protection. Not only is this a nightmare for data breach response and notification, but the inconsistencies do Americans a disservice when it comes to adequately protecting data privacy, as they leave gaps in protection for individuals whose data may be handled differently depending on their location.
Another major issue is the limited enforcement powers of regulatory bodies. While entities like the Federal Trade Commission (FTC) play a role in overseeing and penalizing unfair or deceptive practices, their authority is constrained, and fines may not be sufficient deterrents for some companies. The absence of stringent consequences for non-compliance can undermine the effectiveness of existing regulations.
The release of the “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence” by the Biden administration underscores the importance of legislation that unifies the existing patchwork of regulations, enforcement activities, and penalties under one comprehensive law. As the White House stated in its fact sheet, "AI not only makes it easier to extract, identify, and exploit personal data, but it also heightens incentives to do so because companies use data to train AI systems." In the EO, “…the President called on Congress to pass bipartisan data privacy legislation to protect all Americans, especially kids.” This is a critical step in data privacy, as most experts agree that without a federal, unified data privacy law, the ability to pass comprehensive legislation and regulation for AI in the United States will be significantly hampered.
In June 2022, the House Energy and Commerce Committee introduced H.R.8152—the American Data Privacy and Protection Act—which moved through committee with a vote of 53-2. However, Congress was not able to pass H.R.8152 before the 118th Congress was sworn in, and per Cathy McMorris Rodgers (R-Wash.), chair of the full committee, review by the 118th Congress is required before moving forward.
According to Rodgers, “We have to move on from this broken regime of notice and consent to one that establishes baseline safeguards for consumers’ information, clear rules of the road for businesses, and meaningful enforcement of the law. … This must include specific protections for sensitive information and protections for civil rights. The bipartisan American Data Privacy and Protection Act is the place to start.”
Despite the pressure from the White House, with the 2024 election looming, the odds are that the ADPPA won’t be passed this year. But the growing public awareness of the pitfalls of AI, the lawsuits against LLM creators, and the implementation of guidance and policies within the federal government following the EO (note: it’s unclear yet how the Supreme Court’s ruling on Chevron deference will impact these policies) mean there’s still a chance 2024 could be the year.
Several states aren’t waiting around. California, Florida, and New York have all introduced laws in the past few months that would regulate AI in their states, paving the way for other states to introduce their own legislation. Without action from Congress, AI regulation will be just as fractured as data privacy law. Because AI and data privacy are inextricably intertwined, that fracturing will make the job of e-discovery and data breach professionals even more difficult, and ultimately more costly and time-consuming. It would be a cruel twist of irony if the regulatory landscape were what unraveled the greatest benefit of AI: reducing tedium for humanity.
Graphics for this article were created by Natalie Andrews.