
00:00
Adam Stofsky
Hi, Shannon.
00:11
Shannon Yavorsky
Hey, Adam. How's it going?
00:14
Adam Stofsky
Good, very good. Because today we're going to talk about the intersection of two topics that I know are near and dear to your heart: AI and personal data and data privacy. I know you've been a privacy lawyer for a number of years, so I wanted to ask you about what is really a pretty huge topic: how do the worlds of data privacy law, customer data, and personal data intersect with AI and this coming wave of AI regulation? What are the main issues that come up around the use of personal data in the use and creation of generative AI tools specifically?
00:54
Shannon Yavorsky
Yeah, so I think there are really two sides to this. The first is using personal data to train a model, and lots of issues come up there. So for example, have you provided notice to consumers, to individuals, that their data was going to be used to train the model? And there are issues that come up there because it doesn't really mesh with general privacy law. Privacy law gives consumers rights to delete their data or correct their data. But once it's been used to train a model, there are arguments around, you know, technologies to untrain a model, but it becomes very difficult to extract personal data from a model.
01:42
Shannon Yavorsky
So we have lots of clients who are thinking through what are the issues we have to tackle, the I's we have to dot and the T's we have to cross, in order to use personal data to train or fine-tune a model. And there are lots of sticky ones there around, you know, where the data has come from and what you've said to consumers about how their data is going to be used, to make sure that you're complying with the privacy laws.
02:13
Adam Stofsky
So that's for the actual makers of AI models, AI technology. What about everyone else, the people who use these models?
02:22
Shannon Yavorsky
So there are lots of issues there too, on the other side, where you're a company that's onboarding AI tools, and there are tons being onboarded right now: GitHub Copilot, text-decoding tools, lots of tools for hiring in companies. And one of the things that a bunch of these AI tools will offer is a data processing agreement, where they'll say, here's our master services agreement, but we're also going to give you a DPA, a data processing agreement, that will say we won't use any of your data to train our model. And I think this is where people get tripped up a little bit: the versions that are widely available on the Internet, like the consumer version where you just click to accept the terms of use.
03:14
Shannon Yavorsky
There's no fee. They're using that data to train their models; the privacy notice says, we're going to use any of your input to train our model. Whereas the enterprise agreement that the company is entering into with your OpenAIs, your Anthropics, those are the ones that will offer the data processing agreement saying that anything put into the prompt, for example, is not going to be used to train the model.
03:40
Adam Stofsky
So is there like a chain of these data processing agreements, where I am building a tool that uses Anthropic or OpenAI? So we have a DPA where they promise us, hey, we're not going to use any of your customer data to train our model. And then I can go tell my customers, hey, we've been promised that they're not going to use your data to train their model. Is there kind of a chain of these, and is that how it works, ultimately, to get to the consumer?
04:06
Shannon Yavorsky
Yeah, that's exactly right. You have to flow down those obligations, and companies flow them down into their own master services agreements and their own data processing agreements.
04:17
Adam Stofsky
So as a follow-up, just a curious question. You hear about these big large language models, and they say they're trained on the whole Internet and all of human knowledge. Doesn't that include a lot of personal data that we never consented to allow them to use?
04:38
Shannon Yavorsky
So there's publicly available data, and this is a huge question with training AI models, with regular, non-personal information as well as personal data. And to your point, people just scraped the entire Internet as organizations rushed to build the largest large language models. They're only as good as the data they learn from, and that data often contains details that can be linked directly or indirectly to real people. Even something like clickstream data or location patterns or chat logs can fall within the definition of personal data under privacy laws. So technically, developing or fine-tuning an AI model using that data is a regulated data processing activity under applicable legislation. It's one of those things where the regulators have been like, this is happening, and we're going to start to look into it.
05:49
Shannon Yavorsky
But the technology has just outpaced the law. And so regulators have had an issue catching up, where we're going with legislation versus where we're going with technology.
06:03
Adam Stofsky
So if you're someone who just works at a company that handles data and you use some AI tools, maybe you have some data processing agreements with your own customers or with the tools that you use, are there some rules of thumb, some best practices, about how to use or not use personal data in these tools? Maybe an extreme example is: don't grab your entire customer list and drop it into a free OpenAI account and ask it to summarize it. Right? I mean, that's an extreme example. Things like that. Like, what are some...
06:38
Shannon Yavorsky
That would be rule number one, don't do that.
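A rough sketch of that "rule number one" in practice: scrubbing obvious identifiers before text ever leaves your systems. This is purely illustrative, with made-up regexes and placeholder labels; a real deployment would lean on a proper DLP or entity-recognition tool, and note that plain names slip right past a regex.

```python
import re

# Illustrative only: naive regex scrubbing is not a substitute for an
# enterprise agreement (DPA) or a real data-loss-prevention tool.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace obvious personal identifiers with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Summarize: Jane Doe, jane@example.com, 555-123-4567, SSN 123-45-6789"
print(scrub(prompt))
# → Summarize: Jane Doe, [EMAIL], [PHONE], SSN [SSN]
# Note: the bare name "Jane Doe" survives — catching names needs NER, not regex.
```

The point is less the specific patterns than the ordering: redaction happens on your side, before anything reaches a consumer-tier tool whose terms allow training on inputs.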
06:42
Adam Stofsky
So what are some more subtle best practices?
06:45
Shannon Yavorsky
So some other best practices. I think we go back to core privacy principles. Transparency: telling people in your privacy notice that their data is going to be used to train an AI model. Purpose limitation, another core privacy principle: data used for one thing can't then be used for another. So if you tell people, we're only going to use this data to provide the service, but then you go on and use it to develop a model, you're arguably outside the parameters of purpose limitation. And then there's data minimization and proportionality. Regulators expect companies to justify why each category of personal data was collected or necessary, and more isn't always better. Collecting or retaining unnecessary information creates a lot of risk, because that data can be inaccurate and you're, you know, storing it on your systems.
07:44
Shannon Yavorsky
And then there's that question we talked about a little bit, which is the question of individual rights: access, deletion, correction. Once personal data is embedded in a model, how do you respond to a deletion request? It's an area that regulators are actively seeking guidance on, because it just doesn't fit very well. A lot of companies are trying to mitigate risk by de-identifying data. I get a lot of questions about anonymization right now, and the main reason is that companies want to take data sets they already have, anonymize that data, and then use it to train their models. So there are so many questions coming up about the right way to de-identify data.
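As a rough illustration of the de-identification step described above, here is a minimal pseudonymization sketch: direct identifiers are replaced with salted one-way hashes while analytically useful fields are kept. The field names and salt handling are hypothetical, and under most privacy laws this alone is not full anonymization, since quasi-identifiers (zip code, birth date, and so on) can still re-identify people.

```python
import hashlib

# Assumption: the salt is stored and rotated separately from the dataset.
SALT = b"store-me-separately-from-the-data"

def pseudonymize(record: dict, direct_identifiers: set[str]) -> dict:
    """Replace direct identifiers with stable, salted one-way hash tokens."""
    out = {}
    for key, value in record.items():
        if key in direct_identifiers:
            # Same input always yields the same token, so joins still work,
            # but the original value can't be read back out of the dataset.
            out[key] = hashlib.sha256(SALT + str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

row = {"email": "jane@example.com", "name": "Jane Doe", "plan": "pro", "tickets": 4}
clean = pseudonymize(row, {"email", "name"})
# "plan" and "tickets" survive for model training; "email"/"name" become tokens.
```

The design trade-off is the one Shannon flags: keeping tokens stable preserves utility, but it also means the data is pseudonymized rather than truly anonymized, which most privacy laws treat very differently.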
08:32
Adam Stofsky
All right, Shannon, thank you so much for this, I'll say, very high-level overview of the relationship between data protection laws and privacy and generative AI. Thanks so much.
08:43
Shannon Yavorsky
Thanks, Adam.
<div style="padding:56.25% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1159873736?badge=0&autopause=0&player_id=0&app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Shannon Yavorsky - GenAI and Personal Data"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
