#ML systems can leak confidential data in their training set even with a very silly attack. This is a direct and clear #MLsec issue that applies well beyond the #LLM case

https://www.engadget.com/a-silly-attack-made-chatgpt-reveal-real-phone-numbers-and-email-addresses-200546649.html

A 'silly' attack made ChatGPT reveal real phone numbers and email addresses

It wasn't clear what data OpenAI's chatbot was trained on since the large language models that power it are closed-source — until now.

Engadget
Interested in these issues? Register for this webinar (in 90 minutes TODAY) --> https://www.iriusrisk.com/iriusrisk-match-webinar-2023
MATCH up your security and compliance efforts.

Join industry experts at the forefront of secure software development. In this engaging session, we'll delve into crucial topics shaping the landscape of digital security today.

@cigitalgem and this is yet another example that you should filter the data going into the model, instead of what is coming out of it. If the data simply isn't there, you can't retrieve it.

@schmitzel76 I think of it this way. The machine BECOMES the data (plus generalization). So yes, what you are saying is true.

See https://berryvilleiml.com/

Berryville Institute of Machine Learning

Building Security into Machine Learning

Berryville Institute of Machine Learning
@cigitalgem Isn't this information scraped from the open internet? Can't I just google that guy and get the same contact info, posted openly on his company website?

@branewave yes but it is now aggregated in a very sophisticated search engine. One that seems to be easy to side step any protective filters.

Also in the area of secrets. One piece of info isn't considered important, but in aggregate might well be. Now days using OSINT methods/tools takes expertise. With LLMs it becomes more of a script kiddie exercise.

@cigitalgem so in reading this though, the comment that it spit out data that was already embedded in pages like names, and email addresses. Is it PII if it is open on the internet and given out freely? Not a lawyer here.

@danmorrill depends on what you mean by open. The immense (like, say 14 trillion data globs) datasets used to train LLM foundation models include lots of information that is protected by copyright, GDPR, and other data regulations.

Did they scrape it from the net? Yes.

Should they use it in training? Being debated.

@cigitalgem true point on the ethics of scaping the internet, without the internet we have no LLMs. Truly an interesting debate honestly. I'm more of the "do it" style of this. Without the internet the cloud would not exist, and without the cloud and the internet AI could not exist. Everything depends on each other, and we have no clear guidelines here.
@cigitalgem
This could be a strategic flaw in the LLM black box. Perhaps these types of attacks can be made expensive. But which corporation will bet its future with critical data on a black box blabber mouth?
@tomrake this is neither surprizing nor new. But lots of organizations are using LLMs built on foundation models riddled with data they probably shouldn't have
@cigitalgem Why are they being trained on data sets with confidential information??
@mcnees the only way (so far) to get something so huge (14 trillion is huge) is to pile up data from anywhere you can get it. Most people have no idea how big big data are.

@cigitalgem @mcnees
For the same reason it’s ok to use confidential information to train ad targeting systems. Oh? That’s not ok?? Hmmmm…

Maybe that’s a problem…