Mastodawn

#ML systems can leak confidential data in their training set even with a very silly attack. This is a direct and clear #MLsec issue that applies well beyond the #LLM case

https://www.engadget.com/a-silly-attack-made-chatgpt-reveal-real-phone-numbers-and-email-addresses-200546649.html

A 'silly' attack made ChatGPT reveal real phone numbers and email addresses

It wasn't clear what data OpenAI's chatbot was trained on since the large language models that power it are closed-source — until now.

Engadget

Show thread

Gary McGraw Dec 5, 2023

Interested in these issues? Register for this webinar (in 90 minutes TODAY) --> https://www.iriusrisk.com/iriusrisk-match-webinar-2023

MATCH up your security and compliance efforts.

Join industry experts at the forefront of secure software development. In this engaging session, we'll delve into crucial topics shaping the landscape of digital security today.

Show thread

Patrick Schmitz Dec 3, 2023

@cigitalgem and this is yet another example that you should filter the data going into the model, instead of what is coming out of it. If the data simply isn't there, you can't retrieve it.

Show thread

Gary McGraw Dec 3, 2023

@schmitzel76 I think of it this way. The machine BECOMES the data (plus generalization). So yes, what you are saying is true.

See https://berryvilleiml.com/

Berryville Institute of Machine Learning

Building Security into Machine Learning

Berryville Institute of Machine Learning

Show thread

branewave Dec 3, 2023

@cigitalgem Isn't this information scraped from the open internet? Can't I just google that guy and get the same contact info, posted openly on his company website?

Show thread

Theodrake Dec 3, 2023

@branewave yes but it is now aggregated in a very sophisticated search engine. One that seems to be easy to side step any protective filters.

Also in the area of secrets. One piece of info isn't considered important, but in aggregate might well be. Now days using OSINT methods/tools takes expertise. With LLMs it becomes more of a script kiddie exercise.

Show thread

DanM22 Dec 3, 2023

@cigitalgem so in reading this though, the comment that it spit out data that was already embedded in pages like names, and email addresses. Is it PII if it is open on the internet and given out freely? Not a lawyer here.

Show thread

Gary McGraw Dec 3, 2023

@danmorrill depends on what you mean by open. The immense (like, say 14 trillion data globs) datasets used to train LLM foundation models include lots of information that is protected by copyright, GDPR, and other data regulations.

Did they scrape it from the net? Yes.

Should they use it in training? Being debated.

Show thread

DanM22 Dec 3, 2023

@cigitalgem true point on the ethics of scaping the internet, without the internet we have no LLMs. Truly an interesting debate honestly. I'm more of the "do it" style of this. Without the internet the cloud would not exist, and without the cloud and the internet AI could not exist. Everything depends on each other, and we have no clear guidelines here.

Show thread

Tom Rake Dec 3, 2023

@cigitalgem
This could be a strategic flaw in the LLM black box. Perhaps these types of attacks can be made expensive. But which corporation will bet its future with critical data on a black box blabber mouth?

Show thread

Gary McGraw Dec 3, 2023

@tomrake this is neither surprizing nor new. But lots of organizations are using LLMs built on foundation models riddled with data they probably shouldn't have

Show thread

Robert McNees Dec 3, 2023

@cigitalgem Why are they being trained on data sets with confidential information??

Show thread

Gary McGraw Dec 3, 2023

@mcnees the only way (so far) to get something so huge (14 trillion is huge) is to pile up data from anywhere you can get it. Most people have no idea how big big data are.

Show thread

Hippasus500 aka jwn2 Dec 3, 2023

@cigitalgem @mcnees
For the same reason it’s ok to use confidential information to train ad targeting systems. Oh? That’s not ok?? Hmmmm…

Maybe that’s a problem…