Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

404 Media
And just the other day I had people arguing to me that it simply wasn’t possible for ChatGPT to contain significant portions of copyrighted work in its database.
Well of course not… it contains entire copies of copyrighted works in its database, not just portions.

The important distinction is that this “database” would be the training data, which it only has access to during training. It does not have access once it is actually deployed and running.

It is easy to think of it like a human taking a test. You are allowed to read your textbooks as much as you want while you study, but once you actually start the test you can only go off of what you remember. If would require you to have a perfectly photographic memory (or in the case of ChatGPT, terabytes upon terabytes of RAM) to be able to perfectly remember the entirety of your textbooks.

ChatGPT is a large language model. The model contains word relationships - a nebulous collection of rules for stringing word together. The model does not contain information. In order for ChatGPT to answer flexibly answer questions, it must have access to information for reference - information that it can index, tag and sort for keywords.
I’m honestly not sure what you’re trying to say here. If by “it must have access to information for reference” you mean while it is running, it doesn’t. Like I said that information is only available during training.

Like I said that information is only available during training.

This is not correct. I understand how neural networks function, I also understand that the neural network is not a complete system in itself. In order to be useful, the model is connected to other things, including a source of reference information. For instance, earlier this year ChatGPT was connected to the internet so that it could respond to queries with more up-to-date information. At that point, the neural network was frozen. It was not being actively trained on the internet, it was just connected to it for the sake of completing search queries.