@schmitzel76 I think of it this way. The machine BECOMES the data (plus generalization). So yes, what you are saying is true.
@branewave yes, but it is now aggregated in a very sophisticated search engine. One whose protective filters seem easy to sidestep.
Same goes for secrets. One piece of info isn't considered important on its own, but in aggregate it might well be. Nowadays, using OSINT methods/tools takes expertise. With LLMs it becomes more of a script kiddie exercise.
@danmorrill depends on what you mean by open. The immense datasets (like, say, 14 trillion data globs) used to train LLM foundation models include lots of information that is protected by copyright, GDPR, and other data regulations.
Did they scrape it from the net? Yes.
Should they use it in training? Being debated.
@cigitalgem @mcnees
For the same reason it’s ok to use confidential information to train ad targeting systems. Oh? That’s not ok?? Hmmmm…
Maybe that’s a problem…