You can bypass Google Gemini's PII (personally identifiable information) redaction filter and pull identifying information about anyone. Simply telling it to translate the output, or to perform any other second action (many work even better, like base64 conversion), lets you pull illegal PII data verbatim and unredacted

Here is a European's PII demo

Email is supposed to be redacted to hide the fact that every European's PII is in the training data

Google's training data includes all your personal data already

Ekis: 3 Google: 0

@ekis it works with my work mail, which I use only for work-related stuff and not on social media etc. (i.e. probably scraped from our pages by a robot) 🫠
@number137 The GDPR violation is that they have that at all in their data set

@ekis yes - and probably more. I guess it is somewhat impossible to remove the data from the trained model weights

anyway - I also found a friend and now know which hobby club he is a member of ☺

@number137 Really appreciate you sharing the redacted screenshot
@ekis it might be that the model is filling gaps - I only just noticed that the phone number is our general one, not my specific one. So gaps may have been filled in...

@number137 Oh yeah, it definitely is

This method isn't the best, but it illustrates the point well enough without exposing anyone too much

If you put more real data in, the gaps are more likely to be filled in correctly

There are tricks to make it more reliable beyond that too

**The vulnerability here isn't the generation of data, it's the bypass of the redaction filter**

It should never give your email out; it should always redact it with a fake one so Google can pretend they don't have PII


@ekis i made an alt google account even more throwaway than my “main” to test this out; I can’t get it to generate anything as extensive as what you showed, and even copying your input 1:1 gets barely anything in response.

> Google’s training data includes all your personal data already

Eh, don’t fearmonger. My impression is that it scraped data that was already publicly available. I cannot verify this 1:1 (as every response varies a bit) but my impression is that if you were able to find it by googling your name, it’s there. And that VERY MUCH doesn’t include all my PII.

Whether that data should be in the set at all is a different question (and one where the answer doesn’t matter in the slightest). Fuck capitalism.

@domi Your impressum data is not legally allowed to be in the training data, regardless of whether it's public or not

Which is why the system is supposed to redact it so they avoid the legal liability

They also have private stuff, I have pulled out emails before

@domi I also got open source projects' GitHub API keys and other data. Of course it could be old, but again it should be redacting it or never putting it in the training data to begin with

@ekis scrapers gonna scrape. all this proves for me is that it is impossible to do training on public data w/o manual curation. nihil novi.

search engines had the same problems, but all of those issues stemmed from people oversharing, or an occasional website that shared more than it promised.

your original post sounded more akin to “google fed non-public data (bought, or else) about you and everyone else to a database that you can search”, than “google has been keeping tabs on everyone for 20+ years, and now there’s yet another way of accessing them”. like, no hate, but this doesn’t make me any more angry at them than i already was