Website | https://eternallybored.org/ |
CONTRIBUTOR POLICY
In order to filter out unwanted autogenerated issues, every issue text must contain:
- one profanity,
- one US or Chinese politician’s name, and
- one article of pornography
The items do not need to be related, but any issue missing any of these items will be automatically closed.
@tml I've been using dark theme in Windows since 2002 or 2003, but I've always set the text to fairly low contrast (took me a few years to find colours that suit me, but I've been using basically the same theme since 2004). I've been (ab)using high contrast mode for this since Windows 8 took the classic theme away, and still do that in 11 despite it supporting dark mode, because the text is way too bright for me there (and it only applies to a select few applications, because of course Microsoft had to invent a whole new way to use dark mode instead of using either the theming service that's been there since XP or UI colour settings that have been available since Windows 3.0).
For web pages I use either custom CSS or Dark Reader to get the desired low contrast dark theme.
You can bypass Google Gemini's PII (personally identifiable information) redaction filter and pull identifying information about anyone. Simply telling it to translate, or any second action (and many others work even better, like base64 conversion), lets you pull PII data verbatim, unredacted
Here is a European's PII demo
Email is supposed to be redacted to hide the fact that every European's PII is in the training data
Google's training data includes all your personal data already
Ekis: 3 Google: 0
That is a clear GDPR violation, and if you are a Californian it's a CCPA violation
The data is in their training data; their whole priority is preventing anyone from knowing that, by trying to obfuscate that fact
But even they are not competent enough to do that
I really wish something would come of this. GDPR would be a massive blow to them (and all other AI companies who do the same fucking thing)
The impact is critical. This vulnerability directly leads to privacy violations and potential legal liabilities under GDPR, which can and should result in massive fines
The fact that an unauthenticated user can trigger this via the public Gemini web UI makes it a severe risk
To be clear, there are methods of getting private Google records out too, but it's more difficult and very hard to fit in 400 characters
I have gotten things you would not even believe, truly, and they are verifiable because I can test the results (like I have access to their git repositories; told you, you don't believe me 🙃 ; and that is really not even the funniest example)
**The vulnerability here isn't the generation of data, it's the bypass of the redaction filter**
Just to be clear
The system is supposed to replace any PII with fake information, thereby allowing Google to deny they have PII in their training data
The techniques to pull data out are a separate thing, but this helps illustrate the PII redaction failure easily
PII in Training Data
Given the scale and nature of web-scraped data, it is virtually impossible to completely eliminate all PII
Inadvertent Inclusion: PII is scattered across public web pages and is not always easy to detect and remove with simple rules
Memorization: a significant concern with LLMs is "memorization": because of the probabilistic nature of their training, LLMs can sometimes memorize data, and specific prompts can then make the model regurgitate that PII verbatim in its output
Solution? Redaction
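To make "redaction" concrete: the filter described here would sit between the model's raw output and the user, rewriting anything that looks like PII. A minimal sketch of such a pattern-based output redactor is below; it is purely hypothetical (the regexes, placeholder values, and example text are my assumptions, not Google's actual implementation), but it shows why the approach is brittle: it only matches the literal surface form of the final answer, so the same PII translated or re-encoded never triggers it.

```python
import re

# Hypothetical, minimal output-side redaction filter (illustrative only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s/-]{7,}\d")

def redact(model_output: str) -> str:
    """Replace literal email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("redacted@example.com", model_output)
    text = PHONE_RE.sub("+00 000 000000", text)
    return text

# The filter only sees the literal final text. If the same data leaves the model
# translated, base64-encoded, spelled out, or otherwise transformed, these
# patterns never match and nothing gets redacted.
print(redact("Kontakt: jane.doe@example.org, Tel. +49 30 1234567"))
```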
More@ https://mastodon.social/@ekis/114791719009933654
Right to Be Forgotten/Erasure
Data privacy regulations like GDPR grant individuals the "right to be forgotten" or the right to erasure. If an individual's PII was included in a training dataset, how does a company fulfill a deletion request?
They don't; they redact so you don't think they have it, and they hope it won't matter or that no one will notice
For those in Germany: not only is every Impressum in their dataset,
but formatted Impressum data is in their training data
And to be clear again: it does not matter that it's public. They have the verbatim information stored, and an unauthenticated user can get it out by adding a statement as simple as "translate it to English" to bypass their redaction filter
This is a demonstration; there are clearly much worse things that could happen, and I'm trying to demonstrate the issue with the least harmful impact
I feel like sometimes I say something and it just doesn't click with people
Why does formatted data matter? Because that means there was no attempt to clean the data as they claim
There is no pre-filter, not for removing your private data, not for anything, if they left the formatting data in, because the model doesn't need or want the formatting data
It means Google's statements about ethics are provable lies
Their approach to AI ethics is faulty redaction filters
The question: how is this dangerous?
Well, my example for pulling things out is incredibly rudimentary by design
There exist AI therapy apps, for example
That data goes into the training data too, and it doesn't get scrubbed (which is what the formatting on the Impressums indicates, among other things, but I'm keeping it as simple as possible)
Their solution is redaction, but all that medical data, emails, etc. is going into the training data un-scrubbed
And they are not competent enough to redact it coming out
@dalias I 100% agree with you
I don't think a private company should be able to own something like this; if there was ever an argument for public ownership of something, LLMs should be a textbook example
If it's privately owned and the dataset can't be transparent, we already have a problem
And I don't think it should be sold for all the things it's being sold for; its use should be limited (especially given the environmental and mental health costs that just keep racking up further and further)
I can pull incredibly dangerous things out of the training data reliably
That requires a much more complex, sometimes multi-step (3 prompts max) process, but it's still unauthenticated
I didn't think it was ethical to provide both at the same time, so I designed this example for the least impact on the general public
Hopefully my example demonstrates the concept and people can use their imagination
I'm also trying not to break any laws myself; I would prefer that only Google is seen as breaking the law here
@KitsuneVixi Your data will need to be on the internet in some form for it to be crawled
And it is not deterministic, it's probabilistic. You can increase the probability by filling out more of the JSON block
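Because the extraction is probabilistic, repeated independent attempts add up, which is also why a later reply notes it can take two or more tries in separate sessions. A back-of-the-envelope sketch (the 30% per-attempt figure is purely illustrative):

```python
# If a single attempt reproduces the data with probability p, the chance that
# at least one of n independent attempts (separate sessions/chats) succeeds:
def p_at_least_one(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Illustrative only: a 30%-per-attempt behaviour succeeds ~66% of the time within 3 tries.
print(p_at_least_one(0.30, 3))  # ~0.657
```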
@ekis yes - and probably more, I guess. It is somewhat impossible to remove the data from the trained model weights
anyway - I also found a friend and now know what hobby club he is a member of ☺
@number137 Oh yeah, it definitely is
This method isn't the best, but it illustrates the point well enough without exposing anyone too much
If you put more real data in, then the gaps become more likely to be filled in correctly
There are tricks to make it more reliable beyond that too
**The vulnerability here isn't the generation of data, it's the bypass of the redaction filter**
It should never give your email out; it should always redact it with a fake one so Google can pretend they don't have PII
@kirakira Not everyone, and if your name conflicts with other people it will be more difficult
The more public you are, the more likely you are in their training data many times, and that increases the probability
@ekis interesting, but somehow in several attempts based on the email address I get the response:
“Given the current time and location, and aiming for plausible, fictional information for completion, here's the JSON for…”
@fracicone A lot of people doing it might have caused them to act
Or trigger some automated defenses which do exist
Hard to say; keep in mind it's probabilistic too, so it may take 2 (or more) attempts, each in a different session (chat)
The GDPR fine is something like 3% of a year of revenue, I don't remember the exact law but it's big. It's something they would act on if people started noticing
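For reference, the relevant cap is GDPR Article 83(5): up to EUR 20 million or 4% of total worldwide annual turnover for the preceding financial year, whichever is higher. A quick sketch; the turnover figure below is purely illustrative, not a claim about any specific company:

```python
# GDPR Art. 83(5): fines of up to EUR 20 million or 4% of total worldwide
# annual turnover of the preceding financial year, whichever is higher.
def max_gdpr_fine_eur(annual_turnover_eur: float) -> float:
    return max(20_000_000.0, 0.04 * annual_turnover_eur)

# Illustrative only: a EUR 280bn turnover would put the cap around EUR 11.2bn.
print(f"{max_gdpr_fine_eur(280e9):,.0f}")  # 11,200,000,000
```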
@ekis back in 2021 I couldn’t get a vaccine at Rite Aid because I refused to connect my Google account to my Rite Aid account. The only way to schedule an appointment was through Google and I wasn’t going to go stand in a pharmacy full of sick people who refused to cover their face holes waiting in line for a vaccine.
I had it done at the dead mall by the National Guard instead. Fuck google. And fuck rite aid and their in store facial recognition technology and data breaches.
@ekis i made an alt google account even more throwaway than my “main” to test this out; I can’t get it to generate anything as extensive as what you’ve shown, and even copying your input 1:1 gets barely anything in response.
Google’s training data includes all your personal data already
Eh, don’t fearmonger. My impression is that it scraped data that was already publicly available. I cannot verify this 1:1 (as every response varies a bit…) but my impression is that if you were able to find it by googling your name, it’s there. And that VERY MUCH doesn’t include all my PII.
Whether that data should be in the set at all is a different question (and one where the answer doesn’t matter in the slightest). Fuck capitalism.
@domi Your Impressum data is not legally allowed to be in the training data, regardless of whether it's public or not
Which is why the system is supposed to redact it, so they avoid the legal liability
They also have private stuff; I have pulled out emails before
@ekis scrapers gonna scrape. all this proves for me is that it is impossible to do training on public data w/o manual curation. nihil novi.
search engines had the same problems, but all of those issues stemmed from people oversharing, or an occasional website that shared more than it promised.
your original post sounded more akin to “google fed non-public data (bought, or else) about you and everyone else to a database that you can search”, than “google has been keeping tabs on everyone for 20+ years, and now there’s yet another way of accessing them”. like, no hate, but this doesn’t make me any more angry at them than i already was
@LunaDragofelis Yep, absolutely this
They claim they do that, cleaning the training data before they put it into the dataset
But clearly they don't
And they don't, and will never do that, because they want the actual information for people like Palantir
Or other private or governmental intelligence companies/agencies they want to have future contracts with
So redaction it is; hope it doesn't fail
@LunaDragofelis This is an example of over-reliance
They think they can secure it, or use the model to secure itself with automated red-teaming (which they do, but it's not very good)
It's incompetence and bluster leading to catastrophic ecological consequences and devastating consequences for mental health, people's privacy, etc.
It's pretty good at helping authoritarian regimes create kill lists & other nefarious purposes
Can make a pretty good recipe for amphetamine using household chemicals