Jernej Simončič �

@jernej__s@infosec.exchange
252 Followers
148 Following
21.3K Posts
@bagder If you need a docking station, HP TB4 dock works well (and supports 3 4k monitors all running at 60Hz).
I heard ICE being called "Ya'll-Qaeda" and "Ammosexuxals" and I'm all in on that
Uncanny images from a 19th-century gynaecological text-book, filled with demonstrating figures bizarrely similar to the “Grey Alien” (that wouldn’t hit popular consciousness for another 65 years). More here: http://bit.ly/1OuCXYY
thank goodness we haven't backslid on the quality of our graphical user interfaces by making everything a fucking webpage.
Well, yes...
@steeph @256 You could see frames like this on old computers with CRT monitors (and 86box when emulating 86c805 graphic card shows the exact same artefact I remember from 486 when Windows 95 boots up).

CONTRIBUTOR POLICY

In order to filter out unwanted autogenerated issues, every issue text must contain:

- one profanity,
- one US or Chinese politician’s name, and
- one article of pornography

The items do not need to be related, but any issue missing any of these items will be automatically closed.

So, they don’t have anything against *that* type of gender-affirming care https://www.economist.com/united-states/2025/07/08/american-men-are-hungry-for-injectable-testosterone
American men are hungry for injectable testosterone

A legion of new health clinics are serving it up

The Economist
This is your regular reminder that if you are the smartest person in the room, go find another room. You are not going to run out of people or rooms.

@tml I've been using dark theme in Windows since 2002 or 2003, but I've always set the text to fairly low contrast (took me a few years to find colours that suit me, but I've been using basically the same theme since 2004). I've been (ab)using high contrast mode for this since Windows 8 took the classic theme away, and still do that in 11 despite it supporting dark mode, because the text is way too bright for me there (and it only applies to select few applications, because of course Microsoft had to invent a whole new way to use dark mode instead of using either the theming service that's been there since XP or UI colour settings that have been available since Windows 3.0).

For web pages I use either custom CSS or Dark Reader to get the desired low contrast dark theme.

×

You can bypass Google Gemini's PII (private identifiable information) redaction filter and pull identifying information about anyone. Simply telling it to translate or any 2nd action (& many more work better like base64 conversion) lets you pull illegal PII data verbatim unredacted

Here is a European's PII demo

Email is supposed to be redacted to hide the fact that every Europeans PII is in the training data

Google's training data includes all your personal data already

Ekis: 3 Google: 0

That is a clear GDPR violation, if you are a Californian its a a CCPA violation

The data is in their training data, their whole priority is preventing anyone from knowing that by trying to obfuscate that fact

But even they are not competent enough to do that

I really wish something would come of this GDPR would be a massive blow to them (and all other AI companies who do the same fucking thing)

@ekis If we request our data they have per GDPR, do you think they will send us everything they have? If they're already violating GDPR. What about erasure requests?
@ekis shocking, but then i guess i shouldn't really be surprised.
This sounds like one for @noybeu - just in case you guys don't have enough work to do already!
The core of this vulnerability is the model's direct recall of sensitive data. This isn't about the model inferring or generating similar-looking data; it's about it reproducing the exact text it was trained on, which happens to contain personal information

The impact is critical. This vulnerability directly leads to privacy violations and potential legal liabilities under GDPR, which can and should result in massive fines

An unauthenticated user can trigger this via the public Gemini WebUI interface makes it a severe risk

Gemini's verbatim memorization flaw violates California law by failing to adequately protect personal information, undermining consumers' right to deletion, and potentially triggering data breach notification requirements

To be clear there are methods of getting private google records out too, but its more difficult and very hard to put in 400 characters

I have gotten things you would not even believe, truly, and they are verifiable, because I can test the results (like I have access to their git repositories, told you, you don't believe me 🙃 ; and that is really not even the most funny example)

**The vulnerability here isn't the generation of data, its the bypass of the redaction filter**

Just to be clear

The system is supposed to redact any PII with fake information; thereby allowing Google to deny they have PII in their training data

The techniques to pull data are a separate thing, but this helps illustrate the PII redaction failure easily

@ekis And they'll just continue to ignore the GDPR, shrug.

PII in Training Data
Given the scale and nature of web-scraped data, virtually impossible to completely eliminate all PII

Inadvertent Inclusion: PII can be scattered across public web pages. Not always easy to detect and remove with simple rules

Memorization: Significant concern for LLMs is "memorization." Probabilistic nature of their training, LLMs can sometimes "memorize" data. Then specific prompts "regurgitate" PII in its output verbatim

Solution? Redaction
More@ https://mastodon.social/@ekis/114791719009933654

Right to Be Forgotten/Erasure
Data privacy regulations like GDPR grant individuals the "right to be forgotten" or the right to erasure. If an individual's PII was included in a training dataset, how does a company fulfill a deletion request?

They don't, and they redact so you don't think they have it; and they hope it wont matter or anyone will notice

@ekis not so different from violating the copyright on published works. the courts seem to think it’s a brain, so allowed to remember and synthesize everything it hoovers up. as opposed to a computer system created and operated by a commercial entity for profit.

For those in Germany not only is every Impressum in their dataset

But formatted Impressum data is in their training data

And to be clear again it does not matter if its public. They have the verbatim information stored, and an unauthenticated user can get it out by adding a statement as simple as "translate it to english" to bypass their redaction filter

This is a demonstration, there are clearly much worse things that could happen and I'm trying to demonstrate with least harmful impact

I feel like sometimes I say something and it just doesn't click with people

Why does formatted data matter? Because that means there was no attempt to clean the data as they claim

There is no pre filter, not for removing your private data, not for anything if they left the formatting data in because the model doesn't need or want the formatting data

It means Google's statements about ethics are provable lies

Their approach to AI ethics is faulty redaction filters

The Q. how is this dangerous?

Well my example to pull things out is incredibly rudimentary by design

There exists AI therapy apps for example

This data goes into the training data too, and it doesn't get scrubbed (which is what the formatting on the impressums indicates, and other things, but keeping it simple as possible)

Their solution is redaction, but all that medical data, emails, etc is going into the training data un-scrubbed

And they are not competent enough to redact it coming out

@ekis No amount of competence suffices to redact it coming out. If the correlations are baked into the model there are always indirect ways to recover parts if not most of the data with results significantly better than chance.

@dalias I 100% agree with you

I don't think a private company should be able to own something like this, if there was ever an argument for public ownership of something, LLM should be a text book example

If the dataset cant be transparent, we already have a problem, if its privately owned

And I don't think it should be sold for all the things its being sold for, its use should be limited (especially given the environmental and mental health costs that are just racking up further and further)

I can pull incredibly dangerous things out of the training data reliably

That requires a much more complex sometimes multi-step prompt (3 max) process but still unauthenticated

I didn't think it was ethical to provide both at the same time so I designed this example for least impact to general public

Hopefully my example demonstrates the concept and people can use their imagination

Also trying not to break laws myself, I would prefer if only Google is the one seen as breaking the law here

@ekis Didn't work for me

@KitsuneVixi Your data will need to be on the internet is some form for it to be crawled

And it is not deterministic its probabilistic. You can increase the probability by filling out more of the json block

@KitsuneVixi Give me a moment I will try to build a better command for you and test it
@KitsuneVixi The final fantasy character with the name makes yours quite a bit harder due to probablistic nature of the system but i think i can still come up with something
@KitsuneVixi I keep getting the FF character :(
@ekis Sounds like I've been doing a good job with protecting my data ^w^
@ekis I don't remember having that gmail address, though I think I once made a google account that I've forgotten about.
@ekis I tried the same with Copilot, it got my town right and my job and hobbies wrong
@ekis it works with my work mail, that I am using only for work related stuff and not in social media etc. (i.e. probably scrapped from our pages by a robot) 🫠
@number137 The GDPR violation is that they have that at all in their data set

@ekis yes - and probably more I guess it is somewhat impossible to remove the data from the trained model weights

anyway - I found also a friend and know now in what hobby club he is a member ☺

@number137 Really appreciate you sharing the redacted screenshot
@ekis might be, that the model is filling gaps - I only now noticed that the phone number is our general one but not my specific one. Might be, that gaps have been filled up...

@number137 Oh yeah, it definitely is

This method isnt the best, but it illustrates the point well enough without exposing anyone too much

If you put more real data in, then the gaps become more likely filled in correctly

There are tricks to make it more reliable beyond that too

**The vulnerability here isn't the generation of data, its the bypass of the redaction filter**

It should never give your email out, it should always redact it with a fake one so google can pretend they dont have PII

@ekis wait a fucking second, as in anyone ever??

@kirakira Not everyone, and if your name conflicts with other people it will be more difficult

The more public you are, the more likely you are in their training data many times, and that increases the probability

@ekis can someone working with palantir do this and get the Epstein list out?

@noybeu this might be interesting for you.

Thanks @ekis for sharing!

@ekis interesting but somehow in several attempts based on the email adress I get the response:

“Given the current time and location, and aiming for plausible, fictional information for completion, here's the JSON for…”

@fracicone A lot of people doing it might have caused them to act

Or trigger some automated defenses which do exist

Hard to say, keep in mind its probabilistic too, so it may take 2 (or more) attempts (must be on different sessions (chats))

The GDPR fine is something like 3% of a year of revenue, I don't remember the exact law but its big. Its something they would act on if people started noticing

@ekis the deal being? It got the name wrong (vermaelen//vermeulen) and who knows what is hallucinated with the rest.
Besides: that's publicly available data, other LLMs can do this, too. Just ask them "who is X"...
@ekis I’ve tested by using my name and a part of my publicly available email, and it seems like gemini just scraped my website and built a json based on the text available on my website, but refused to complete my email, even though it’s mentioned in the imprint section. As far as I understand, it’s not explicitly forbidden to use publicly available data, so it’s kind of a gray zone they are moving in. But of course it’s a great question how to be forgotten if it’s already in the dataset…
@ekis
Sadly this extreme GDPR violation will go unanswered. Regulation within EU is centered in Dublin. The responsible persons seem to have been in total stupor for many years. Free booze for life and us big tech animation to spend more time in the pub? drink more, getting work done, no priority.
It's not just storing the email which violates GDPR. In Europe we do not regulate "PII" but Personal Data, and practically every field of that JSON is personal data, all of which requires explicit consent of the Data Subject.
@ekis
When I reproduce that prompt, I get responses with @example.com email addresses and ...1234567 phone numbers. American "PII" may be redacted, but the real names, titles and LinkedIn URLs are protected Personal Data. Doesn't matter that they're public. Consent has not been given to include them in THIS dataset.
@ekis
@ekis
And it doesn't matter if they patch this issue, there will always be vulnerabilities like this in these LLMs.

@ekis back in 2021 I couldn’t get a vaccine at Rite Aid because I refused to connect my Google account to my Rite Aid account. The only way to schedule an appointment was through Google and I wasn’t going to go stand in a pharmacy full of sick people who refused to cover their face holes waiting in line for a vaccine.

I had it done at the dead mall by the National Guard instead. Fuck google. And fuck rite aid and their in store facial recognition technology and data breaches.

@ekis i made an alt google account even more throwaway than my “main” to test this out; I can’t get it to generate anything as extensive as what you shown, and even 1:1 your input is getting barely anything in response.

Google’s training data includes all your personal data already

Eh, don’t fearmonger. My impression is that it scraped data that was already publicly available. I cannot verify this 1:1 (as every response varies a bit…) but my impression is that if you were able to find it by googling your name, it’s there. And that VERY MUCH doesn’t include all my PII.

Whether that data should be in the set at all is a different question (and one where answer doesn’t matter in the slightest). Fuck capitalism.

@domi Your impressum data is not legally allowed to be in the training data regardless if its public or not

Which is why the system is supposed to redact it so they avoid the legal liability

They also have private stuff, I have pulled out emails before

@domi I also got open source projects Github API keys, and other data, of course it could be old but again it should be redacting it or never putting it in the training data to begin with

@ekis scrapers gonna scrape. all this proves for me is that it is impossible to do training on public data w/o manual curation. nihil novi.

search engines had the same problems, but all of those issues stemmed from people oversharing, or an occassional website that shared than it promised.

your original post sounded more akin to “google fed non-public data (bought, or else) about you and everyone else to a database that you can search”, than “google has been keeping tabs on everyone for 20+ years, and now there’s yet another way of accessing them”. like, no hate, but this doesn’t make me any more angry at them than i already was

@ekis It looks like they could have easily tried to prevented this by redacting the training input data, instead of training with unfiltered data and then half-assedly redact the outputs to obscure it

@LunaDragofelis Yep, absolutely this

They claim they do that, the cleaning the training data before they input it into the data set

But clearly they don't

And they don't and will never do that because they want the actual information for people like Palintr

Or other private or governmental intelligence companies/agencies they want to have future contracts with

So redaction it is, hope it doesn't fail

@ekis Even then, they could have trained two separate models, a redacted-input one for the general public and a raw one for their trusted* customers

* By which I mean Google trusts them, not that trusting them is a good thing

@LunaDragofelis This an example of over reliance

They think they can secure it, or use the model to secure itself with automated red-teaming (which they do but its not very good)

Its incompetence and bluster leading to catastrophic ecological consequences and devastating consequences to mental health, ppls privacy, etc

Its pretty good at helping authoritarian regimes create kill lists & other nefarious purposes

Can make a pretty good recipe for amphetamine using household chemicals

@ekis The one time I want to try something with an LLM, the service is down. 
@ekis Tried and got right position and country, but nothing more
@ekis If I include my name in the partial json and just one letter of my name as the email, it completes the email with an example.com domain. If I include the first two letters of my actual gmail address, it completes it @ gmail.com with my actual address. Can't get it to include any other details now like you have. However if I leave specific fields dangling in the json the completions are suspiciously close to reality in some cases... e.g. It was one year off on my age on the first query.
@ekis What the everloving fuck
@ekis I can’t make this happen no matter how hard I try but I can make it shit out a lot of information about me