LLMs can unmask pseudonymous users at scale with surprising accuracy

https://lemmy.world/post/43819988

From a Facebook post I made on February 17th:

There are giant AI data firms that promise they can go through massive troves of data and pull out general and specific information that is actionable and accurate. Give it 6 million data points and it’ll find all the links, organize them for you, and unmask hidden details that aren’t visible to the naked eye.

Not one of those companies is stepping up to go through the publicly released Epstein files.

There were reports of people trying to unredact the files almost immediately.
But that’s not the same, is it?
I don’t think you can do literally the same thing on the Epstein files. Maybe I’m misunderstanding what you have in mind.
In theory, using the released files together with information from public sources, it should be possible to figure out who those redacted names are based on writing style and other factors. We should be able to deanonymize them.
Hmm. Maybe but it is not the same problem as those discussed in OP. I also have some doubts about the paper, but that’s another story. You could try it out?
I’m not qualified to design the prompts and home users can’t really pile in 3 million+ documents.

Prompts are in the appendix: arxiv.org/abs/2602.16800

I don’t know how far you’d get on the free tier, but it should be at least enough for a proof of principle, enough to get other people to chip in. You had no qualms demanding that other people do this for free.

Mind that this is a serious GDPR violation in Europe. So there will be serious pressure on AI companies to prevent this kind of use.

Large-scale online deanonymization with LLMs

We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to classical deanonymization work (e.g., on the Netflix prize) that required structured data, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
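The three-stage pipeline the abstract describes can be sketched in a few lines of Python. To be clear, this is an illustration, not the authors' code: the feature extractor and the verification step here are trivial stand-ins for LLM calls, and a bag-of-words cosine similarity stands in for the semantic embeddings; all names and thresholds are invented for the example.

```python
from collections import Counter
import math

def extract_features(text):
    # Stage 1 stand-in: the paper uses an LLM to pull identity-relevant
    # features (location, job, interests) from raw text. Here we just
    # keep words longer than three characters.
    return [w.lower().strip(".,") for w in text.split() if len(w) > 3]

def embed(features):
    # Stage 2 stand-in: the paper uses semantic embeddings; here, a
    # bag-of-words count vector.
    return Counter(features)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match(query_text, candidates, top_k=2, threshold=0.25):
    # Stage 2: rank candidate profiles by embedding similarity, keep top_k.
    q = embed(extract_features(query_text))
    scored = sorted(
        ((cosine(q, embed(extract_features(t))), name)
         for name, t in candidates.items()),
        reverse=True)[:top_k]
    # Stage 3 stand-in: the paper has an LLM reason over the shortlist to
    # verify matches; here a similarity threshold cuts false positives.
    best_score, best_name = scored[0]
    return best_name if best_score >= threshold else None

# Hypothetical toy data: two pseudonymous profiles and one post to link.
profiles = {
    "user_a": "Kernel developer in Berlin who posts about rust compilers",
    "user_b": "Amateur astronomer sharing telescope photos from Arizona",
}
hn_post = "Working on a rust compiler backend from my Berlin office"
print(match(hn_post, profiles))  # → user_a
```

The point of the threshold in the final step is the precision/recall trade-off the abstract reports: the verifier exists to reject shortlist entries that merely look similar, which is what keeps false positives low at scale.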


Seriously, I’m not qualified. No amount of appendix prompts and Dunning-Kruger is going to change that.

I’m not demanding anything. I’m suggesting that AI can’t do what is claimed or that people with something to prove are not interested in proving something.

You think the paper is fraud?
My statement that I’m quoting predates this paper. My statement exists completely independent of this paper ever being produced. My statement is not about this paper. My statement is about the state of AI and the industry. This paper reinforces my statement.
How so?

My statement was that AI can be used to unmask the individuals that have been redacted, i.e., they are anonymized. This paper is all about de-anonymizing.

I’m unclear on whether we’re having a good-faith conversation, because I thought that would have been very clear from the beginning.

You said: I’m suggesting that AI can’t do what is claimed or that people with something to prove are not interested in proving something.

You’re also saying: My statement was that AI can be used to unmask the individuals that have been redacted, i.e., they are anonymized. This paper is all about de-anonymizing.

I can’t make sense of what you are trying to say.

Did you see the “or” in my first statement?
I still can’t make sense of what you are trying to say.
I set up two different, not necessarily exclusive, options. Either it can’t do what they say or it can. If it can’t then that’s one issue. If it can then the people with something to prove aren’t stepping up to show us its potential. There could be multiple motivations behind that. But as it stands right now we just know that it’s not being used to do what they claim.

But as it stands right now we just know that it’s not being used to do what they claim.

Wait. How do we know this? Besides, these researchers show that it is possible, not that it is established practice.

What is going on here? Something isn’t right about this conversation. We should not be this confused and talking past each other.

True or false: there has been no release by an AI company or anyone using AI to unmask the individuals obscured in the Epstein files.

I doubt a reputable company would do that, except in cooperation with the authorities. Some people have used AI in an attempt to do that, but I’m not familiar with the details.

I don’t really understand what you expect from who and why.

Can you state my position to me in terms I would agree with?

Probably not.

I don’t know what AI companies you mean here. From context, I’m guessing that you don’t mean the likes of Anthropic, but rather companies that do sleuthing on the net, like those firms that look for copyright or trademark violations. I’m not familiar with that industry and don’t know their marketing material. Maybe that’s the problem.

I don’t know what claims they make, or how it relates to the Epstein files, or OP.