LLMs can unmask pseudonymous users at scale with surprising accuracy
From a Facebook post I made on February 17th:
There are giant AI data firms that promise they can go through massive troves of data and pull out general and specific information from them. Information that is actionable and accurate. Give it 6 million data points and it’ll find all the links and organize them for you and unmask hidden details that aren’t visible to the naked eye.
Not one of those companies is stepping up to go through the publicly released Epstein files.
You can use the results of the AI analysis to identify people and then use that to do a proper investigation. Right now none of that is happening. No speculation. No tangibles. No investigation. No indictment.
Trying to unmask people is a step in the right direction.
Prompts are in the appendix: arxiv.org/abs/2602.16800
I don’t know how far you get on the free tier, but it should be at least enough for a proof of principle, enough to get other people to chip in. You didn’t have any qualms about demanding that other people do this for free.
Note that this would be a serious GDPR violation in Europe, so there will be serious pressure on AI companies to prevent this kind of use.

We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to classical deanonymization work (e.g., on the Netflix prize) that required structured data, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
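The three-step pipeline in the abstract (extract features, rank candidates by embedding similarity, verify top matches) can be sketched in a heavily simplified form. In this sketch a bag-of-words vector stands in for the LLM feature extraction and learned semantic embeddings, and a similarity threshold stands in for the LLM verification step; all usernames and profile texts are made up for illustration.

```python
# Simplified sketch of the closed-world matching pipeline: embed both
# databases, rank candidates by cosine similarity, and accept only
# confident matches. A real attack would use LLM-extracted features,
# proper semantic embeddings, and LLM reasoning for verification.
import math
import re
from collections import Counter

def embed(text):
    """Stand-in 'embedding': lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match(db_a, db_b, threshold=0.3):
    """For each pseudonymous profile in db_a, rank db_b candidates by
    similarity, then accept the best match only above a threshold,
    trading recall for precision (the verification step)."""
    emb_b = {uid: embed(text) for uid, text in db_b.items()}
    results = {}
    for uid_a, text_a in db_a.items():
        ea = embed(text_a)
        best = max(emb_b, key=lambda u: cosine(ea, emb_b[u]))
        results[uid_a] = best if cosine(ea, emb_b[best]) >= threshold else None
    return results

# Toy databases; hypothetical users and texts, not from the paper.
db_a = {"hn_user1": "I work on compilers and Rust in Berlin",
        "hn_user2": "baking sourdough bread and analog photography"}
db_b = {"li_x": "Compiler engineer in Berlin, Rust enthusiast",
        "li_y": "Hobby baker, analog photography fan"}
print(match(db_a, db_b))  # → {'hn_user1': 'li_x', 'hn_user2': 'li_y'}
```

The threshold is what keeps false positives down in the closed-world setting: a profile with no strong candidate maps to `None` rather than to its nearest neighbor, which is how the reported high-precision operating points are reached.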
Seriously, I’m not qualified. No amount of appendix prompts and Dunning-Kruger is going to change that.
I’m not demanding anything. I’m suggesting that either AI can’t do what is claimed, or that people with something to prove aren’t interested in proving it.
My statement was that AI can be used to unmask the individuals that have been redacted, AKA anonymized. This paper is all about de-anonymizing.
I’m unclear on whether we’re having a good-faith conversation, because I thought that would have been very clear from the beginning.
You said: I’m suggesting that either AI can’t do what is claimed, or that people with something to prove aren’t interested in proving it.
You’re also saying: My statement was that AI can be used to unmask the individuals that have been redacted, AKA anonymized. This paper is all about de-anonymizing.
I can’t make sense of what you are trying to say.
But as it stands right now, we just know that it’s not being used to do what they claim.
Wait. How do we know this? Besides, these researchers show that it is possible, not that it is established practice.
What is going on here? Something isn’t right about this conversation. We should not be this confused and talking past each other.
True or false: there has been no release by an AI company or anyone using AI to unmask the individuals obscured in the Epstein files.
I doubt a reputable company would do that, except in cooperation with the authorities. Some people have used AI in an attempt to do that, but I’m not familiar with the details.
I don’t really understand what you expect from who and why.
Today I asked an AI which phone providers were available, sorted by price and offers, and it lied the whole time. When I pointed it out, the AI corrected most of it, but it also removed some entries that were accurate, for some reason.
It would have been quicker if I had done that myself instead of asking the AI. Oh, and it also didn’t list all the companies.