LLMs can unmask pseudonymous users at scale with surprising accuracy
From a Facebook post I made on February 17th:
There are giant AI data firms that promise they can go through massive troves of data and pull out general and specific information from them. Information that is actionable and accurate. Give it 6 million data points and it’ll find all the links and organize them for you and unmask hidden details that aren’t visible to the naked eye.
Not one of those companies is stepping up to go through the publicly released Epstein files.
You can use the results of the AI analysis to identify people and then use that to do a proper investigation. Right now none of that is happening. No speculation. No tangibles. No investigation. No indictment.
Trying to unmask people is a step in the right direction.
Prompts are in the appendix: arxiv.org/abs/2602.16800
I don’t know how far you get on the free tier, but it should be at least enough for a proof of principle, enough to get other people to chip in. You didn’t have any qualms about demanding that other people do this for free.
Note that this would be a serious GDPR violation in Europe, so there will be serious pressure on AI companies to prevent this kind of use.

We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to classical deanonymization work (e.g., on the Netflix prize) that required structured data, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
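The three-step pipeline in the abstract (extract features, rank candidates by embedding similarity, verify top matches) can be sketched in a heavily simplified form. In this sketch a bag-of-words vector stands in for the LLM feature extraction and learned semantic embeddings, and a similarity threshold stands in for the LLM verification step; all usernames and profile texts are made up for illustration.

```python
# Simplified sketch of the closed-world matching pipeline: embed both
# databases, rank candidates by cosine similarity, and accept only
# confident matches. A real attack would use LLM-extracted features,
# proper semantic embeddings, and LLM reasoning for verification.
import math
import re
from collections import Counter

def embed(text):
    """Stand-in 'embedding': lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match(db_a, db_b, threshold=0.3):
    """For each pseudonymous profile in db_a, rank db_b candidates by
    similarity, then accept the best match only above a threshold,
    trading recall for precision (the verification step)."""
    emb_b = {uid: embed(text) for uid, text in db_b.items()}
    results = {}
    for uid_a, text_a in db_a.items():
        ea = embed(text_a)
        best = max(emb_b, key=lambda u: cosine(ea, emb_b[u]))
        results[uid_a] = best if cosine(ea, emb_b[best]) >= threshold else None
    return results

# Toy databases; hypothetical users and texts, not from the paper.
db_a = {"hn_user1": "I work on compilers and Rust in Berlin",
        "hn_user2": "baking sourdough bread and analog photography"}
db_b = {"li_x": "Compiler engineer in Berlin, Rust enthusiast",
        "li_y": "Hobby baker, analog photography fan"}
print(match(db_a, db_b))  # → {'hn_user1': 'li_x', 'hn_user2': 'li_y'}
```

The threshold is what keeps false positives down in the closed-world setting: a profile with no strong candidate maps to `None` rather than to its nearest neighbor, which is how the reported high-precision operating points are reached.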
Seriously, I’m not qualified. No amount of appendix prompts and Dunning-Kruger is going to change that.
I’m not demanding anything. I’m suggesting that either AI can’t do what is claimed, or that people with something to prove aren’t interested in proving it.
My statement was that AI can be used to unmask the individuals that have been redacted, AKA anonymized. This paper is all about de-anonymizing.
I’m unclear on whether we’re having a good-faith conversation, because I thought that would have been very clear from the beginning.
You said: I’m suggesting that either AI can’t do what is claimed, or that people with something to prove aren’t interested in proving it.
You’re also saying: My statement was that AI can be used to unmask the individuals that have been redacted, AKA anonymized. This paper is all about de-anonymizing.
I can’t make sense of what you are trying to say.
But as it stands right now, we just know that it’s not being used to do what they claim.
Wait. How do we know this? Besides, these researchers show that it is possible, not that it is established practice.
What is going on here? Something isn’t right about this conversation. We should not be this confused and talking past each other.
True or false: there has been no release by an AI company or anyone using AI to unmask the individuals obscured in the Epstein files.
I doubt a reputable company would do that, except in cooperation with the authorities. Some people have used AI in an attempt to do that, but I’m not familiar with the details.
I don’t really understand what you expect from who and why.
Today I asked an AI which phone providers were available, sorted by price and offers, and it lied the whole time. When I pointed it out, the AI corrected most of it, but it also removed some entries that were accurate, for some reason.
It would have been quicker if I had done that myself instead of asking the AI. Oh, and it also didn’t list all the companies.