Reproducing Hacker News writing style fingerprinting

https://antirez.com/news/150


This is an interesting and well-written post but the data in the app seems pretty much random.

Thank you, tptacek. Thanks to the Internet Archive's cache of the results for "pg" from the post of three years ago, I was able to verify that the entries are quite similar in the case of "pg". Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably: montrose may really be a secondary account of pg, and was also found as a cross reference in the original work of three years ago.

Also note that vector similarity is not reciprocal: one point can have a given top-scoring match, while that match has many other points nearer to it. Think of 2D space, where you have a cluster of points and another point nearby but a bit apart.

Unfortunately I don't think this technique works very well for discovering actual duplicate accounts, because people often post just a few comments from fake accounts. So there is not enough data, except in the case where one consistently uses another account to cover their identity.

EDIT: at the end of the post I added the visual representations of pg and montrose.

I'm surprised no one has made this yet with a clustered visualization.
Given that some matches are “mutual” and others are not, I don’t see how that could translate to a symmetric distance measure.

Imagine the 2D space, it also has the same property!

You have three points close together, and a fourth a bit more distant. Point 4's best match is point 1, but point 1's best matches are points 2 and 3.
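A minimal sketch of this asymmetry in 2D (the coordinates are made up for illustration):

```python
import math

# Hypothetical 2D points: 1, 2, 3 form a tight cluster; 4 sits farther away.
points = {1: (0.0, 0.0), 2: (-0.1, 0.0), 3: (0.0, 0.12), 4: (1.0, 0.0)}

def nearest(p):
    # Nearest neighbor of point p among the others (Euclidean distance).
    others = [q for q in points if q != p]
    return min(others, key=lambda q: math.dist(points[p], points[q]))

print(nearest(4))  # point 4's nearest neighbor is 1...
print(nearest(1))  # ...but point 1's nearest neighbor is 2, not 4
```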

Good point, but the similarity score between mutual matches is still different, so it doesn’t seem to be a symmetric measure?

Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8 quantized by default, the code path de-quantizes it, re-quantizes it, and then searches. This produces a small loss of precision, like this:

redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose

montrose 0.8640020787715912

redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg

pg 0.8639097809791565

So while cosine similarity is commutative, the quantization steps lead to slightly different results. But the difference is 0.000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vectors use 4 bytes per component: given that the recall difference is minimal, it is almost never worth it.
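A minimal sketch of why the round-trip changes the score. The quantization scheme here is a simplified assumption (symmetric int8 scaling), not Redis's exact code path:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=350).astype(np.float32)
b = (a + rng.normal(scale=0.3, size=350)).astype(np.float32)  # a similar vector

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def quantize_roundtrip(x):
    # Symmetric int8 quantization: scale to [-127, 127], round, scale back.
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8).astype(np.float32) * scale

exact = cosine(a, b)
approx = cosine(quantize_roundtrip(a), b)
print(exact, approx, abs(exact - approx))  # the difference is tiny, as above
```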

Redis supports random projection to a lower dimensionality, but the reality is that projecting a 350d vector into 2d is nice but does not remotely capture the "reality" of what is going on. Still, it is a nice idea to try some time. However I would do it with more than the 350 top words, since when I used 10k words it strongly captured interests more than style, so a 2D projection of that would be much more interesting, I believe.
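A sketch of the classic Gaussian random projection from 350d down to 2d (the fingerprint vectors here are random placeholders, not real user data):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 350))        # 100 hypothetical user fingerprints
projection = rng.normal(size=(350, 2)) / np.sqrt(2)  # random projection matrix, 1/sqrt(k) scaling
points_2d = vectors @ projection             # each user becomes an (x, y) point to plot
print(points_2d.shape)  # (100, 2)
```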
I tried my name, and I don't think a single "match" is any of my (very rarely used) throw away alts ;) I guess I have a few people I talk like?
When they are rarely used (a small number of total words produced), they don't carry meaningful statistical info for a match, unfortunately. A few users here reported finding actual duplicate accounts they used in the past.
this got two accounts that I used to use
Great! Thanks for the ACK.

How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?

https://antirez.com/hnstyle?username=dang&threshold=20&actio...


Probably it used to have activity when the database snapshot was created. Then the comments got removed.

The "analyze" feature works pretty well.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.

They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")

My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.

In case anyone cares.

That's very interesting, as I noticed that certain outliers did indeed seem to be conscious attempts.

Well, well, well, cocktailpeanuts. :spiderman_pointing:

I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.

cocktailpeanuts and I for example, mutually share some words like:

because, people, you're, don't, they're, software, that, but, you, want

Unfortunately, this is a forum where people will use words like "because, people, and software."

Because, well, people here talk about software.

<=^)

Edit: Neat work, nonetheless.

I noted the "analyze" feature didn't seem as useful as it could be because the majority of the words are common articles and conjunctions.
I'd like to see a version of analyze that filters out at least the following stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
The system uses those simple words on purpose, since they are "tellers" of a user's style in a context-independent way. Burrows' papers explain this very well, but in general we want to capture low-level structure more than topics and the exact non-obvious words used. I tested the system with 10k words, removing the most common ones, and you get totally different results (still useful, but not style matching): basically you get users grouped by interests.
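A minimal sketch of this Burrows-style idea: a fingerprint built from the frequencies of common function words, compared with cosine similarity. The word list and texts are illustrative stand-ins (the real system uses the top 350 words over a user's full comment history):

```python
from collections import Counter
import math

# Illustrative stand-ins for the top 350 most common words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def fingerprint(text):
    # Frequency of each function word, normalized by total word count.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Different topics, similar low-level structure.
a = fingerprint("the cat sat on the mat and it was the best of days")
b = fingerprint("the dog ran in the park and it was the worst of times")
print(cosine(a, b))
```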

I wonder how much accuracy would improve if we expanded from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding the frequency of the top 50 word pairs and the top 20 most common 3-tuples would be a strong signal.

(That notwithstanding, the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote.)

Maybe there isn't enough data per user for pairs, but I thought about mixing the two approaches (though I had no time to do it): that is, to have 350 components like now for single-word frequencies, plus another 350 for the most common pair frequencies. In this way part of the vector would remain a strong enough signal even for users with comparatively less data.
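A sketch of that mixed vector: unigram frequencies concatenated with word-pair frequencies. Sizes and word lists are shrunk for illustration (the proposal above would use 350 + 350 components):

```python
from collections import Counter

TOP_WORDS = ["the", "of", "and", "to", "a"]                 # stand-in for the top 350 words
TOP_PAIRS = [("of", "the"), ("in", "the"), ("to", "the")]   # stand-in for the top pairs

def mixed_fingerprint(text):
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    wc, pc = Counter(words), Counter(pairs)
    word_part = [wc[w] / max(len(words), 1) for w in TOP_WORDS]
    pair_part = [pc[p] / max(len(pairs), 1) for p in TOP_PAIRS]
    return word_part + pair_part  # one vector: unigram signal + bigram signal

v = mixed_fingerprint("the cat of the house ran to the garden in the rain")
print(len(v))  # 8 components: 5 word frequencies + 3 pair frequencies
```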
Very cool. Also a bit surprising — two of my matches are people I know IRL.
Are you all from the same town? Another user reported this finding.

I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.

Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?

Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?

It's not that you are "less unique": what matters is the structure of the sentence, the syntax. You simply tend to use words with balanced frequency. It's not a bad thing.

>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

SELECT
  id,
  text,
  `by` AS username,
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
  `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
  time DESC
LIMIT
  100


https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...


My favorite which is also up to date is the ClickHouse playground.

For example:

SELECT * FROM hackernews_history ORDER BY time DESC LIMIT 10;

https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...

I subscribe to this issue to keep up with updates:

https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...

And ofc, for those that don't know, the official API https://github.com/HackerNews/API


I didn't know there was an official API! This explains why the data is so readily available in many sources and formats. That's very cool.

so the website processes only comments older than 2023?

not very useful for newer users like me :/

I discovered the data is available up to date. Maybe sooner or later I'll repeat and extend the analysis, potentially also using multiple ways to compute the vectors, including SBERT (or better, SModernBERT).