Reproducing Hacker News writing style fingerprinting
Thank you, tptacek. Thanks to the Internet Archive's cached results for "pg" from the post of three years ago, I was able to verify that the entries are quite similar in the case of "pg". Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably: montrose may really be a secondary account of pg, and it was also flagged as a cross-reference in the original work of three years ago.
Also note that vector similarity is not reciprocal: an item's top-scoring match may itself have many other items nearer to it, like in a 2D space where you have a cluster of points plus one point nearby but a bit apart.
Unfortunately I don't think this technique works very well for discovering actual duplicate accounts, because people often post just a few comments from their fake accounts, so there is not enough data. The exception is when someone consistently uses a second account to cover their identity.
EDIT: at the end of the post I added the visual representations of pg and montrose.
Imagine the 2D space: it has the same property!
You have three points close together and a fourth a bit more distant. Point 4's best match is point 1, but point 1's best match is point 2 or 3.
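The 2D analogy is easy to check with a few made-up coordinates (Euclidean distance here, but the same asymmetry holds for cosine similarity):

```python
from math import dist

# Three clustered points and a fourth one a bit farther away (made-up data).
points = {1: (0.0, 0.0), 2: (-0.1, 0.0), 3: (0.0, -0.1), 4: (1.0, 0.0)}

def best_match(name):
    # The closest *other* point by Euclidean distance.
    others = [(dist(points[name], p), n) for n, p in points.items() if n != name]
    return min(others)[1]

print(best_match(4))  # 1: point 1 is the best match for point 4
print(best_match(1))  # 2: but point 1's best match is inside the cluster
```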
Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8-quantized by default, the code path de-quantizes it, then re-quantizes it and searches. This produces a small loss of precision, like this:
redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
montrose 0.8640020787715912
redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
pg 0.8639097809791565
So while cosine similarity is commutative, the quantization steps lead to slightly different results. But the difference is 0.000092, which in practical terms does not matter. Redis can use non-quantized vectors via the NOQUANT option of VADD, but then each vector component takes 4 bytes: given that the recall difference is minimal, it is almost never worth it.
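A rough sketch of why an int8 round trip loses a little precision (the quantization scheme below is a generic symmetric one, not necessarily the exact scheme Redis uses, and the vectors are made-up random data):

```python
import numpy as np

# Two made-up float32 vectors standing in for the stored fingerprints.
rng = np.random.default_rng(42)
a = rng.normal(size=300).astype(np.float32)
b = rng.normal(size=300).astype(np.float32)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def int8_roundtrip(v):
    # Generic symmetric int8 quantization: scale components into
    # [-127, 127], round, then map back to floats.
    scale = np.abs(v).max() / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

exact = cosine(a, b)
approx = cosine(int8_roundtrip(a), int8_roundtrip(b))
# approx tracks exact closely; a tiny residual of this kind is what makes
# the pg->montrose and montrose->pg scores differ in the 5th decimal.
```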
How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?
https://antirez.com/hnstyle?username=dang&threshold=20&actio...
The "analyze" feature works pretty well.
My comments underindex on "this", because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
In case anyone cares.
Well, well, well, cocktailpeanuts. :spiderman_pointing:
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I for example, mutually share some words like:
because, people, you're, don't, they're, software, that, but, you, want
Unfortunately, this is a forum where people will use words like "because", "people", and "software".
Because, well, people here talk about software.
<=^)
Edit: Neat work, nonetheless.
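For what it's worth, the filtering idea can be sketched in a few lines (the stop list below is just the shared words quoted above, purely illustrative, not a real stop-word list):

```python
from collections import Counter
import re

# Hypothetical stop list: the ubiquitous words mentioned in this thread.
STOPWORDS = {"because", "people", "you're", "don't", "they're",
             "software", "that", "but", "you", "want", "the", "a", "and"}

def fingerprint(text, top_n=10):
    # Frequencies of the most common words *after* dropping the stop list,
    # so only the more distinctive vocabulary contributes to the vector.
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    return Counter(kept).most_common(top_n)
```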
I wonder how much accuracy would improve by expanding from single words to the most common pairs or n-tuples.
You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.
( That said, the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )
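A generic sketch of the pair/n-tuple counting idea (a hypothetical helper, not the post's actual pipeline):

```python
from collections import Counter
import re

def top_ngrams(text, n=2, k=5):
    # Frequencies of the k most common word n-grams (pairs for n=2, etc.).
    words = re.findall(r"[a-z']+", text.lower())
    grams = list(zip(*(words[i:] for i in range(n))))
    return Counter(grams).most_common(k)
```

The top-50 pairs and top-20 triples suggested above would just be `top_ngrams(text, 2, 50)` and `top_ngrams(text, 3, 20)`, accumulated per user.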
I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.
Is there anything that can be inferred from that? Is my writing less unique, so that it ends up being similar to more people?
Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side effect of his prolific posting, so that he matches well with a lot more people?
>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.
This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
SELECT
id,
text,
`by` AS username,
FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
`bigquery-public-data.hacker_news.full`
WHERE
type = 'comment'
AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
time DESC
LIMIT
100

My favorite, which is also up to date, is the ClickHouse playground.
For example:
SELECT * FROM hackernews_history ORDER BY time DESC LIMIT 10;
I subscribe to this issue to keep up with updates:
https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
And ofc, for those that don't know, the official API https://github.com/HackerNews/API
So the website processes only comments older than 2023?
Not very useful for newer users like me :/