Mastodawn

Blaze Apr 30, 2024

Reddit is full of bots: thread reposted exactly the same, comment by comment, 10 months later

https://lemmy.blahaj.zone/post/11615413

Anti-Face Weapon Apr 30, 2024

My understanding of how this works is that the left one is real accounts making real comments, at least for the most part.

Then when the link gets reposted, either by a bot or naturally, potentially depending on the title, the bots scrape the old comments and post them.

It’s content farming. And Reddit is probably okay with this.

livus Apr 30, 2024

Reddit is going to poison LLMs sooner than I thought.

bjorney Apr 30, 2024

Reddit probably omits bot accounts when it sells its data to AI companies

phdepressed Apr 30, 2024

I doubt Reddit is in charge of many of the existing bots on their site.

bjorney Apr 30, 2024

Reddit has access to its own data - they absolutely know which users are posting unique content and which users’ content is a 100% copy of data that exists elsewhere on their own platform

phdepressed Apr 30, 2024

I know they could, I’m just not sure they’re that competent. These bots often aren’t a single user or just copy-paste either; there’s usually some effort to mix it up or change the wording slightly. Reddit’s internal search function is infamously shit, but they “know” which users are unlabeled bots with some effort put behind them?

bjorney Apr 30, 2024

I know everyone here likes to circle jerk over “le Reddit so incompetent”, but at the end of the day they are a (multi) billion dollar company, and it’s willfully ignorant to assume that there isn’t a single engineer at the company who knows how to measure string similarity between two comment trees (hint: import difflib in Python)
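
For reference, a minimal sketch of what the difflib hint could look like for two small comment trees. The nested-dict tree shape (`text`, `replies`) and the helper names `flatten` and `tree_similarity` are assumptions made for illustration, not anything Reddit is known to use. It only compares two trees that have already been picked out; it says nothing about finding candidate pairs at scale.

```python
import difflib

def flatten(tree, depth=0):
    """Yield (depth, text) for every comment in a nested tree, depth-first."""
    for node in tree:
        yield depth, node["text"]
        yield from flatten(node.get("replies", []), depth + 1)

def tree_similarity(tree_a, tree_b):
    """Return a 0.0-1.0 similarity score between two comment trees.

    Each comment becomes one token of the form "depth:normalized text",
    so both the wording and the reply structure have to line up.
    """
    a = [f"{d}:{t.lower().strip()}" for d, t in flatten(tree_a)]
    b = [f"{d}:{t.lower().strip()}" for d, t in flatten(tree_b)]
    # autojunk=False keeps the "popular element" heuristic out of the way.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

A score of 1.0 means the two trees match comment for comment after lowercasing and trimming; what threshold counts as "a repost" would have to be tuned on real data.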

icydefiance

To compare every comment on reddit to every other comment in reddit’s entire history would be a pretty major performance problem, and if you want to find similar comments instead of exact matches, it becomes a lot harder to do that efficiently. ElasticSearch might be able to do it, but then you need to duplicate all of that data in a separate database and keep it in sync with your main database without affecting performance too much when people are leaving new comments.
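
One common way around the all-pairs problem for similar (not just identical) text is MinHash with locality-sensitive hashing: comments are only compared when they land in the same bucket. The following is a rough sketch of that technique, not a claim about what Reddit or ElasticSearch does; the function names and the `bands`/`rows` values are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Return the set of k-word shingles of a comment (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def minhash(shingle_set, num_hashes):
    """One minimum hash value per seed; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature

def candidate_pairs(comments, bands=16, rows=2):
    """Return pairs of comment ids that share at least one LSH bucket.

    `comments` maps comment id -> text. Only pairs returned here would
    need an exact (and more expensive) similarity check afterwards.
    """
    buckets = defaultdict(list)
    for cid, text in comments.items():
        sig = minhash(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(cid)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Each comment is hashed once, so the cost grows with the number of comments rather than the number of pairs. The storage and sync costs raised above still apply: the buckets have to live somewhere and be kept up to date.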

Comparing combinations of comments is probably impossible. Reddit has a massive number of comments to begin with, and the number of possible subtrees of those comments would just be absurd.

Programmers just do what they’re told. If the managers don’t care about something, the programmers won’t work on it.

bjorney Apr 30, 2024

“To compare every comment on reddit to every other comment in reddit’s entire history” would require an index

You think in Reddit’s 20 year history no one has thought of indexing comments for data science workloads? A cursory glance at their engineering blog indicates they perform much more computationally demanding tasks on comment data already for purposes of content filtering
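
As a toy illustration of the index idea for the exact-copy case only: hash each normalized comment once and group ids by hash, which finds verbatim reposts in a single pass. The names `fingerprint`, `build_index` and `duplicates` are hypothetical, and an in-memory dict stands in for whatever store a real pipeline would use.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Hash of the comment after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def build_index(comments):
    """Map fingerprint -> list of comment ids, in one pass over the data.

    `comments` is an iterable of (comment_id, text) pairs.
    """
    index = defaultdict(list)
    for cid, text in comments:
        index[fingerprint(text)].append(cid)
    return index

def duplicates(index):
    """Return the groups of comment ids that share a fingerprint."""
    return [ids for ids in index.values() if len(ids) > 1]
```

This does nothing for reworded copies, which is where the harder similarity problem discussed in this thread comes in.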

“you need to duplicate all of that data in a separate database and keep it in sync with your main database without affecting performance too much”

Analytics workflows are never run on the production database, always on read replicas which are taken asynchronously and built from the transaction logs. They likely have an ETL tool

“Programmers just do what they’re told. If the managers don’t care about something, the programmers won’t work on it.”

Reddit’s entire monetization strategy is collecting user data and selling it to advertisers - It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement

icydefiance Apr 30, 2024

“You think in Reddit’s 20 year history no one has thought of indexing comments for data science workloads?”

I’m sure they have, but an index doesn’t have anything to do with the python library you mentioned.

“Analytics workflows are never run on the production database, always on read replicas”

Sure, either that or aggregating live streams of data so I don’t even need a read replica, but either way it doesn’t have anything to do with ElasticSearch.

It’s still totally possible to sync things to ElasticSearch in a way that won’t affect performance on the production servers, but I’m just saying it’s not entirely trivial, especially at the scale reddit operates at, and there’s a cost for those extra servers and storage to consider as well.

It’s hard for us to say if that math works out.

“It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement”

You would think, but you could say the same about Facebook and I know from experience that they don’t give a fuck about bots. If anything they actually like the bots because it looks like they have more users.