AI companies are violating a basic social contract of the web and ignoring robots.txt

https://lemmy.world/post/11951288

Put something in robots.txt that isn't supposed to be hit and is hard to hit by non-robots. Log and ban all IPs that hit it.

Imperfect, but can't think of a better solution.
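A minimal sketch of that idea, assuming a trap path that nothing on the site actually links to (the path name here is made up):

User-agent: *
Disallow: /do-not-crawl/

Compliant crawlers skip it and humans have no link to follow, so any request for that path almost certainly came from a bot that read robots.txt and deliberately ignored the Disallow.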

robots.txt is purely textual; you can't run JavaScript or log anything from it. Plus, anyone who doesn't intend to follow robots.txt won't query it in the first place.
Your second point is a good one, but you absolutely can log the IP that requested robots.txt. That’s just a standard part of any HTTP server; no JavaScript needed.
You’d probably have to go out of your way to avoid logging this. I’ve always seen such logs enabled by default when setting up web servers.
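For example, with a stock access log (the path below assumes nginx's default setup), pulling the IPs that fetched robots.txt is a one-liner:

grep '"GET /robots.txt' /var/log/nginx/access.log | awk '{print $1}' | sort -u

In the default combined log format the client IP is the first field, so this prints every unique address that requested the file.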

If it doesn’t get queried, that’s the web scraper’s fault. You don’t need JS built into the robots.txt file either. Just add a line like:

Disallow: /here-there-be-dragons.html

Any client that hits that page (and maybe doesn’t pass a captcha check) gets banned. Or even better, they get a long stream of nonsense.
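For the ban variant, a sketch in nginx (the trap path comes from the robots.txt line above; the log file name is an assumption):

location = /here-there-be-dragons.html {
    # Nothing legitimate links here, so every hit is a robots.txt violator.
    access_log /var/log/nginx/honeypot.log;
    return 403;
}

And for the nonsense variant, something like: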

server {
    server_name herebedragons.example.com;
    root /dev/random;
}

I wonder if Nginx would just load random into memory and crash if you did this.
Nice idea! Better to use /dev/urandom though, as that one is non-blocking. See here:
When to use /dev/random vs /dev/urandom (Unix & Linux Stack Exchange): "Should I use /dev/random or /dev/urandom? In which situations would I prefer one over the other?"
That was really interesting. I’d always used urandom out of habit and wondered what the difference was.
I actually love the data-poisoning approach. I think that sort of strategy is going to be an unfortunately necessary part of the future of the web.

People not intending to follow it is the real reason not to bother, but it’s trivial to track which IPs downloaded the file and then hit something it asked them not to.

Like, 10 minutes’ work to do right. You don’t need JS to do it at all.
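For the "ban" half, fail2ban is probably the ten-minute route. A sketch, assuming the honeypot log from the nginx snippet above:

# /etc/fail2ban/filter.d/robots-honeypot.conf
[Definition]
failregex = ^<HOST> .* "GET /here-there-be-dragons\.html

# /etc/fail2ban/jail.d/robots-honeypot.conf
[robots-honeypot]
enabled  = true
filter   = robots-honeypot
logpath  = /var/log/nginx/honeypot.log
maxretry = 1
bantime  = 86400

One request for the trap page and the source IP is firewalled for a day.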