Can you guess when we deployed some bot-blocking middleware to #FromThePage?
attn @ewg118
@benwbrum I reported it to Amazon, but Amazon has poured billions of dollars into that AI company, so I don't expect much activity. Someone speculated it's what brought down the Internet Archive a week or two ago.
@benwbrum What are you using? We’ve been playing whack-a-mole with AI scrapers :/

@anindita We run a Ruby on Rails stack and use the rack-attack gem, which worked very nicely.

Here's our configuration, which is pretty heavy-handed: https://github.com/benwbrum/fromthepage/blob/development/config/initializers/rack_attack.rb

(I would actually like to allow FromThePage public collections to be scraped by LLM spiders, but not at the cost of bringing down the server for everyone.)
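
For anyone who wants a starting point without reading the linked file, here is a minimal sketch of a rack-attack initializer. It is not FromThePage's actual configuration: the user-agent list, the `blocked_agent?` helper, and the 60-requests-per-minute limit are all assumptions chosen for illustration. Only `Rack::Attack.blocklist` and `Rack::Attack.throttle` come from the gem itself.

```ruby
# config/initializers/rack_attack.rb
# Illustrative sketch only; not the configuration linked above.

# Hypothetical list of user-agent substrings to refuse. Build yours from your own logs.
BLOCKED_AGENT_PATTERN = /GPTBot|ClaudeBot|CCBot|Bytespider/i

# Hypothetical helper: true when the User-Agent header matches the pattern above.
def blocked_agent?(user_agent)
  BLOCKED_AGENT_PATTERN.match?(user_agent.to_s)
end

# Guarded so this file still loads where the rack-attack gem is not installed.
if defined?(Rack::Attack)
  # Refuse matching crawlers outright (rack-attack responds with 403 by default).
  Rack::Attack.blocklist("block AI crawlers") do |req|
    blocked_agent?(req.user_agent)
  end

  # Assumed limit: at most 60 requests per 60 seconds per IP address.
  # Requests over the limit receive 429 by default.
  Rack::Attack.throttle("requests by ip", limit: 60, period: 60) do |req|
    req.ip
  end
end
```

A blocklist alone only stops crawlers that identify themselves honestly; the per-IP throttle is there as a backstop for the ones that don't.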

@benwbrum Hah! Thanks :)
@anindita Our friends at the American Archive of Public Broadcasting--whose servers were actually brought down by these bots--did filtering at the Apache level, and I can probably get them to share their configurations if you're interested.
@benwbrum We’re looking into a possible Cloudflare solution as it’s gotten pretty brutal recently. I’ll check with Lorin and get back to you — I appreciate the offer!
@anindita It was brutal for us as well. Very nearly took us down for a week solid, due to two different (and apparently independent) spiders.
@benwbrum Ugh — yes. This is very familiar.