Miasma: A tool to trap AI web scrapers in an endless poison pit
https://github.com/austin-weeks/miasma
#HackerNews #Miasma #AI #web #scrapers #Endless #pit #Tech #innovation #Open #source
🔗 https://stephvee.ca/blog/updates/the-scraping-problem-is-worse-than-i-thought/
The extreme amount of unethical #scraping that's occurring all over the web right now *definitely* won't be solved by limiting nice features for good-faith visitors; for that reason, I've reinstated my full-text RSS feed. Apologies for truncating it in the first place -- that was dumb of me. More about that in my latest blog post.
No outages in the latest Apache logs. However, there is plenty of suspicious activity.
The log has 16,033 lines.
Of these, 1,559 lines feature the "RecentChanges" function for my wikis. Which is something regular users _might_ call up from time to time, but I suspect that #scrapers are the more likely culprits.
The vast majority of these requests come from a random assortment of IP addresses, and they usually end with something along the lines of:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
So yeah, "anonymous botnets scraping the Interwebs for nefarious purposes" would be my first guess.
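The counting described above can be sketched in a few lines. This is not the author's actual tooling, just a minimal illustration of tallying "RecentChanges" hits and the distinct client IPs behind them in an Apache combined-format log ("access.log" is an assumed filename):

```python
# Sketch: count "RecentChanges" requests in an Apache combined log
# and tally how many came from each client IP. Illustrative only.
from collections import Counter

def tally_recentchanges(log_path="access.log"):
    hits = 0
    ips = Counter()
    with open(log_path) as f:
        for line in f:
            if "RecentChanges" in line:
                hits += 1
                # In the combined log format, the client IP is the first field
                ips[line.split()[0]] += 1
    return hits, ips
```

A long tail of single-hit IPs, as seen here, is itself a hint that a distributed scraper rather than a handful of regular readers is behind the traffic.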
Army of Bots
For some months now I have had a simple detection mechanism for "bad" bots in place. Bots that scrape *everything* they find and are very likely vacuuming up all the content they get to feed the data grinders that train the LLMs of the world. Bots that not only ignore the "robots.txt" protocol, but actively treat entries in the robots.txt file as an invitation to visit the content that is listed there as "disallowed".
I always had a hunch that stating addresses in a publicly reachable text file and flagging them as "please stay out of there" wasn't the best idea, but well, it was the only thing we had back in the days when the only bots out there were the crawlers of the search engines.
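To make the problem concrete: a robots.txt file is just a public list of paths, readable by anyone. A minimal example along these lines (paths are illustrative, not the author's):

```
User-agent: *
Disallow: /private/
Disallow: /drafts/
```

A well-behaved crawler skips those paths; a badly behaved one now has a helpful map of exactly where the site owner didn't want it to go.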
(…) There are two important considerations when using /robots.txt:

Now, with all the content-sucking and scraping that the "AI" corporations let loose on the web, it is not unusual to have a massive spike in bot-related visits even in the personal-website space. And those scrapers are ruthless: they hammer the servers at high frequency and repeatedly, and are killing the web as we know it along the way.
(…) Many of these scrapers are so sophisticated that it is hard, or impossible, to detect them in action. They often ignore the websites' programmatic pleas not to be scraped, and are known to hit the more fragile parts of a website repeatedly. (opendemocracy.net)

I created a directory with a random name at the top level of my website.
I then added this directory to the robots.txt file with a disallow. This directory is not linked anywhere. Its name is so random and cryptic that it is highly unlikely that a "name guessing" bot will find it (like those exploit-searching idiot scripts that hammer on "wp-admin" or "typo3" URLs even on sites that don't use WordPress or TYPO3…). Inside the directory is an index script that
a) sends me an email,
b) logs the visit with user-agent-string and IP address and
c) saves the data in a nosql db.
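The logging half of that honeypot script might look like the sketch below. This is an assumption-laden illustration, not the author's code: the email step is omitted, and a flat JSON file (`bad_bots.json`, an invented name) stands in for the NoSQL database.

```python
# Hypothetical sketch of the honeypot "index script": record the
# visitor's IP and user agent so the front controller can block them
# later. A flat JSON file stands in for the NoSQL store.
import datetime
import json
import os

def record_visit(ip, user_agent, db_path="bad_bots.json"):
    entry = {
        "ip": ip,
        "user_agent": user_agent,
        "seen": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Load any previously caught bots, append, and write back
    records = []
    if os.path.exists(db_path):
        with open(db_path) as f:
            records = json.load(f)
    records.append(entry)
    with open(db_path, "w") as f:
        json.dump(records, f)
    return entry
```

The key property is that *only* visitors who ignored the disallow ever reach this script, so every entry in the store is, by construction, a bad actor.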
In front of my website I have a script that checks the current visitor's IP address against the NoSQL database, and if the IP matches, an HTTP 403 status is served.
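That front-of-site check could be sketched as follows, again under the same illustrative assumptions (a flat JSON file named `bad_bots.json` in place of the real NoSQL database, and invented function names):

```python
# Hypothetical front-controller check: refuse service to any IP
# previously caught in the honeypot directory.
import json
import os

def is_blocked(ip, db_path="bad_bots.json"):
    if not os.path.exists(db_path):
        return False
    with open(db_path) as f:
        records = json.load(f)
    return any(r["ip"] == ip for r in records)

def handle_request(ip, db_path="bad_bots.json"):
    if is_blocked(ip, db_path):
        return 403  # trapped bot: forbidden
    return 200      # normal visitor: proceed
```

In a real deployment this lookup would sit in the web server or application entry point, before any page rendering happens, so blocked scrapers cost as little as possible.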
Here's a best-of selection of user agent strings that recently "visited" my hidden dir.
PetalBot
Googlebot/2.1
Claude-SearchBot/1.0
Thinkbot/0 +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.
That last one is superb, considering that it alone appears several times in my log, of course with a different IP each time.
Plus, there's a load more that pretend to be "normal" web browsers, of course. 🙄
It is a crude, symbolic, fist-shaking-at-clouds kind of thing, especially compared to the things that Matthias Ott shared in his post, but it is better than nothing.
Dear friends,
You may have noticed that our website is often unavailable. We are facing a massive load from anonymous scrapers, with over 30,000 IP addresses sending requests every day. We are fighting against AI bots! To find out more: https://velvetyne.fr/news/they-are-trying-to-kill-the-free-web/
Long Time No Post
A traditional rant, as one would expect after this kind of delay.

@puniko after all, @MattKC / #MattKC got #DDoS'd by #ByteDance (the creators of #TikTok) despite paying #ClownFlare protection money.
