Should you be wondering why @LWN #LWN is occasionally sluggish... since the new year, the DDoS onslaughts from AI-scraper bots have picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.

This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.

Happy new year :)
@corbet @LWN I feel your pain so much right now.
@monsieuricon @LWN @corbet are you implying that there are models busy being trained to call someone a fuckface over a misunderstanding of some obscure ARM coprocessor register, or to respond with Viro insults to the most unsuspecting victims?
@corbet 100% agree. Hosting company MD here, we've seen a massive uptick in AI bullshit. And they don't even respect robots.txt like the better search engines do.
@corbet @LWN in our experience you should prepare for thousands of distinct IPs.
@beasts @LWN We are indeed seeing that sort of pattern; each IP stays below the thresholds for our existing circuit breakers, but the aggregate load is overwhelming. Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick.
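One way to find candidate subnets is to aggregate the distinct client IPs from the access log into prefixes and flag any prefix with an implausible number of sources. A minimal sketch in Python; the /24 granularity and the 50-address threshold are assumptions to tune against real logs:

```python
import ipaddress
from collections import Counter

def subnet_candidates(ips, prefix=24, min_addrs=50):
    """Group distinct IPv4 client addresses into /prefix subnets and
    return the subnets contributing an unusually large number of
    distinct source addresses."""
    nets = Counter()
    for ip in set(ips):  # count each address once, not each request
        addr = ipaddress.ip_address(ip)
        if addr.version == 4:
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            nets[net] += 1
    return [net for net, n in nets.items() if n >= min_addrs]
```

Feed it the client-IP column of the access log; the resulting CIDR blocks can then go into a firewall set instead of thousands of per-IP rules.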
@corbet @LWN @beasts JS challenges somewhat work, at the cost of accessibility for JS-free browsers
@corbet @LWN @beasts
Large amounts coming from Huawei Cloud ASNs, trying to spider every possible GET parameter?

@corbet @LWN @beasts https://ip-tool.qweb.co.uk has buttons for generating htaccess, nginx, and iptables configs for entire network blocks. Just paste a malicious IP in, tap the htaccess button, and paste into your htaccess file, for example.

Also helps to have Fail2Ban set up to firewall anything that triggers too many 403s, so that htaccess blocks on one site become server wide blocks protecting all sites.

My general rate limiter for nginx is useful too: https://github.com/qwebltd/Useful-scripts/blob/main/Bash%20scripts%20for%20Linux/nginx-rate-limiting.conf
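For anyone who wants the general shape of a per-IP nginx rate limit without following the link, a minimal sketch (the zone name, rate, and burst values are placeholders to tune for your own traffic):

```nginx
# Shared zone keyed on client IP; 10 MB tracks on the order of 160k addresses.
limit_req_zone $binary_remote_addr zone=perip:10m rate=30r/m;

server {
    location / {
        # Allow short bursts, reject the excess with 429 rather than 503.
        limit_req zone=perip burst=10 nodelay;
        limit_req_status 429;
    }
}
```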

IP Tool

A handy IP lookup tool to find your own IP address, or information about any IPv4 or IPv6 IP. Hostname, CIDR range, AS, ASN, country and city geolocation, and other IPs owned by the same host.

@corbet @LWN same here in #fedora infra. I had to block a bunch this morning to keep pagure.io usable. 😒
@nirik @corbet @LWN perhaps you can start sharing lists?
@corbet @LWN sounds like you need an AI poisoner like Nepenthes or iocaine.
@johnefrancis @LWN Something like nepenthes (https://zadzmo.org/code/nepenthes/) has crossed my mind; it has its own risks, though. We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN. Tempting.
Nepenthes - ZADZMO.org

Making web crawlers eat shit since 2023

@corbet @johnefrancis @LWN I'm dealing with a similar issue now (though likely at a smaller scale than LWN!), and I found that leading crawlers into a maze helped a lot in discovering UAs and IP ranges that misbehave. Anyone who spends an unreasonable time in the maze gets rate limited, and served garbage.

So far, the results are very good. I can recommend a similar strategy.

Happy to share details and logs, and whatnot, if you're interested. LWN is a fantastic resource, and AI crawlers don't deserve to see it.

@algernon @corbet @johnefrancis @LWN thank you for offering to protect a thing I love
@algernon @corbet @johnefrancis @LWN Can it be turned into a traefik middleware and released under AGPL on codeberg?
@corbet @johnefrancis @LWN You could also make this structure of pages several levels deep and once you are at a level where no living human would reasonably go just automatically add that IP to the blocklist (and share it with others).
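The depth-based blocklisting idea could look something like this; the maximum-depth value is a made-up threshold, and the page handler wiring is left out:

```python
from collections import defaultdict

MAX_HUMAN_DEPTH = 5  # assumption: no human digs deeper than this into the maze

depth_seen = defaultdict(int)
blocklist = set()

def visit(ip, depth):
    """Record a hit on a trap page at the given depth; block any IP
    that descends deeper than a plausible human would.  Returns True
    if the IP is now on the blocklist."""
    depth_seen[ip] = max(depth_seen[ip], depth)
    if depth_seen[ip] > MAX_HUMAN_DEPTH:
        blocklist.add(ip)
    return ip in blocklist
```

The resulting blocklist could then be exported periodically for sharing with other sites.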

> We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN.

What are you waiting for, @corbet? πŸ˜‰

@LWN

Thank you @corbet and all at @LWN for continuing the work of providing the excellent #LWN.

The "active defenses" against torrents of antisocial web-scraping bots have bad effects on users. They tend to be "if you don't allow JavaScript and cookies, you can't visit the site", even if the site itself works fine without them.

I don't have a better defense to offer, but it's really closing off huge portions of the web that would otherwise be fine for secure browsers.

It sucks. Sorry, and thank you.

@bignose @LWN We have gone far out of our way to never require JavaScript to read LWN; we're not going back on that now.

@corbet @LWN I think we should start doing what the internet can do best: Collaborate on these things.

I see this on my services, Xe recently saw the same. https://xeiaso.net/notes/2025/amazon-crawler/ (and built a solution https://xeiaso.net/blog/2025/anubis/)

There is https://zadzmo.org/code/nepenthes/

I would love to see some kind of effort to map out bot IPs and get a public block list. I'm tired of their nonsense.

Amazon's AI crawler is making my git server unstable

Please, just stop.

@sheogorath @corbet @LWN I agree.... this is very similar to the early days of antispam, IMO. I wonder if there's a way to detect abusive scraping (via hits on hidden links, etc.) and publish to a shared DNS blocklist?

@corbet @LWN

"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick. "

if you're using iptables, ipset can block individual ips (hash:ip), and subnets (hash:net).

Just set it up last night for my much-smaller-traffic instances, feel free to DM

https://ipset.netfilter.org/
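A convenient way to load a whole batch of subnets is an `ipset restore` file; a small generator sketch (the set name "scrapers" is an assumption):

```python
def ipset_restore_lines(subnets, setname="scrapers"):
    """Emit lines consumable by `ipset restore`: create a hash:net set,
    then add each CIDR block to it.  -exist makes the load idempotent."""
    lines = [f"create {setname} hash:net -exist"]
    lines += [f"add {setname} {net} -exist" for net in subnets]
    return "\n".join(lines) + "\n"

# Pipe the output into `ipset restore`, then match the whole set with a
# single iptables rule:
#   iptables -I INPUT -m set --match-set scrapers src -j DROP
```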

IP sets

ipset

@adelie @LWN Blocking a subnet is not hard; the harder part is figuring out *which* subnets without just blocking huge parts of the net as a whole.
@corbet @adelie @LWN I have been using pyasn to block entire subnets. It's effective, but only in the same way carpet bombing is. I'm sure I've blocked legitimate systems, but c'est la vie.

@corbet @LWN

Probably a good question for the fedi as a whole. I started with any 40x response in my logs, added any Spamhaus hits from my mail server, and any user agents with "bot" in the name. Plus Facebook in particular has huge IPv4 blocks just for scraping, which are also easy to block.
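In nginx, the user-agent part can be done with a `map`; a sketch, keeping in mind that this only catches crawlers honest enough to announce themselves (the listed agent names are examples, not a complete list):

```nginx
# Flag known scraper user agents, then refuse them before any real work.
map $http_user_agent $blocked_ua {
    default                0;
    ~*GPTBot               1;
    ~*ClaudeBot            1;
    ~*Bytespider           1;
    ~*facebookexternalhit  1;
}

server {
    if ($blocked_ua) {
        return 403;
    }
}
```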

ASRG (@[email protected])

Attached: 1 image ## **Sabot in the Age of AI** A list of offensive methods & strategic approaches for facilitating (algorithmic) sabotage, framework disruption, & intentional data poisoning. ### **Selected Tools & Frameworks** - **Nepenthes** — [Endless crawler trap.](https://zadzmo.org/code/nepenthes) - **Babble** — [Standalone LLM crawler tarpit.](https://git.jsbarretto.com/zesterer/babble) - **Markov Tarpit** — [Traps AI bots & feeds them useless data.](https://git.rys.io/libre/markov-tarpit) - **Sarracenia** — [Loops bots into fake pages.](https://github.com/CTAG07/Sarracenia) - **Antlion** — [Express.js middleware for infinite sinkholes.](https://github.com/shsiena/antlion) - **Infinite Slop** — [Garbage web page generator.](https://code.blicky.net/yorhel/infinite-slop) - **Poison the WeLLMs** — [Reverse proxy for LLM confusion.](https://codeberg.org/MikeCoats/poison-the-wellms) - **Marko** — [Dissociated Press CLI/lib.](https://codeberg.org/timmc/marko/) - **django-llm-poison** — [Serves poisoned content to crawlers.](https://github.com/Fingel/django-llm-poison) - **konterfAI** — [Model-poisoner for LLMs.](https://codeberg.org/konterfai/konterfai) - **Quixotic** — [Static site LLM confuser.](https://marcusb.org/hacks/quixotic.html) - **toxicAInt** — [Replaces text with slop.](https://github.com/portasynthinca3/toxicaint) - **Iocaine** — [Defense against unwanted scrapers.](https://iocaine.madhouse-project.org) - **Caddy Defender** — [Blocks bots & pollutes training data.](https://defender.jasoncameron.dev) - **GzipChunk** — [Inserts compressed junk into live gzip streams.](https://github.com/gw1urf/gzipchunk) - **Chunchunmaru** — [Go-based web scraper tarpit.](https://github.com/BrandenStoberReal/Chunchunmaru) - **IED** — [ZIP bombs for web scrapers.](https://github.com/NateChoe1/ied) - **FakeJPEG** — [Endless fake JPEGs.](https://github.com/gw1urf/fakejpeg) - **Pyison** — [AI crawler tarpit.](https://github.com/JonasLong/Pyison) - **HalluciGen** — [WP plugin that scrambles content.](https://codeberg.org/emergentdigitalmedia/HalluciGen) - **Spigot** — [Hierarchical Markov page generator.](https://github.com/gw1urf/spigot) --- *This is a living resource — regularly updated to reflect the shifting terrain of collective techno-disobedience and algorithmic Luddism.*

tldr.nettime

@corbet @LWN You know, what we need is a clearinghouse for this like there are for piholes and porn and such. Could someone with some followers get #AIblacklist trending?

Post your subnets with that hashtag. If we get any traction, I'll host the list.

@corbet @LWN would you be so kind as to write up whatever mitigations you come up with? I've been fighting this myself on our websites. You seeing semi-random user agents too?
@RonnyAdsetts @LWN The user agent field is pure fiction for most of this traffic.
@corbet @LWN Do you see a lot of pointlessly redundant requests? I see a lot of related-seeming IPs request the same pages over and over.
@AndresFreundTec @LWN Yes, a lot of really silly traffic. About 1/3 of it results in redirects from bots hitting port 80; you don't see them coming back with TLS, they just keep pounding their heads against the same wall.

It is weird; somebody has clearly put some thought into creating a distributed source of traffic that avoids tripping the per-IP circuit breakers. But the rest of it is brainless.
@corbet @LWN @AndresFreundTec Maybe the bot wrote the code itself?
OpenAI's ChatGPT crawler can be tricked into DDoSing sites, answering your queries

The S in LLM stands for Security

The Register
@corbet @LWN @AndresFreundTec Maybe it's sabotage internally, so it's not /quite/ as bad. That's what I'd do.
iocaine

The deadliest poison known to AI.

MadHouse Git Repositories

@corbet @LWN

Check out Nepenthes in defensive mode.

@[email protected] @LWN

Time to set up AI-poisoning bots.

The really great part of this BS is that if you're not a hyperscale social media platform, your ability to afford adequate defenses is going to be awful.
@[email protected] @LWN Sounds awful. You should consider setting up something like cloudflare or deflect.

@corbet

> Sabot in the Age of AI
> Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.

https://tldr.nettime.org/@asrg/113867412641585520

@LWN @renchap

ASRG (@[email protected])

Attached: 1 image ## **Sabot in the Age of AI** A list of offensive methods & strategic approaches for facilitating (algorithmic) sabotage, framework disruption, & intentional data poisoning. ### **Selected Tools & Frameworks** - **Nepenthes** β€” [Endless crawler trap.](https://zadzmo.org/code/nepenthes) - **Babble** β€” [Standalone LLM crawler tarpit.](https://git.jsbarretto.com/zesterer/babble) - **Markov Tarpit** β€” [Traps AI bots & feeds them useless data.](https://git.rys.io/libre/markov-tarpit) - **Sarracenia** β€” [Loops bots into fake pages.](https://github.com/CTAG07/Sarracenia) - **Antlion** β€” [Express.js middleware for infinite sinkholes.](https://github.com/shsiena/antlion) - **Infinite Slop** β€” [Garbage web page generator.](https://code.blicky.net/yorhel/infinite-slop) - **Poison the WeLLMs** β€” [Reverse proxy for LLM confusion.](https://codeberg.org/MikeCoats/poison-the-wellms) - **Marko** β€” [Dissociated Press CLI/lib.](https://codeberg.org/timmc/marko/) - **django-llm-poison** β€” [Serves poisoned content to crawlers.](https://github.com/Fingel/django-llm-poison) - **konterfAI** β€” [Model-poisoner for LLMs.](https://codeberg.org/konterfai/konterfai) - **Quixotic** β€” [Static site LLM confuser.](https://marcusb.org/hacks/quixotic.html) - **toxicAInt** β€” [Replaces text with slop.](https://github.com/portasynthinca3/toxicaint) - **Iocaine** β€” [Defense against unwanted scrapers.](https://iocaine.madhouse-project.org) - **Caddy Defender** β€” [Blocks bots & pollutes training data.](https://defender.jasoncameron.dev) - **GzipChunk** β€” [Inserts compressed junk into live gzip streams.](https://github.com/gw1urf/gzipchunk) - **Chunchunmaru** β€” [Go-based web scraper tarpit.](https://github.com/BrandenStoberReal/Chunchunmaru) - **IED** β€” [ZIP bombs for web scrapers.](https://github.com/NateChoe1/ied) - **FakeJPEG** β€” [Endless fake JPEGs.](https://github.com/gw1urf/fakejpeg) - **Pyison** β€” [AI crawler 
tarpit.](https://github.com/JonasLong/Pyison) - **HalluciGen** β€” [WP plugin that scrambles content.](https://codeberg.org/emergentdigitalmedia/HalluciGen) - **Spigot** β€” [Hierarchical Markov page generator.](https://github.com/gw1urf/spigot) --- *This is a living resource β€” regularly updated to reflect the shifting terrain of collective techno-disobedience and algorithmic Luddism.*

tldr.nettime

@corbet @LWN Yup, my servers too. Sometimes GPTBot as the UserAgent, but often not.

The AI bullshit merchants are slowly killing the web. 

@corbet
In my timeline your post appeared directly beneath this one https://tldr.nettime.org/@asrg/113867412641585520 Coincidence????
@LWN
ASRG (@[email protected])

Attached: 1 image ## **Sabot in the Age of AI** A list of offensive methods & strategic approaches for facilitating (algorithmic) sabotage, framework disruption, & intentional data poisoning. ### **Selected Tools & Frameworks** - **Nepenthes** β€” [Endless crawler trap.](https://zadzmo.org/code/nepenthes) - **Babble** β€” [Standalone LLM crawler tarpit.](https://git.jsbarretto.com/zesterer/babble) - **Markov Tarpit** β€” [Traps AI bots & feeds them useless data.](https://git.rys.io/libre/markov-tarpit) - **Sarracenia** β€” [Loops bots into fake pages.](https://github.com/CTAG07/Sarracenia) - **Antlion** β€” [Express.js middleware for infinite sinkholes.](https://github.com/shsiena/antlion) - **Infinite Slop** β€” [Garbage web page generator.](https://code.blicky.net/yorhel/infinite-slop) - **Poison the WeLLMs** β€” [Reverse proxy for LLM confusion.](https://codeberg.org/MikeCoats/poison-the-wellms) - **Marko** β€” [Dissociated Press CLI/lib.](https://codeberg.org/timmc/marko/) - **django-llm-poison** β€” [Serves poisoned content to crawlers.](https://github.com/Fingel/django-llm-poison) - **konterfAI** β€” [Model-poisoner for LLMs.](https://codeberg.org/konterfai/konterfai) - **Quixotic** β€” [Static site LLM confuser.](https://marcusb.org/hacks/quixotic.html) - **toxicAInt** β€” [Replaces text with slop.](https://github.com/portasynthinca3/toxicaint) - **Iocaine** β€” [Defense against unwanted scrapers.](https://iocaine.madhouse-project.org) - **Caddy Defender** β€” [Blocks bots & pollutes training data.](https://defender.jasoncameron.dev) - **GzipChunk** β€” [Inserts compressed junk into live gzip streams.](https://github.com/gw1urf/gzipchunk) - **Chunchunmaru** β€” [Go-based web scraper tarpit.](https://github.com/BrandenStoberReal/Chunchunmaru) - **IED** β€” [ZIP bombs for web scrapers.](https://github.com/NateChoe1/ied) - **FakeJPEG** β€” [Endless fake JPEGs.](https://github.com/gw1urf/fakejpeg) - **Pyison** β€” [AI crawler 
tarpit.](https://github.com/JonasLong/Pyison) - **HalluciGen** β€” [WP plugin that scrambles content.](https://codeberg.org/emergentdigitalmedia/HalluciGen) - **Spigot** β€” [Hierarchical Markov page generator.](https://github.com/gw1urf/spigot) --- *This is a living resource β€” regularly updated to reflect the shifting terrain of collective techno-disobedience and algorithmic Luddism.*

tldr.nettime

@corbet @LWN i'm not sure if you've already got a strategy for dealing with the scrapers in mind, but if not -

dialup.cafe's running on nginx, and this has worked well for me so far:
https://rknight.me/blog/blocking-bots-with-nginx/

an apache translation of that using .htaccess would be possible as well.

Blocking Bots with Nginx

How I've automated updating the bot list to block access to my site

@corbet @LWN it's disgusting to find the LLM companies using these disguised scraping practices. Clearly they recognise that they are acting abusively
@corbet @LWN Looking forward to the Grumpy Editor article on dealing with AI scraping bots!
@corbet @LWN the amount of disrespect AI companies have regarding the web is outrageous and very dangerous, especially considering the web is exactly the essence from which they draw sustenance. As very accurately described by Freya Holmér in [this video](https://youtu.be/-opBifFfsMY?feature=shared), it's just like a parasitic cancer: it leeches off the very thing it requires to grow, and its metric of success is how well it can deceive us into believing it is human, whether it's generating content or sucking up bandwidth
Generative AI is a Parasitic Cancer

YouTube

@corbet
I'll be happy to hear about the solutions you end up finding / or not

Good luck on that matter.
@LWN

@corbet
This just came into my timeline, in case it helps : https://tldr.nettime.org/@asrg/113867412641585520
@LWN
ASRG (@[email protected])

Attached: 1 image ## **Sabot in the Age of AI** A list of offensive methods & strategic approaches for facilitating (algorithmic) sabotage, framework disruption, & intentional data poisoning. ### **Selected Tools & Frameworks** - **Nepenthes** β€” [Endless crawler trap.](https://zadzmo.org/code/nepenthes) - **Babble** β€” [Standalone LLM crawler tarpit.](https://git.jsbarretto.com/zesterer/babble) - **Markov Tarpit** β€” [Traps AI bots & feeds them useless data.](https://git.rys.io/libre/markov-tarpit) - **Sarracenia** β€” [Loops bots into fake pages.](https://github.com/CTAG07/Sarracenia) - **Antlion** β€” [Express.js middleware for infinite sinkholes.](https://github.com/shsiena/antlion) - **Infinite Slop** β€” [Garbage web page generator.](https://code.blicky.net/yorhel/infinite-slop) - **Poison the WeLLMs** β€” [Reverse proxy for LLM confusion.](https://codeberg.org/MikeCoats/poison-the-wellms) - **Marko** β€” [Dissociated Press CLI/lib.](https://codeberg.org/timmc/marko/) - **django-llm-poison** β€” [Serves poisoned content to crawlers.](https://github.com/Fingel/django-llm-poison) - **konterfAI** β€” [Model-poisoner for LLMs.](https://codeberg.org/konterfai/konterfai) - **Quixotic** β€” [Static site LLM confuser.](https://marcusb.org/hacks/quixotic.html) - **toxicAInt** β€” [Replaces text with slop.](https://github.com/portasynthinca3/toxicaint) - **Iocaine** β€” [Defense against unwanted scrapers.](https://iocaine.madhouse-project.org) - **Caddy Defender** β€” [Blocks bots & pollutes training data.](https://defender.jasoncameron.dev) - **GzipChunk** β€” [Inserts compressed junk into live gzip streams.](https://github.com/gw1urf/gzipchunk) - **Chunchunmaru** β€” [Go-based web scraper tarpit.](https://github.com/BrandenStoberReal/Chunchunmaru) - **IED** β€” [ZIP bombs for web scrapers.](https://github.com/NateChoe1/ied) - **FakeJPEG** β€” [Endless fake JPEGs.](https://github.com/gw1urf/fakejpeg) - **Pyison** β€” [AI crawler 
tarpit.](https://github.com/JonasLong/Pyison) - **HalluciGen** β€” [WP plugin that scrambles content.](https://codeberg.org/emergentdigitalmedia/HalluciGen) - **Spigot** β€” [Hierarchical Markov page generator.](https://github.com/gw1urf/spigot) --- *This is a living resource β€” regularly updated to reflect the shifting terrain of collective techno-disobedience and algorithmic Luddism.*

tldr.nettime
@corbet @LWN Same for KDE gitlab instance. It's a pain :(
@corbet @LWN I know Cloudflare has some fashion of AI-blocking doodad, it might be worth looking into that?
@corbet @LWN I have resorted to a wide swath of blocks, in ByteDance's case blocking entire ASNs (most recently all of Meta). Other wide blocks are on the user agent. Ironically, my big load spikes are now from a huge number of servers running ActivityPub whenever one of my sites is linked to!
@corbet @LWN I should mention that the ASN is available through ipinfo.io. If you're working in PHP, @abivia has a library for it.

@corbet @LWN I'm sorry, I know this is a pain in the butt to deal with and that it's kind of demoralizing.

Is there anything I can do to help? I'm already a subscriber, and a very happy one; but if it'll diminish the demoralization at all, I really appreciate that you're tackling this problem. Can I get you a pizza or something?

@corbet @LWN I sympathize, it's an exasperating problem. I've found microcaching all public facing content to be extremely effective.

- The web server sits behind a proxy micro cache
- Preseed the cache by crawling every valid public path
- Configure the valid life of cached content to be as short as you want
- Critically, ensure that every request is always served stale cached content while the cache leisurely repopulates with a fresh copy. This eliminates bot overwhelm by decoupling the incoming request rate, from any IP, from the rate of requests hitting the upstream
- Rather than blocking aggressive crawlers, configure rate limiting tuned to the maximum plausible human request rate
- For bots with known user agents, plus those detected by profiling their traffic, divert all their requests to a duplicate long lived cache that never invokes the upstream for updates

Micro caching has saved us thousands on compute, and eliminated weekly outages caused by abusive bots. It's open to significant tuning to improve efficiency based on your content.
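The always-serve-stale behavior described above can be sketched in nginx; the zone name, timings, and the `backend` upstream are placeholders:

```nginx
# Micro-cache: everything is cacheable for a few seconds, and stale
# content is served while a single background request refreshes it.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=micro:10m inactive=10m;

server {
    location / {
        proxy_cache micro;
        proxy_cache_valid 200 301 5s;     # short fresh lifetime
        proxy_cache_use_stale updating error timeout http_500 http_502 http_503;
        proxy_cache_background_update on; # refresh off the request path
        proxy_cache_lock on;              # collapse concurrent misses
        proxy_pass http://backend;
    }
}
```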

Shout out to the infrastructure team at [email protected] - a blog post they published 9 years ago (now long gone) described this approach.

#nginx #cache

@corbet @LWN
Maybe try implementing some sort of CAPTCHA system where a user or user agent has to prove that they're human in order to use the site.