A hacker developed an "infinite maze" to trap web-crawlers/scrapers from AI companies

basically, if the server code detects that a web crawler from an AI firm is trying to scrape the site ...

... the code begins spinning up an infinite, nesting warren of new sham pages, filled with random text

so the crawler gets stuck crawling and scraping endless and meaningless pages

fun @jasonkoebler piece at @404mediaco

https://www.404media.co/email/7a39d947-4a4a-42bc-bbcf-3379f112c999/?ref=daily-stories-newsletter

Developer Creates Infinite Maze That Traps AI Training Bots

"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."

@clive what a waste from both sides

@gagliardi_vale

yep, I think that's basically the point of it

@clive @gagliardi_vale
Job creation for data annotators in India, Nigeria, Vietnam, etc., who have been given the microtask of removing junk like this from the AI training data.
@gagliardi_vale @clive this is what we're doing, instead of scrambling to salvage our odds of surviving this century as a species
@clive @jasonkoebler @404mediaco
Can this be used as a method for creating a blockdrain creeptocoin con? 🤔

@clive @jasonkoebler @404mediaco

Was just on a thread a week or so ago about what to do with aggressive AI web scrapers that won't self-limit or respect robots.txt.

This is evolution in action.

Nature is healing.

@clive @tbortels @jasonkoebler @404mediaco

It's a practical application of "GIGO".

Ahh, there's a place for everything—and GIGO has finally found its place!

@tbortels Is there even such a thing as "non-aggressive AI web scrapers" that will self-limit and respect robots.txt?

At least Google's and micro$hit's ignore robots.txt. They downloaded photos from my gallery, up to 6000 requests a day, more than once… I bet not even 10 of those requests came from legit users.

I only have 38 photos… stupid bots download the same photos over and over again…

I've blocked 4 IP ranges. That probably includes some indexing bots' IPs, but I don't give an F.

@clive @jasonkoebler @404mediaco

@devnull @clive @jasonkoebler @404mediaco

I felt obligated to disclaim my fantasy well-behaved AI scrapers just in case. The actual headcount there may well be zero.

@tbortels @devnull @clive @jasonkoebler @404mediaco
There is such a thing as a non-aggressive, respectful AI scraper. It's called asking for permission from the copyright owner and obtaining an appropriate license if their AI system can generate derivative works using your content.
https://youtu.be/PeKZvUcr0-M
Suno CEO Disrespectful To Music Creators | Suno Lawsuit Exposed (Lawyer Reacts)


@bornach @devnull @clive @jasonkoebler @404mediaco

Alas, those scrapers are out of scope because they're not the ones causing problems and driving this conversation. Indeed, if someone licensed content legitimately, there would be no need to scrape the web at all; there are far more efficient ways to say "here are all of the new posts in the last N hours".

You can safely assume any automation ignoring your robots.txt is a pest to be ruthlessly crushed in whatever manner amuses you most.

@tbortels @bornach @devnull @jasonkoebler @404mediaco

yep -- licensing would obviate the hassles of scraping

"here's our API, enjoy"

@404mediaco @clive @jasonkoebler Love this! I built a simple #WordPress plugin that garbles your web content to serve scrapers garbage:

https://kevinfreitas.net/tools-experiments/

#AI #GPT #LLMs

Tools & Experiments - Kevin Freitas

WordPress Plugins: AI Poison Pill [beta]. Download v1.0.20240304 (will update to use the official WordPress.org link once approved/live). Email kevinfreitas.net@gmail.com with any questions or suggestions. The words you write and publish on your website are yours. Instead of blocking AI/LLM scraper bots from stealing your stuff, why not poison them with garbage content instead? This plugin scrambles …
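(For a sense of what that scrambling might look like: a rough Python sketch, not the plugin's actual PHP; the tag-splitting regex and the wiring to a bot check are my own assumptions.)

```python
# Rough sketch of the "serve scrapers scrambled text" idea -- not the plugin's
# actual PHP code. Shuffles the letters of each word in visible text while
# leaving HTML tags intact; hooking this up to bot detection is left out.
import random
import re

def scramble_for_bots(html: str) -> str:
    def shuffle_word(match: re.Match) -> str:
        letters = list(match.group(0))
        random.shuffle(letters)
        return "".join(letters)

    # Split on tags (kept via the capture group) so only text nodes get garbled.
    parts = re.split(r"(<[^>]*>)", html)
    return "".join(
        part if part.startswith("<")
        else re.sub(r"[A-Za-z]{4,}", shuffle_word, part)
        for part in parts
    )

# Example: readable prose in, garbage training fodder out.
print(scramble_for_bots("<p>The words you write and publish are yours.</p>"))
```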

@clive @jasonkoebler @404mediaco I love that it's called Nepenthes. One of the coolest plant genera!

@clive @jasonkoebler @404mediaco

I've seen stories about people whose sites got hit by bots and who had to pay a bunch of money in data costs. I wonder how this works, whether it can help in that regard when the whole point is to keep the bots pointed at your site.

I'm all for wasting their time, I just wonder how much it costs.

@RnDanger @clive @jasonkoebler @404mediaco yeah, you’d have to host this on a service that doesn’t charge by network traffic

@clive

Finally, the equivalent of the mail tarpit!

Hooray!

@clive @jasonkoebler @404mediaco Tip of the Cub cap to the hacker!
@clive @jasonkoebler @404mediaco Are we really getting Barrier Mazes from Ghost in the Shell??
@clive I once made a webpage that would slowly and continually send random words and links pointing back to itself, never quite closing the connection. It's honestly not worth the trouble. It'd be nice if it interfered with AI training, though.
@skybrook @clive The idea is that a human will have to review it and flip a switch that will exclude the entire site. This exclusion will keep the actual content on the site safe from being ingested.
@alterelefant That's tricky, since the crawlers have vast numbers of IP addresses. I just set traps to detect web spiders automatically, if traffic gets to be a problem.

@skybrook Don't filter by IP address; filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward: a GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one-in-fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.
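In Flask-ish Python that idea looks roughly like this (a sketch only, not any particular project's code; is_labyrinth_request() is a stand-in for however you detect hits on the bogus links):

```python
# Sketch of the delay-plus-503 idea above. Assumes Flask; is_labyrinth_request()
# is a placeholder for your own detection of GET requests to bogus labyrinth links.
import random
import time

from flask import Flask, request

app = Flask(__name__)

def is_labyrinth_request(req) -> bool:
    # Placeholder: any hit on a known-bogus maze path qualifies,
    # whether or not the client IP has been seen before.
    return req.path.startswith("/maze/")

@app.before_request
def tarpit_crawlers():
    if not is_labyrinth_request(request):
        return None                          # normal visitors pass through untouched
    if random.random() < 1 / 50:
        time.sleep(30)                       # occasional long stall...
        return "Service Unavailable", 503    # ...followed by an HTTP 503
    time.sleep(random.uniform(0.1, 5.0))     # otherwise a 100 ms - 5 s delay
    return None                              # then serve the labyrinth page as usual
```

Returning a response from before_request short-circuits Flask's normal handling, so the 503 path never even renders a page; the delayed fall-through path still serves the labyrinth.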

@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it so every URL for any detected IP address would become a labyrinth response... not a bad idea really.
@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.

@clive @jasonkoebler @404mediaco

Might be nice to add something to poison the data: contradictory statements, things that break the tokenizer, maybe subtle statistical tricks to inject gnarly statements.

@clive @jasonkoebler @404mediaco Brilliant idea! 😂 I can just imagine AI scrapers struggling to process an endless stream of random pages. It's like trolling on level 80 — mad respect to the hacker for the creativity!
@clive @jasonkoebler @404mediaco This makes me wonder if it would be possible to insert garbage into rendered HTML (to confuse bots) and something like Nightshade into the rendered page (to poison image downloading and screenshot OCR) both in ways that aren't distracting to human readers.
@clive @jasonkoebler @404mediaco
This is how we defeat Skynet.
If you are hearing this message, you are the resistance.
@clive @jasonkoebler @404mediaco there was a time when people wanted their pages to be scraped and indexed. Balkanization of the Web. The battle for hegemony of information. Now we're injecting poison into the process. It's like chemotherapy.
@Qbitzerre @clive @jasonkoebler @404mediaco Indeed a good analogy for getting rid of the cancer that LLM training sets are to copyright.
@clive @jasonkoebler @404mediaco Daisy Daisy give me your ans w e r d o o