A hacker developed an "infinite maze" to trap web-crawlers/scrapers from AI companies

basically, if the server code detects that a web crawler from an AI firm is trying to scrape the site ...

... the code begins spinning up an infinite, nesting warren of new sham pages, filled with random text

so the crawler gets stuck crawling and scraping endless and meaningless pages

fun @jasonkoebler piece at @404mediaco

https://www.404media.co/email/7a39d947-4a4a-42bc-bbcf-3379f112c999/?ref=daily-stories-newsletter

Developer Creates Infinite Maze That Traps AI Training Bots

"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."

@clive what a waste from both sides

@gagliardi_vale

yep, I think that's basically the point of it

@clive @gagliardi_vale
Job creation for data annotators in India, Nigeria, Vietnam, etc., who have been given the microtask of removing junk like this from the AI training data.
@gagliardi_vale @clive this is what we're doing, instead of scrambling to salvage our odds of surviving this century as a species
@clive @jasonkoebler @404mediaco
Can this be used as a method for creating a blockdrain creeptocoin con? 🤔

@clive @jasonkoebler @404mediaco

Was just on a thread a week or so ago about what to do with aggressive AI web scrapers that won't self-limit or respect robots.txt.

This is evolution in action.

Nature is healing.

@clive @tbortels @jasonkoebler @404mediaco

It's a practical application of "GIGO".

Ahh, there's a place for everything—and GIGO has finally found its place!

@tbortels Is there even such a thing as "non-aggressive AI web scrapers" that will self-limit and respect robots.txt?

At least Google's and micro$hit's ignore robots.txt. They downloaded photos from my gallery, up to 6000 requests a day, more than once… I bet not even 10 of those requests came from legit users.

I only have 38 photos… stupid bots download the same photos over and over again…

I've blocked 4 IP ranges. That probably includes some indexing bots' IPs, but I don't give an F.

@clive @jasonkoebler @404mediaco

@devnull @clive @jasonkoebler @404mediaco

I felt obligated to disclaim my fantasy well-behaved AI scrapers just in case. The actual headcount there may well be zero.

@tbortels @devnull @clive @jasonkoebler @404mediaco
There is such a thing as a non-aggressive, respectful AI scraper. It's called asking for permission from the copyright owner and obtaining an appropriate license if their AI system can generate derivative works using your content.
https://youtu.be/PeKZvUcr0-M
Suno CEO Disrespectful To Music Creators | Suno Lawsuit Exposed (Lawyer Reacts)


@bornach @devnull @clive @jasonkoebler @404mediaco

Alas, those scrapers are out of scope because they're not the ones causing problems and driving this conversation. Indeed, if someone licensed content legitimately, there would be no need to scrape the web at all; there are far more efficient ways to say "here are all of the new posts in the last N hours".

You can safely assume any automation ignoring your robots.txt is a pest to be ruthlessly crushed in whatever manner amuses you most.

@tbortels @bornach @devnull @jasonkoebler @404mediaco

yep -- licensing would obviate the hassles of scraping

"here's our API, enjoy"

@404mediaco @clive @jasonkoebler Love this! I built a simple #WordPress plugin that garbles your web content to serve scrapers garbage:

https://kevinfreitas.net/tools-experiments/

#AI #GPT #LLMs

Tools & Experiments - Kevin Freitas

WordPress Plugins: AI Poison Pill [beta]. Download v1.0.20240304 (will update to use the official WordPress.org link once approved/live). Email kevinfreitas.net@gmail.com with any questions or suggestions. The words you write and publish on your website are yours. Instead of blocking AI/LLM scraper bots from stealing your stuff, why not poison them with garbage content instead? This plugin scrambles …
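(For a sense of what that scrambling might look like: a rough Python sketch, not the plugin's actual PHP; the tag-splitting regex and the wiring to a bot check are my own assumptions.)

```python
# Rough sketch of the "serve scrapers scrambled text" idea -- not the plugin's
# actual PHP code. Shuffles the letters of each word in visible text while
# leaving HTML tags intact; hooking this up to bot detection is left out.
import random
import re

def scramble_for_bots(html: str) -> str:
    def shuffle_word(match: re.Match) -> str:
        letters = list(match.group(0))
        random.shuffle(letters)
        return "".join(letters)

    # Split on tags (kept via the capture group) so only text nodes get garbled.
    parts = re.split(r"(<[^>]*>)", html)
    return "".join(
        part if part.startswith("<")
        else re.sub(r"[A-Za-z]{4,}", shuffle_word, part)
        for part in parts
    )

# Example: readable prose in, garbage training fodder out.
print(scramble_for_bots("<p>The words you write and publish are yours.</p>"))
```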

@clive @jasonkoebler @404mediaco I love that it's called Nepenthes. One of the coolest plant genera!

@clive @jasonkoebler @404mediaco

I've seen stories about people whose sites got hit by bots and who had to pay a bunch of money in data costs. I wonder how this works, whether it can help in that regard when the whole point is to keep the bots pointed at your site.

I'm all for wasting their time, I just wonder how much it costs.

@RnDanger @clive @jasonkoebler @404mediaco yeah, you’d have to host this on a service that doesn’t charge by network traffic

@clive

Finally, the equivalent of the mail tarpit!

Hooray!

@clive @jasonkoebler @404mediaco Tip of the Cub cap to the hacker!
@clive @jasonkoebler @404mediaco Are we really getting Barrier Mazes from Ghost in the Shell??
@clive I once made a webpage that would slowly and continually send random words and links pointing back to itself, never quite closing the connection. It's honestly not worth the trouble. It'd be nice if it interfered with AI training, though.
@skybrook @clive The idea is that a human will have to review it and flip a switch that will exclude the entire site. This exclusion will keep the actual content on the site safe from being ingested.
@alterelefant That's tricky, since the crawlers have vast numbers of IP addresses. I just set traps to detect web spiders automatically, if traffic gets to be a problem.

@skybrook Don't filter by IP address; filter by behavior. I know, that's sometimes easier said than done.

The following one is straightforward: a GET request to a bogus link in the infinite labyrinth qualifies for a labyrinth response, whether the IP address is known or a new one.

With a labyrinth response I would throw in a random delay between 100 ms and 5 s, and a one-in-fifty chance of a 30 s delay before responding with an HTTP 503. That should usually be enough to slow down crawlers.
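In Flask-ish Python that idea looks roughly like this (a sketch only, not any particular project's code; is_labyrinth_request() is a stand-in for however you detect hits on the bogus links):

```python
# Sketch of the delay-plus-503 idea above. Assumes Flask; is_labyrinth_request()
# is a placeholder for your own detection of GET requests to bogus labyrinth links.
import random
import time

from flask import Flask, request

app = Flask(__name__)

def is_labyrinth_request(req) -> bool:
    # Placeholder: any hit on a known-bogus maze path qualifies,
    # whether or not the client IP has been seen before.
    return req.path.startswith("/maze/")

@app.before_request
def tarpit_crawlers():
    if not is_labyrinth_request(request):
        return None                          # normal visitors pass through untouched
    if random.random() < 1 / 50:
        time.sleep(30)                       # occasional long stall...
        return "Service Unavailable", 503    # ...followed by an HTTP 503
    time.sleep(random.uniform(0.1, 5.0))     # otherwise a 100 ms - 5 s delay
    return None                              # then serve the labyrinth page as usual
```

Returning a response from before_request short-circuits Flask's normal handling, so the 503 path never even renders a page; the delayed fall-through path still serves the labyrinth.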

@alterelefant Well right, that's what I meant by "traps to detect." I didn't think of setting it so every URL for any detected IP address would become a labyrinth response... not a bad idea really.
@skybrook Crawlers that use multiple endpoints to distribute the crawl load will hand out URLs to be crawled to those endpoints. Their freshly acquired labyrinth links will make a new endpoint immediately identifiable.

@clive @jasonkoebler @404mediaco

Might be nice to add something to poison the data: contradictory statements, things that break the tokenizer, maybe subtle statistical tricks to inject gnarly statements.

@clive @jasonkoebler @404mediaco Brilliant idea! 😂 I can just imagine AI scrapers struggling to process an endless stream of random pages. It's like trolling on level 80 — mad respect to the hacker for the creativity!
@clive @jasonkoebler @404mediaco This makes me wonder if it would be possible to insert garbage into rendered HTML (to confuse bots) and something like Nightshade into the rendered page (to poison image downloading and screenshot OCR) both in ways that aren't distracting to human readers.
@clive @jasonkoebler @404mediaco
This is how we defeat Skynet.
If you are hearing this message, you are the resistance.
@clive @jasonkoebler @404mediaco there was a time when people wanted their pages to be scraped and indexed. Balkanization of the Web. The battle for hegemony of information. Now we're injecting poison into the process. It's like chemotherapy.
@Qbitzerre @clive @jasonkoebler @404mediaco Indeed a good analogy for getting rid of the cancer that LLM training sets are to copyright.
@clive @jasonkoebler @404mediaco Daisy Daisy give me your ans w e r d o o