RE: https://tldr.nettime.org/@tante/116605858023186072

Google Search rests on a social contract: their bots can crawl our sites, they can index our sites, and they can show excerpts of our sites because

and •only because•

they send people to our sites. •Our• sites, our words, with our design, with our links, with our context and our aesthetics, shared the way we want to share them.

Google is announcing — unambiguously and with great fanfare — that they are now fully breaking that already-ragged contract. We should reciprocate.

1/2

Paul Cantrell 1d ago

Quick strategy discussion, for those who understand Google indexing and SEO:

If I want to yank a web site out of Google’s now-fully-extractive search, should I (1) disallow googlebot in robots.txt or (2) add `<meta name="googlebot" content="noindex">` to all the page headers?

The goal here is not just to remove my contributions to the commons from Google’s results, but to •make Google aware• that sites are pulling consent. What will best do that?

2/2

Paul Cantrell 1d ago

Same question as the previous post, except for Wikipedia. What would you like to see them do to send a shot across the bow?

Or…well, it’s Wikipedia. Maybe more like a shot to the hull.

3/2

Paul Cantrell 1d ago

Going with meta noindex for now. My thinking is that this actively tells Google to yank already-crawled content from their index, whereas they might take a robots.txt entry to mean “do not update, but keep showing last fetched.”
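
A minimal sketch of the two options discussed above, for anyone who wants to try them. The helper names (`add_noindex`, `crawl_allowed`) and the `example.com` URL are hypothetical; only the meta tag itself comes from the post. It uses Python's standard `urllib.robotparser` to check what a robots.txt entry permits, which says nothing about how Google actually treats pages it has already crawled. One caveat worth keeping in mind: a crawler has to be allowed to fetch a page in order to see its meta tag, so blocking Googlebot in robots.txt at the same time would likely hide the noindex from it.

```python
import re
from urllib.robotparser import RobotFileParser

# Option 1: a robots.txt that blocks only Googlebot from crawling.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

# Option 2: the meta tag from the post, asking Googlebot not to index the page.
NOINDEX_TAG = '<meta name="googlebot" content="noindex">'

# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_noindex(html: str) -> str:
    """Insert the noindex tag right after the opening <head> tag (hypothetical helper)."""
    if NOINDEX_TAG in html:
        return html  # already tagged; do nothing
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> found; leave the page untouched
    return html[: match.end()] + NOINDEX_TAG + html[match.end():]


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "<html><head><title>My post</title></head><body>Hello</body></html>"
    print(add_noindex(page))
    print(crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/post"))
```

The robots.txt entry here names Googlebot only, so other crawlers are unaffected; the meta tag is likewise addressed to `googlebot` rather than to all robots.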

Paul Cantrell 1d ago

OK, a •lot• of replies need this response:

Yes, of •course• they will start ignoring robots.txt etc. as soon as they think it hurts their business. Of course.

It is important to •force that fight•, rather than just capitulating in advance.

Mathaetaes 1d ago

@inthehands I know of at least one professional artist who has deliberately poisoned their images in an attempt to deter AI scraping (mostly because the scrapers blast her small site and effectively DoS it). Scrapers that follow robots.txt aren't affected... but these ones were already ignoring robots.txt.

I just read an IARPA paper that said poisoning as little as 0.1% of training data can disrupt a model. If content creators choose to deliberately poison content that they ask not to be scraped, it might be a nice way to deter bad behavior.

The tools I know of work on imagery, but with effort people may come up with stuff that works on data as well. E.g., burying base64-encoded malicious prompts in your text, posting tables as poisoned images rather than text, etc.

Seems like we should start organizing and taking firm action now, before AI companies start buying politicians and making such defenses illegal.

Raul Matias

@mathaetaes @inthehands

> posting tables as poisoned images rather than text

Please **never** do that. Accessibility is more important than poisoning LLMs.

Mathaetaes 1d ago

@raulmatias @inthehands Ooh, good point - I had completely forgotten about screen readers in that context.