RE: https://tldr.nettime.org/@tante/116605858023186072

Google Search rests on a social contract: their bots can crawl our sites, they can index our sites, and they can show excerpts of our sites because

and •only because•

they send people to our sites. •Our• sites, our words, with our design, with our links, with our context and our aesthetics, shared the way we want to share them.

Google is announcing — unambiguously and with great fanfare — that they are now fully breaking that already-ragged contract. We should reciprocate.

1/2

Quick strategy discussion, for those who understand Google indexing and SEO:

If I want to yank a web site out of Google’s now-fully-extractive search, should I (1) disallow googlebot in robots.txt or (2) add `<meta name="googlebot" content="noindex">` to all the page headers?

The goal here is not just to remove my contributions to the commons from Google’s results, but to •make Google aware• that sites are pulling consent. What will best do that?

2/2

Same question as the previous post, except for Wikipedia. What would you like to see them do to send a shot across the bow?

Or…well, it’s Wikipedia. Maybe more like a shot to the hull.

3/2

Going with meta noindex for now. My thinking is that this actively tells Google to yank already-crawled content from their index, whereas they might take a robots.txt entry to mean “do not update, but keep showing last fetched.”
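
For a site built from static HTML files, adding the tag to every page header can be scripted. What follows is a minimal sketch, not anyone's actual setup: the function name `add_noindex` is made up for illustration, and it assumes each page has an ordinary `<head>` element.

```python
import re

# The tag quoted earlier in the thread.
NOINDEX_TAG = '<meta name="googlebot" content="noindex">'


def add_noindex(html):
    """Insert the googlebot noindex tag right after the opening <head> tag."""
    # Leave pages alone if they already carry a googlebot meta tag.
    if re.search(r'<meta[^>]+name=["\']googlebot["\']', html, flags=re.IGNORECASE):
        return html
    # Match <head> or <head ...attributes>, but not <header>.
    new_html, count = re.subn(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + "\n" + NOINDEX_TAG,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("no <head> element found")
    return new_html
```

Running it twice on the same page is harmless, since the second pass finds the existing tag and returns the page unchanged.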

OK, a •lot• of replies need this response:

Yes, of •course• they will start ignoring robots.txt etc as soon as they think it hurts their business. Of course.

It is important to •force that fight•, rather than just capitulating in advance.

Defeatism is a form of surrender. Cynicism is surrender. Despair is surrender. Nihilism is surrender.

Our job is to •care• and to •keep caring• and to •keep doing and keep building• and to •endure• longer than them.

@inthehands hard agree (even though I'm thinking of this in broader terms). It can be so difficult to keep the will to do so, in practice... but it's important to do it whenever we can.

@inthehands

Damn straight.

Keep the faith, baby. And take care of each other.

@inthehands
I doubt the cynicism = surrender part. Cynicism is refusing to surrender in the face of an overly mighty enemy.

@musevg dunno, could go either way. i know people who are cynical, so they don't actively impede destructive institutionalism because they have an honestly pretty accurate view of the power imbalance.
then again, i also know cynics who love to impede the movement of monoliths specifically because they are aware of that same power imbalance.

@inthehands

@inthehands but it's haaaaaaard 😐

@inthehands On the one hand, I agree - if nothing else, because caring, doing and building are what I *want* to do.

But on the other hand: "We will rebel against the AI oligarchs by creating *even better* training data for them!"

@inthehands It's important to note that search indexing is considered "transformative" and thus fair use *because* it does not supplant the market for the original content. That goes out the window when the product functions to capture traffic that would otherwise go to the sites. They are acting with impunity, but existing copyright law addresses this if courts find it to be not transformative.
@jedbrown @inthehands I can only go by German/EU law, and here it is not transformative (because duh!). The reproduction is the key thing here: if you reproduce another's work outside of private use, you are violating Urheberrecht (creator's rights): privileges enshrined in law to the creator of a work (some of which can be licensed out). One of these is distributing reproductions.
E.g. any time you upload an image to SM, their ToS say you grant them license to reproduce (among others).
@jedbrown @inthehands sure, but I'm pretty sure US law would consider ignoring robots.txt as hacking ;)

@inthehands Two quotes from Pratchett come to mind

>>> “All witches are selfish, the Queen had said. But Tiffany’s Third Thoughts said: Then turn selfishness into a weapon! Make all things yours! Make other lives and dreams and hopes yours! Protect them! Save them! Bring them into the sheepfold! Walk the gale for them! Keep away the wolf! My dreams! My brother! My family! My land! My world! How dare you try to take these things, because they are mine!

>>> "We look to ... the edges," said Mistress Weatherwax. "There's a lot of edges, more than people know. Between life and death, this world and the next, night and day, right and wrong ... an' they need watchin'. We watch 'em, we guard the sum of things. And we never ask for any reward. That's important.”

@inthehands thank you friend. Adding your quote to my commonplace book
@inthehands @philbaker1 I don't think cynicism is surrender. Cynicism is being aware of what could happen and preparing yourself mentally for it. Just because I'm cynical about some things doesn't mean I've given into them. It means I'm aware of them and I know how bad they can get.
@inthehands My site is tiny, but I had to make sure and add that into my own. I don't care if Google never happens to come across it, the idea of them using my site to train their LLM is sickening.
I also have to say, your site is absolutely gorgeous.

@goetic

Thank you!! My site is a labor of love, handmade from the ground up for the few people who will find it.

@inthehands annoy people into doing better.
@inthehands Be a lighthouse in the tempest.

@inthehands The web that existed before the Googlebot is still there. Some parts of it are gone, others have emerged. A quietly thriving universe.

That same web will still be there when Google is given a coup de grâce after being mortally slopped under the weight of its hubris. We'll all make sure of that.

@inthehands I think of nihilism as exactly about valuing what we care about and building a world that also values it.
"Respect what you love" has become something of a mantra of mine.
Google would say that they are in the right because for them meaning & value originates somewhere like the stock market, and if we care about what's on the web and the contributors to it, we must promote & support the value of that, we must support values which originate somewhere else, somewhere that exists beyond our financial markets.

@inthehands One of the things I've done recently is to bring enforcing robots.txt within my webserver engine. The /robots.txt itself still exists; the vast majority of it is a list of bots that are `Disallow: /` .

I still get a few of these bots attempting to hit the site, so it's definitely doing something.
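
One way to picture server-side enforcement: read the same robots.txt the bots are supposed to read, collect every agent whose group says `Disallow: /`, and refuse requests whose User-Agent header names one of them. This is a rough Python sketch rather than the actual webserver engine described above; `ExampleBot` and `OtherBot` are placeholder names, and the parser only handles the simple "whole site disallowed" case.

```python
def fully_disallowed_agents(robots_txt):
    """Return lowercased agent tokens whose group contains `Disallow: /`."""
    blocked = set()
    group = []        # agents named in the current group
    in_rules = False  # True once the current group's rules have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                group, in_rules = [], False
            group.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                # Skip the wildcard and empty names: only named bots are refused.
                blocked.update(a for a in group if a and a != "*")
    return blocked


def should_block(user_agent_header, blocked):
    """True if the request's User-Agent header mentions a disallowed agent."""
    ua = user_agent_header.lower()
    return any(token in ua for token in blocked)


# Hypothetical robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: ExampleBot
User-agent: OtherBot
Disallow: /

User-agent: *
Disallow:
"""

BLOCKED = fully_disallowed_agents(ROBOTS_TXT)
```

A request handler would then answer 403 whenever `should_block(header, BLOCKED)` is true, so the file stays the single source of truth for both polite and impolite bots.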

@inthehands you can block their bots at the network level
@glassresistor @inthehands are you doing this in a particular way? Basically looking for different approaches.
@rooneymcnibnug @inthehands filtering by user agent, ip address, cloudfront no bots acl config, by load
@glassresistor @inthehands oh okay I was misinterpreting the statement, my bad
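
For anyone else looking for approaches, the user-agent and IP-address parts of that list can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the networks are reserved documentation ranges rather than any real crawler's addresses, and `examplebot` is a placeholder name.

```python
import ipaddress

# Hypothetical block lists; substitute ranges and names you have verified.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),   # IPv4 documentation range
    ipaddress.ip_network("2001:db8::/32"),  # IPv6 documentation range
]
BLOCKED_UA_SUBSTRINGS = ("examplebot",)


def is_blocked(remote_addr, user_agent):
    """Refuse a request if its source address or User-Agent is on a block list."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_UA_SUBSTRINGS)
```

Checking the address first matters because a user agent is trivially faked, while the source address is harder to disguise.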

@inthehands All this, and a little more! When Google *does* start ignoring robots.txt and other mechanisms, that's another victory for us, not them, even if it means we have to react to it.

Not all of Google's infrastructure is servers in a giant building, or software systems running on top of it, or even offices full of stressed out tech workers. Part of their infrastructure, the cladding on the castle walls, is their false pretense of being good citizens on the internet. When we call their bluff and they eventually drop the pretense, that's us getting them to tear down the outer layers of the castle themselves. We know what they are, and we can make them admit it, and that's power.

@inthehands I know of at least one professional artist who has deliberately poisoned their images, in an attempt to deter AI scraping (mostly because the scrapers blast her small site and effectively DoS it). If they follow robots.txt, they're not affected... but they were already ignoring robots.txt

I just read an IARPA paper that said poisoning as little as .1% of training data can disrupt a model. If content creators choose to deliberately poison content that they ask not to be scraped, it might be a nice way to deter bad behavior.

The tools I know of work on imagery, but with effort people may come up with stuff that works on data as well. E.g., burying base64-encoded malicious prompts in your text, posting tables as poisoned images rather than text, etc.

Seems like we should start organizing and taking firm action now, before AI companies start buying politicians and making such defenses illegal.

@inthehands And since I saw the question (which was immediately deleted - they probably googled the answer after asking): You use a tool like Nightshade (https://nightshade.cs.uchicago.edu/whatis.html), which modifies the image in a way that's imperceptible to humans, but very visible to AI, effectively making AI "see" the image differently than a human would. When used in AI training, the AI may "see" a toaster when the picture (what a human sees) is actually a photo of a person sitting in a car. When the AI is then asked to generate a picture of someone in a car, it outputs a toaster.

Obviously one image won't do this, but when used at scale it can have an impact.

@mathaetaes @inthehands > posting tables as poisoned images rather than text

Please **never** do that. Accessibility is more important than poisoning LLMs.

@raulmatias @inthehands Ooh, good point - I had completely forgotten about screen readers in that context.
@inthehands If they ignore robots.txt, they will be added to the block list in nginx.conf. My robots.txt has a note stating as much. There is plenty of company there!

@schamschula @inthehands

Mind sharing the necessary subset of the nginx config to enforce robots.txt as an nginx block list? Thank you.

@albertcardona @inthehands It involves a couple steps, given the idiosyncrasies of the nginx regex support (no full pcre here!).
I keep two classes of blocked agents: (1) bad agents; and (2) scraping false agents. A third regex unblocks agents that are false positives (due to (2)).
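
The three-regex logic described here can be written out in Python to show how the pieces interact. This is not the nginx configuration itself, which was not shared; the patterns are invented placeholders, and reading class (2) as "scrapers presenting a faked browser string" is an assumption.

```python
import re

# (1) Agents that are bad outright. Placeholder names.
BAD_AGENTS = re.compile(r"examplescraper|badbot", re.IGNORECASE)
# (2) Browser strings that look faked, here: implausibly old Chrome versions.
FAKE_AGENTS = re.compile(r"Chrome/[1-4]\d\.", re.IGNORECASE)
# (3) Agents that trip (2) but are known to be legitimate. Placeholder name.
FALSE_POSITIVES = re.compile(r"goodarchiver", re.IGNORECASE)


def is_blocked_agent(user_agent):
    """Block class (1) always; block class (2) unless regex (3) rescues it."""
    if BAD_AGENTS.search(user_agent):
        return True
    return bool(FAKE_AGENTS.search(user_agent)) and not FALSE_POSITIVES.search(user_agent)
```

The rescue regex only applies to the second class, which matches the description above: an agent on the outright-bad list stays blocked no matter what else its string contains.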
@inthehands in forcing that fight, google is going to find that the rest of the internet already has sophisticated tools for this fight. My anubis config should already be blocking google.

@inthehands Google executives don't care about consent. Google executives care about their bottom line.

Copyright class action seems most promising. Along with adversarial material to poison their scrapers

@inthehands In that case, it's better if we send them poisoned data instead, using iocaine [0] or Nepenthes [1].

[0]: https://iocaine.madhouse-project.org
[1]: https://zadzmo.org/code/nepenthes/

@inthehands

I’m mostly worried about their “agentic” part, because that sounds like new infrastructure with possibly different user agents etc., so harder to ban, and I’m 💯 sure it will DEFINITELY have no “social contract” whatsoever.

@inthehands You actually can kinda enforce noindex / robots.txt!

If you set a page as no index, consider adding somewhere in it a random, non-displayed, non-screenreadable link titled something human-obvious like “ban-me-382972.html”. Add a human readable warning in the HTML comments. Then a little quick HAProxy config can IP-ban anyone who nonetheless tries to load it. It worked like magic on my server to reduce AI bot load. And all they had to do to avoid a ban was respect my robots.txt!
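
The trap logic is small enough to sketch. This is a Python illustration of the idea, not the HAProxy config: the class name `HoneypotBan` is made up, the ban list lives in memory only, and a real deployment would want expiry and persistence.

```python
class HoneypotBan:
    """Ban any client address that requests the hidden trap URL."""

    def __init__(self, honeypot_path):
        self.honeypot_path = honeypot_path
        self.banned = set()

    def handle(self, remote_addr, path):
        """Return the HTTP status a request from this address should get."""
        if remote_addr in self.banned:
            return 403
        if path == self.honeypot_path:
            # Only a crawler ignoring noindex / robots.txt follows this link.
            self.banned.add(remote_addr)
            return 403
        return 200


# Trap path taken from the example filename above.
guard = HoneypotBan("/ban-me-382972.html")
```

Once an address touches the trap, every later request from it gets a 403, while clients that never follow the hidden link are unaffected.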

@inthehands this is a fence-post defense against this, google Will Not Care

just start poisoning the data once you detect that google is the one fetching it, just absolutely fucking destroy their LLM output

@ShadowJonathan @inthehands agree. I don't think defense is the best reaction to sustain a healthy internet. this rhetoric has been untrue since... Google (other similar corps).

random offensive approach such as collective data poisoning, public exposé, factual based journalism, education, jailtime, guillotine & other accountability and positive encouragement should coexist to foster the internet to recover better

@ShadowJonathan @inthehands they also are pretty involved with the contents of the standard (75% of the authors), so luring the crawler into a pit of crappy data is probably your only way to protest besides refusing to hand any of your money and attention to them.

Link: https://datatracker.ietf.org/doc/html/rfc9309

RFC 9309: Robots Exclusion Protocol

This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for the protocol, instructions for handling errors, and instructions for caching.

@wsslmn @ShadowJonathan @inthehands Visit my crappy data generator at https://ptmcg.pythonanywhere.com/pyrac A Mad-libsian version of the old Racter program. Refresh for new crap content. (Not AI generated, just obnoxious Python.)
I played with this some more, now cites fictitious books by various authors.
@ShadowJonathan @inthehands what if we honeypotted them? Get them to ingest gigabytes upon gigabytes of LLM generated text to ruin their efforts?
@inthehands also probably worth it to submit a pagemaster/webmaster request to them to directly tell them to deindex your site. Also DMCA takedowns to Google are usually effective. If you're in the jurisdiction of Australia you're potentially able to go after them iirc. (The Australian government went after them for embedding news articles in their output or something)
@inthehands
What guarantee does one have that Google will abide by these restrictions?
@inthehands meta noindex it is, definitely. robots disallow can actually hurt the process, since google cannot fetch the page that carries the noindex tag and therefore won't deindex it.
btw, they do indeed respect noindex and robots.txt ATM, since it's quite easy to check if pages still get found. Then again, you never know what does not show up in search but is used for training (without giving credit, obv.) anyway. As far as i see, google still remains more standards-compliant than e.g. OpenAI.
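
That interaction is easy to get wrong, so here is a small check using only the Python standard library. It is a sketch under assumptions: the helper names are invented, it inspects only `<meta>` tags (not HTTP headers), and it relies on `urllib.robotparser`'s reading of robots.txt, which may differ in details from Google's.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _NoindexFinder(HTMLParser):
    """Records whether a robots/googlebot meta tag with noindex appears."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        targets_crawler = attributes.get("name", "").lower() in ("robots", "googlebot")
        if targets_crawler and "noindex" in attributes.get("content", "").lower():
            self.noindex = True


def has_noindex(html):
    finder = _NoindexFinder()
    finder.feed(html)
    return finder.noindex


def crawl_allowed(robots_txt, path, agent="Googlebot"):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


def noindex_is_visible(robots_txt, path, html):
    """True only if the page has noindex AND the crawler may fetch it to see it."""
    return has_noindex(html) and crawl_allowed(robots_txt, path)
```

If `noindex_is_visible` comes back false for a page that does carry the tag, robots.txt is hiding the tag from the crawler.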
@korrupt @inthehands
Then my question is: Will Google claim that their AI search isn't subject to the old conventions and use that data to train AI and serve those results in their new format?
@inthehands As I said just a while ago: Every big tech press event these last few years has felt like "Announcing our exciting plans for oligarchs to strip-mine the entire world and immiserate all of humanity! Get on board, and also death to the unbelievers!"

@datarama @inthehands

My recommendation is always "Follow the Money" .
Google is now an adjunct of the fossil fuel industry & its fossil fuel funded public corruption.

Go after the wealth of the billionaires & oil oligarchs funding Google's #Enshittification

#PrinceBonesaw
Alwaleed bin Talal
Chris Hohn
Elon Musk
Sergey Brin
Peter Thiel
Larry Ellison
Charles Koch

https://www.bloomberg.com/news/articles/2018-04-06/google-thiel-stand-out-in-saudi-prince-s-silicon-valley-tour

https://www.washingtonpost.com/technology/2025/05/13/trump-tech-execs-riyadh/

https://www.wsj.com/finance/investing/chris-hohns-tci-made-18-9-billion-last-year-shattering-hedge-fund-records-e155153b

https://www.sfgate.com/tech/article/billionaire-hohn-more-google-layoffs-17736530.php

Fossil fuel phase out.

@datarama @inthehands it’s going to be death to believers and unbelievers if they keep getting their way.
@JoBlakely @datarama @inthehands
True. The elites don't reward loyalty, unless they think they can get something out of it in the future. Once they can no longer extract anything more from you, they immediately throw you to the wolves.

@inthehands @datarama

Here's how I've seen the response to Google's latest bullshit: