Now that Google have announced their intention to discontinue Web Search and replace it with LLM Summaries Only, I suggest updating your robots.txt to disallow Googlebot (their original search index robot).

After summer there will be no web search left for your website to be indexed in, so the data gathered by the "good" index robot will at best be discarded or at worst be fed to LLMs, leading to plagiarism of your content. In either scenario, no human visitors will be guided to your website.

(If you don't want to be indexed by any search engine at all, you can disallow all robots. You can also add the noindex meta tag to all your pages.)
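For anyone who wants to check what such a rule does before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `can_fetch` helper, the `SomeOtherBot` name, and the example.com URL are all illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        allowed = can_fetch(ROBOTS_TXT, agent, "https://example.com/page")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only models what a well-behaved crawler would decide for itself. It enforces nothing on the server side.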

Edit: robots.txt is not a catch-all, since it is only a suggestion for "nice" crawlers, and its instructions will be ignored by "nasty" crawlers. I reworded the post from "advise" to "suggest", since this action might not actually change anything for you.

#robotstxt #webSearch #noAI

Robots.txt remains the basic signal for well-behaved crawlers, but it can no longer describe the main problem: the same public content can serve classic search, AI answers, model training, and fetching at a user's request. Site operators therefore have to distinguish the purpose of access, verify bot identity, measure the impact on infrastructure, and, for valuable content, enforce rules beyond robots.txt itself.

https://zdrojak.cz/clanky/robots-txt-nestaci-ai-crawleri-meni-jak-weby-chrani-obsah/
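One common way to verify bot identity is a reverse DNS lookup followed by a confirming forward lookup. The sketch below is not the linked article's method. The `is_verified_crawler` function, its injectable lookup parameters, and the hostname suffixes are assumptions for illustration, so check the suffixes against the crawler operator's current documentation before relying on them.

```python
import socket
from typing import Callable, Optional, Sequence

# Assumption: hostname suffixes a crawler operator publishes for its bots.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(
    ip: str,
    trusted_suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse_lookup: Optional[Callable[[str], str]] = None,
    forward_lookup: Optional[Callable[[str], Sequence[str]]] = None,
) -> bool:
    """Forward-confirmed reverse DNS check for a client IP address.

    The default lookups use the system resolver; gethostbyname_ex only
    returns IPv4 addresses, so IPv6 clients need a different forward lookup.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record or resolver failure
        return False
    # The PTR name must sit under a trusted domain (leading dot prevents
    # matches such as "fakegooglebot.com").
    if not hostname.lower().rstrip(".").endswith(tuple(trusted_suffixes)):
        return False
    try:
        # The claimed hostname must resolve back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The lookups are passed in as parameters so the logic can be exercised with fake resolvers, without touching the network. A User-Agent header alone proves nothing, since any client can send any string.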

So, with Google announcing "Search is going full-AI, we won't be sending traffic to the original sites any more", someone else pointed out that this eradication of the traditional search-engine compact - we let you crawl our sites to create your index, and you send visitors to our sites when relevant - means that we can, and should, block all of Google's crawlers now. If they're going to just take, take, take and give nothing back, why let them access your content at all?

But this is cute. Besides the fact that Google documents that some of their crawlers ignore robots.txt, there's this bit of fun. On this page (https://developers.google.com/crawling/docs/robots-txt/create-robots-txt), they link to "the Google list of user agents" (https://developers.google.com/crawling/docs/crawlers-fetchers/overview-google-crawlers).

However, that links to 3 separate pages of them, and *each of those pages explicitly states that it is not comprehensive and lists only the ones they commonly get questions about*. And of course, none of the "User-triggered fetchers" obey robots.txt, along with some others.

So Google isn't even reporting the full list of user-agents that can be used to stop their crawling.

That is some bullshit.

#Google #crawler #RobotsTxt #UserAgent #bullshit #antisocial #web #search #WebSearch #LLM #AI

Scrapers vs Wikis: A person who runs a bunch of custom wiki websites writes about abuse from scrapers
https://weirdgloop.org/blog/clankers
#via:lobsters #robotstxt #scraping #scaling #wiki #web #ai #+
Aggressive AI scrapers are making it kinda suck to run wikis

Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.

Weird Gloop

just over here modifying #robotstxt to block everything #google like a normal person in 2026

Guests on our own web

A few months ago I spun up a new VPS on Linode, London datacentre. Nothing special – Debian, Nginx, a Let's Encrypt certificate, a domain I was going to use for my daily notes and my homelab experiments. No link posted anywhere, no entries in my feeds, no backlinks from the sites I run. Just a freshly assigned IP, from a subnet that a week earlier had belonged to someone else.

[...]

https://write.as/jolek78/guests-on-our-own-web

Five non-obvious things I learned launching a film social network: from a robots.txt trap to the 24-dimensional mathematics of taste

For the past six months I've been working on VibeMuvik, a film social network with reviews, debates, and synchronized movie watching. It's one of those things that seems "not that hard" until you start digging. This article is about the unexpected finds. Not "how I chose my stack" (boring) and not "a WebRTC tutorial" (there are plenty without me). These are five situations where I stumbled, discovered something interesting, and thought "this is worth sharing, others will find it useful". Let's go.

https://habr.com/ru/articles/1027876/

#robotstxt #SEO #WebRTC #Nextjs #IndexNow #sitemap #Googlebot #Cinema_DNA #синхронный_просмотр #рекомендательные_системы

What does your robots.txt look like?

#question #fedipower #website #robotstxt

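ChatGPT reads pages directly, Gemini doesn't: the reality of AI traffic as seen in nginx logs

Results of measuring eight AI assistants with an nginx probe. ChatGPT and Claude read pages directly; Gemini does not. The article introduces a way to tell apart the two signals of AI traffic.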

https://aisparkup.com/posts/11579

The Pope’s Warnings About AI Were AI-Generated, a Detection Tool Claims

https://fed.brid.gy/r/https://www.wired.com/story/pope-tweets-ai-generated-pangram-chrome-extension/