Mastodawn

"The worst of these proposed standards would give websites far greater ability to automatically block legitimate, lawful scraping and crawling. For example, the AI Preferences working group is working on proposals to give publishers a way to express “preference signals” against crawling web data for AI-related purposes, including to train models, generate outputs, and help users search the web. These preference signals would be expressed through robots.txt and could potentially become legally binding in some jurisdictions.

Another working group, called Web Bot Auth, is pursuing efforts to protect sites from overly-aggressive bots that strain website resources—a positive goal that could meaningfully improve the internet in the AI era. But Web Bot Auth is simultaneously pursuing a much more dangerous path as well: standards changes that would enable sites to cryptographically identify bots so that they can more easily block anyone they wish—not just “bad” actors, but competitors, dissidents, or anyone who hasn’t paid for the right to access sites using automated tools. If sites restrict crawling to a preapproved list of cryptographically authenticated bots, they could require licensing payments from those wishing to crawl their sites. This would close off the open web to researchers, archivists, and startups without the ability to pay for automated access.

Websites may have legitimate reasons to worry about AI’s impacts on their traffic and advertising revenue, but those reasons must be weighed against the benefits of the open web."

https://www.eff.org/deeplinks/2026/06/free-and-open-web-under-attack-ietf

#IETF #OpenWeb #WebCrawling #AI #Chatbots #LLMs

The Free and Open Web Is Under Attack at the IETF

The ability to access publicly available information using automated tools is a central value and benefit of a free and open internet. Automated access—often called crawling or scraping—powers important, useful tools for locating, preserving, and analyzing online information. For example, crawling...

Electronic Frontier Foundation

Hacker News May 14

Amazonbot is finally respecting robots.txt

https://xeiaso.net/notes/2026/amazonbot-respecting-robots-txt/

#HackerNews #Tech #WebCrawling

Thanks for giving me a viable business model Amazon!

Xe Iaso May 14

Amazonbot is finally respecting robots.txt

https://xeiaso.net/notes/2026/amazonbot-respecting-robots-txt/

#Tech #WebCrawling #OpenSource

Thanks for giving me a viable business model Amazon!

#Digital ⚓️ #Vagabond 🦈 May 1

Related to my WAF header question the other day, AI crawling is also ruining RSS:

https://www.reddit.com/r/rss/comments/1t0jqis/many_sites_are_blocking_request_to_rss_links/

😔

#WebArchiving #WebCrawling #WebAPI #HTTPasAPI #API #RSS #ATOM #FederatedNews

PPC Land Apr 27

FYI: OpenAI tripled its web crawl after GPT-5 - but ChatGPT users may be declining: New log file analysis of 7 billion OpenAI bot events reveals a 3.5x surge in OAI-SearchBot activity after GPT-5, while ChatGPT user-driven events dropped 28%. https://ppc.land/openai-tripled-its-web-crawl-after-gpt-5-but-chatgpt-users-may-be-declining/ #OpenAI #GPT5 #ChatGPT #AItrends #webcrawling

PPC Land Mar 26

FYI: Google-Agent joins the crawler list as AI browsing gets an official identity: Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users. https://ppc.land/google-agent-joins-the-crawler-list-as-ai-browsing-gets-an-official-identity/ #GoogleAgent #AIBrowsing #UserAgent #WebCrawling #ProjectMariner

PPC Land Mar 15

FYI: Googlebot is not a program - Google engineers finally explain what it really is: Google engineers reveal Googlebot is a misnomer for a central SaaS crawling platform serving dozens of products, with a 15 MB default file size limit and geo-crawling constraints. https://ppc.land/googlebot-is-not-a-program-google-engineers-finally-explain-what-it-really-is/ #Googlebot #SEO #WebCrawling #DigitalMarketing #SaaS

PPC Land Mar 12

Googlebot is not a program - Google engineers finally explain what it really is: Google engineers reveal Googlebot is a misnomer for a central SaaS crawling platform serving dozens of products, with a 15 MB default file size limit and geo-crawling constraints. https://ppc.land/googlebot-is-not-a-program-google-engineers-finally-explain-what-it-really-is/ #Googlebot #SEO #WebCrawling #SaaS #DigitalMarketing

PPC Land Mar 11

FYI: Google's secret crawl logic, finally explained in one page: Google published a new web crawling overview on March 3, 2026, detailing how Googlebot discovers, renders, and manages site access across 30+ years of web indexing. https://ppc.land/googles-secret-crawl-logic-finally-explained-in-one-page/ #Google #SEO #WebCrawling #Googlebot #DigitalMarketing
