Oh wow, #OpenAI is #scraping #CT #logs like a kid in a candy store 🍬. Apparently, they're on a mission to hunt down... robots.txt files? 🤖🗂️ Because who doesn't love a treasure trove of 404 errors and TLS certificates? 💾🔍
https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3 #robots_txt #404_errors #TLS_certificates #tech_news #HackerNews #ngated
benjojo:

lol. I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape, based on the near-instant response from ...

Thinking about your robots.txt file? It might seem counterintuitive, but disallowing RSS feeds and certain pagination paths can be a smart SEO move.

This technique helps search engines focus their crawl budget on your most important pages and avoids potential duplicate-content issues.

This post on WebHeads United looks at the technical reasons behind this strategy and whether it's right for your site.

Read the SEO deep dive: https://webheadsunited.com/why-disallow-rss-feeds-and-pagination/
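The advice above boils down to a few Disallow lines. A toy robots.txt in that spirit (the paths are illustrative, not taken from the linked article), checked with Python's stdlib parser:

```python
from urllib import robotparser

# Hypothetical rules: keep crawlers off feeds and paginated archives,
# leave everything else open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
Disallow: /page/
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("*", "/feed/"))           # False
print(parser.can_fetch("*", "/page/2/"))         # False
print(parser.can_fetch("*", "/important-post"))  # True
```

Crawlers that honor robots.txt will then skip the feed and pagination URLs and spend their visit on the canonical pages instead.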

#SEO #TechnicalSEO #CrawlBudget #WebDev #robots_txt

#konterfAI, the #AI #modelpoisoner for unfriendly, disrespectful and malicious AI #scrapers / #crawlers, now has an interesting statistics function that collects data about the nasty "guests" that don't respect your robots.txt... Enjoy!
#robots_txt
See update on
https://korium.org/2024/08/02/konterfai/
and version 0.2.0 on
https://codeberg.org/konterfai/konterfai/releases/tag/v0.2.0
konterfAI – An AI tool to poison disrespectful AI’s models and training data (Updated)

Some open source people have published code on Codeberg that can be used in defense of your web server (or home network). It's called konterfAI, works anywhere Docker (or Ollama itself) runs, even on a Raspberry Pi, and is amazingly simple. konterfAI is a proof-of-concept for a model-poisoner for LLMs (Large Language Models)...

korium.org
Fetch robots.txt and check whether crawling is disallowed (Part 1) - Qiita

Overview: I want to write a PHP program that fetches robots.txt and checks whether crawling is disallowed. This time I'll build the part that fetches robots.txt. Next time, the crawl-refusal…
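The fetch step the post describes can be sketched in Python rather than PHP. One wrinkle worth handling is the HTTP status: the convention used by Python's `urllib.robotparser` is that 401/403 mean "everything disallowed" while other 4xx (no robots.txt) mean "everything allowed". A hypothetical helper built on that convention (in a real crawler the status and body would come from an HTTP client such as `urllib.request`):

```python
from urllib import robotparser

def robots_policy(status, body):
    """Turn a robots.txt HTTP response into a parser ready for can_fetch().

    Status handling mirrors RobotFileParser.read():
    401/403 -> disallow everything; other 4xx -> allow everything.
    """
    rp = robotparser.RobotFileParser()
    if status in (401, 403):
        rp.disallow_all = True
    elif 400 <= status < 500:
        rp.allow_all = True   # no robots.txt: crawling is not restricted
    else:
        rp.parse(body.splitlines())
    return rp

print(robots_policy(404, "").can_fetch("MyBot", "/anything"))  # True
print(robots_policy(403, "").can_fetch("MyBot", "/anything"))  # False
```

The check step (part 2 of the post) is then just `can_fetch(agent, path)` on the returned parser.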

Qiita
Non-Google search engines blocked from showing recent Reddit results

Updated robots.txt file hits Bing and others without a Reddit deal. See full article...
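Mechanically, this kind of selective blocking is just per-agent groups in robots.txt: one group allows a named crawler, the catch-all group shuts everyone else out. A toy file (illustrative only, not Reddit's actual rules), evaluated with Python's stdlib parser:

```python
from urllib import robotparser

ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "/r/python/"))  # True
print(rp.can_fetch("Bingbot", "/r/python/"))    # False
```

A crawler matches the most specific User-agent group that names it; everyone without a named group falls through to the `*` group and gets blocked.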

Ars OpenForum

Are you using a CDN and don't want to manage two robots.txt files? You can redirect your www version's robots.txt to the CDN's and manage it all there, says Google's @methode https://www.seroundtable.com/robots-txt-cdn-37678.html

#robots_txt #cdn #google #search #seo

Google: Using A CDN & Want One Robots.txt File, Redirect Yours To The CDN

Do you use a CDN for some or all of your website and you want to manage just one robots.txt file, instead of both the CDN's robots.txt file and your main site's robots.txt file? Gary Illyes from Google ...
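What makes the single-file setup work is that RFC 9309 expects crawlers to follow redirects when fetching robots.txt (the RFC says they should follow at least five consecutive hops). A sketch of that hop limit with a stubbed fetcher (`fetch` here is a stand-in mapping, not a real HTTP client):

```python
def resolve_robots(url, fetch, max_hops=5):
    """Follow up to max_hops redirects to reach the final robots.txt body.

    `fetch` maps a URL to ("redirect", next_url) or ("body", text);
    it stands in for a real HTTP client such as urllib.request.
    """
    for _ in range(max_hops + 1):
        kind, value = fetch(url)
        if kind == "body":
            return value
        url = value  # follow the redirect
    raise RuntimeError("too many redirects fetching robots.txt")

# Toy setup: the www robots.txt redirects to the CDN copy.
TABLE = {
    "https://www.example.com/robots.txt":
        ("redirect", "https://cdn.example.com/robots.txt"),
    "https://cdn.example.com/robots.txt":
        ("body", "User-agent: *\nDisallow: /private/\n"),
}

body = resolve_robots("https://www.example.com/robots.txt", TABLE.get)
```

The crawler ends up applying the CDN's file to the www host, so one file governs both.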

Search Engine Roundtable

I’ve made a little something, so I thought I'd share.

Gort is a robots.txt parser and evaluator. It implements RFC 9309.

More details in the ReadMe: https://github.com/pointlessone/gort
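For anyone wondering what "implements RFC 9309" buys over naive first-match prefix checking: the RFC says the most specific (longest) matching rule wins, with Allow winning ties. A minimal Python sketch of that evaluation rule (independent of Gort; its actual API may differ):

```python
def allowed(rules, path):
    """Evaluate robots.txt rules per RFC 9309: the longest matching
    rule wins; on a length tie, Allow wins.

    `rules` is a list of ("allow" | "disallow", path_prefix) pairs.
    """
    best_len, verdict = -1, True  # no matching rule: allowed
    for kind, prefix in rules:
        if path.startswith(prefix):
            n = len(prefix)
            if n > best_len or (n == best_len and kind == "allow"):
                best_len, verdict = n, (kind == "allow")
    return verdict

RULES = [("disallow", "/downloads/"), ("allow", "/downloads/free/")]
print(allowed(RULES, "/downloads/secret.zip"))  # False
print(allowed(RULES, "/downloads/free/a.zip"))  # True
print(allowed(RULES, "/about"))                 # True
```

Note the second case: a naive first-match evaluator would stop at `Disallow: /downloads/` and block the free area too.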

#Ruby #rubygem #release #robotstxt #robots_txt #rfc9309

GitHub - pointlessone/gort: robots.txt parser and evaluator

robots.txt parser and evaluator.

GitHub

My local government just launched a site redesign, changing CMSes and permalink structures.

They didn't set up redirects for old URLs.

Half the site is still blocked in robots.txt.

I'm professionally flabbergasted.

#webdev #redirects #robots_txt

#Development #Initiatives
Google to explore alternatives to robots.txt · Generative AI would require new machine-readable methods https://ilo.im/13z0t1

_____
#AI #GenerativeAI #ChatBots #BotAccess #Website #WebDevelopment #WebDev #Community #Discussion #Protocol #Robots_txt

Google to explore alternatives to robots.txt in wake of generative AI and other emerging technologies

Google said it will engage with the web and AI communities with public discussions during this process.

Search Engine Land