Oh wow, #OpenAI is #scraping #CT #logs like a kid in a candy store 🍬. Apparently, they're on a mission to hunt down... robots.txt files? 🤖🗂️ Because who doesn't love a treasure trove of 404 errors and TLS certificates? 💾🔍
https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3 #robots_txt #404_errors #TLS_certificates #tech_news #HackerNews #ngated
benjojo:

lol. I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape, based on the near-instant response from ...

Thinking about your robots.txt file? It might seem counterintuitive, but disallowing RSS feeds and certain pagination paths can be a smart SEO move.

This technique helps search engines focus their crawl budget on your most important pages and avoids potential duplicate-content issues.

This post on WebHeads United looks at the technical reasons behind this strategy and whether it's right for your site.

Read the SEO deep dive: https://webheadsunited.com/why-disallow-rss-feeds-and-pagination/
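The advice above boils down to a few Disallow lines. A toy robots.txt in that spirit (the paths are illustrative, not taken from the linked article), checked with Python's stdlib parser:

```python
from urllib import robotparser

# Hypothetical rules: keep crawlers off feeds and paginated archives,
# leave everything else open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
Disallow: /page/
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("*", "/feed/"))           # False
print(parser.can_fetch("*", "/page/2/"))         # False
print(parser.can_fetch("*", "/important-post"))  # True
```

Crawlers that honor robots.txt will then skip the feed and pagination URLs and spend their visit on the canonical pages instead.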

#SEO #TechnicalSEO #CrawlBudget #WebDev #robots_txt

#konterfAI, the #AI #modelpoisoner for unfriendly, disrespectful and malicious AI #scrapers / #crawlers, now has an interesting statistics function that collects data about the nasty "guests" that don't respect your robots.txt... Enjoy!
#robots_txt
See update on
https://korium.org/2024/08/02/konterfai/
and version 0.2.0 on
https://codeberg.org/konterfai/konterfai/releases/tag/v0.2.0
konterfAI – An AI tool to poison disrespectful AI’s models and training data (Updated)

Some open source people have published code on Codeberg that can be used in defense of your web server (or home network). It's called konterfAI, works anywhere Docker (or Ollama itself) runs, even on a Raspberry Pi, and is amazingly simple. konterfAI is a proof-of-concept for a model-poisoner for LLMs (Large Language Models)...

korium.org
Fetch robots.txt and check whether crawling is disallowed (Part 1) - Qiita

Overview: I want to write a PHP program that fetches robots.txt and checks whether crawling is disallowed. This time I'll build the part that fetches robots.txt. Next time, the crawl-refusal…
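The fetch step the post describes can be sketched in Python rather than PHP. One wrinkle worth handling is the HTTP status: the convention used by Python's `urllib.robotparser` is that 401/403 mean "everything disallowed" while other 4xx (no robots.txt) mean "everything allowed". A hypothetical helper built on that convention (in a real crawler the status and body would come from an HTTP client such as `urllib.request`):

```python
from urllib import robotparser

def robots_policy(status, body):
    """Turn a robots.txt HTTP response into a parser ready for can_fetch().

    Status handling mirrors RobotFileParser.read():
    401/403 -> disallow everything; other 4xx -> allow everything.
    """
    rp = robotparser.RobotFileParser()
    if status in (401, 403):
        rp.disallow_all = True
    elif 400 <= status < 500:
        rp.allow_all = True   # no robots.txt: crawling is not restricted
    else:
        rp.parse(body.splitlines())
    return rp

print(robots_policy(404, "").can_fetch("MyBot", "/anything"))  # True
print(robots_policy(403, "").can_fetch("MyBot", "/anything"))  # False
```

The check step (part 2 of the post) is then just `can_fetch(agent, path)` on the returned parser.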

Qiita
Non-Google search engines blocked from showing recent Reddit results

Updated robots.txt file hits Bing and others without a Reddit deal. See full article...
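Mechanically, this kind of selective blocking is just per-agent groups in robots.txt: one group allows a named crawler, the catch-all group shuts everyone else out. A toy file (illustrative only, not Reddit's actual rules), evaluated with Python's stdlib parser:

```python
from urllib import robotparser

ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "/r/python/"))  # True
print(rp.can_fetch("Bingbot", "/r/python/"))    # False
```

A crawler matches the most specific User-agent group that names it; everyone without a named group falls through to the `*` group and gets blocked.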

Ars OpenForum

Are you using a CDN and don't want to manage two robots.txt files? You can redirect your www version's robots.txt to the CDN's and manage it all there, says Google's @methode https://www.seroundtable.com/robots-txt-cdn-37678.html

#robots_txt #cdn #google #search #seo

Google: Using A CDN & Want One Robots.txt File, Redirect Yours To The CDN

Do you use a CDN for some or all of your website and you want to manage just one robots.txt file, instead of both the CDN's robots.txt file and your main site's robots.txt file? Gary Illyes from Google ...
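What makes the single-file setup work is that RFC 9309 expects crawlers to follow redirects when fetching robots.txt (the RFC says they should follow at least five consecutive hops). A sketch of that hop limit with a stubbed fetcher (`fetch` here is a stand-in mapping, not a real HTTP client):

```python
def resolve_robots(url, fetch, max_hops=5):
    """Follow up to max_hops redirects to reach the final robots.txt body.

    `fetch` maps a URL to ("redirect", next_url) or ("body", text);
    it stands in for a real HTTP client such as urllib.request.
    """
    for _ in range(max_hops + 1):
        kind, value = fetch(url)
        if kind == "body":
            return value
        url = value  # follow the redirect
    raise RuntimeError("too many redirects fetching robots.txt")

# Toy setup: the www robots.txt redirects to the CDN copy.
TABLE = {
    "https://www.example.com/robots.txt":
        ("redirect", "https://cdn.example.com/robots.txt"),
    "https://cdn.example.com/robots.txt":
        ("body", "User-agent: *\nDisallow: /private/\n"),
}

body = resolve_robots("https://www.example.com/robots.txt", TABLE.get)
```

The crawler ends up applying the CDN's file to the www host, so one file governs both.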

Search Engine Roundtable

I’ve made a little something, so I thought I'd share.

Gort is a robots.txt parser and evaluator. It implements RFC 9309.

More details in the ReadMe: https://github.com/pointlessone/gort
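For anyone wondering what "implements RFC 9309" buys over naive first-match prefix checking: the RFC says the most specific (longest) matching rule wins, with Allow winning ties. A minimal Python sketch of that evaluation rule (independent of Gort; its actual API may differ):

```python
def allowed(rules, path):
    """Evaluate robots.txt rules per RFC 9309: the longest matching
    rule wins; on a length tie, Allow wins.

    `rules` is a list of ("allow" | "disallow", path_prefix) pairs.
    """
    best_len, verdict = -1, True  # no matching rule: allowed
    for kind, prefix in rules:
        if path.startswith(prefix):
            n = len(prefix)
            if n > best_len or (n == best_len and kind == "allow"):
                best_len, verdict = n, (kind == "allow")
    return verdict

RULES = [("disallow", "/downloads/"), ("allow", "/downloads/free/")]
print(allowed(RULES, "/downloads/secret.zip"))  # False
print(allowed(RULES, "/downloads/free/a.zip"))  # True
print(allowed(RULES, "/about"))                 # True
```

Note the second case: a naive first-match evaluator would stop at `Disallow: /downloads/` and block the free area too.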

#Ruby #rubygem #release #robotstxt #robots_txt #rfc9309

GitHub - pointlessone/gort: robots.txt parser and evaluator

robots.txt parser and evaluator.

GitHub

My local government just launched a site redesign, changing CMSes and permalink structures.

They didn't set up redirects for old URLs.

Half the site is still blocked in robots.txt.

I'm professionally flabbergasted.

#webdev #redirects #robots_txt

#Development #Initiatives
Google to explore alternatives to robots.txt · Generative AI would require new machine-readable methods https://ilo.im/13z0t1

_____
#AI #GenerativeAI #ChatBots #BotAccess #Website #WebDevelopment #WebDev #Community #Discussion #Protocol #Robots_txt

Google to explore alternatives to robots.txt in wake of generative AI and other emerging technologies

Google said it will engage with the web and AI communities with public discussions during this process.

Search Engine Land