"…researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted…set up through the #RobotsExclusionProtocol, a method for website owners to prevent automated bots from crawling their pages using a file called #robotstxt."

https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare

Data for A.I. Training Is Disappearing Fast, Study Shows

New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.

The New York Times
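For a sense of what these restrictions actually look like: robots.txt is just a plain-text file served from a site's root. The sketch below is illustrative, not any particular site's policy, though the crawler names are real published user-agent tokens (GPTBot is OpenAI's, CCBot is Common Crawl's):

# Tell OpenAI's training crawler to stay out entirely
User-agent: GPTBot
Disallow: /

# Same for Common Crawl's crawler
User-agent: CCBot
Disallow: /

# Everyone else may crawl everything
User-agent: *
Allow: /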

Reddit announced on Tuesday that it’s updating its Robots Exclusion Protocol, TechCrunch reports:
https://techcrunch.com/2024/06/25/reddits-upcoming-changes-attempt-to-safeguard-the-platform-against-ai-crawlers/

#Reddit #AI #Robotsexclusionprotocol

Reddit's upcoming changes attempt to safeguard the platform against AI crawlers | TechCrunch

Reddit is updating its Robots Exclusion Protocol, which instructs bots about what the platform does and doesn’t allow to be crawled by third parties.

TechCrunch

For the most part, the Internet Archive limits its scraping to websites that permit it. The #RobotsExclusionProtocol (AKA #robotstxt) makes it easy for webmasters to tell different kinds of crawlers whether or not they are welcome. If your site has a robots.txt file that tells the Archive's crawler to buzz off, it'll go elsewhere.

Mostly.
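
If you're wondering what "checking robots.txt" looks like in code, here's a minimal sketch using Python's standard urllib.robotparser. The example.com URLs are placeholders; "archive.org_bot" is, as far as I know, the user-agent token the Archive's crawler reports:

import urllib.robotparser

# Fetch and parse the site's robots.txt (example.com is a placeholder)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite crawler asks before fetching each page
page = "https://example.com/some/page.html"
if rp.can_fetch("archive.org_bot", page):
    print("robots.txt permits crawling this page")
else:
    print("robots.txt says: buzz off")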

7/