"…researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted…set up through the #RobotsExclusionProtocol, a method for website owners to prevent automated bots from crawling their pages using a file called #robotstxt."
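To make the mechanism in the quote concrete, here is a minimal sketch of how a crawler could consult a robots.txt file before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt content, the bot names (`ExampleAIBot`, `OtherBot`) and the `example.com` URLs are hypothetical and chosen only for illustration; they are not taken from the data sets or sites the researchers studied.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked from the whole site,
# while every other bot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named bot is disallowed everywhere on the site.
blocked = parser.can_fetch("ExampleAIBot", "https://example.com/articles/1")

# Any other bot falls through to the wildcard rules.
allowed = parser.can_fetch("OtherBot", "https://example.com/articles/1")
private = parser.can_fetch("OtherBot", "https://example.com/private/x")

print(blocked, allowed, private)
```

Note that robots.txt only expresses the site owner's wishes: the check above happens on the crawler's side, so the restriction works only if the crawler chooses to honor it.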