Among the many things Doctorow gets wrong in That Post is this:

"It's not 'unethical' to scrape the web in order to create and analyze data-sets. That's just 'a search engine.'"

Apart from the fact that AI companies are particularly malicious in the way they scrape the web, I'd say we accept search engine scraping mostly on the premise that it's done for the benefit of the scraped sites. There's no such principle of mutual benefit in AI scraping: the AI company gets the value of the scraped data, and you get bupkis at best, and possibly get DDoS'd.

@lrhodes he's right that it's not 'unethical' to scrape the web in order to create and analyze data-sets. he's wrong that that's what LLMs are doing. they scrape the web in order to reproduce the scraped content. by definition.

@Yuvalne @lrhodes Unless the LLM vendors are lying, which is actually likely, they aren't reproducing anything more than word frequencies across very large samples. If that's true, then scraping for that is entirely within ethical boundaries.

@screwturn @lrhodes no, that's incorrect. an LLM definitely doesn't just plot out word frequencies, you don't need an LLM to do that. the whole point of the "attention is all you need" paper is to create a method to replicate texts more faithfully.
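
For reference, the "word frequencies" idea in this exchange can be sketched in a few lines of Python with no model involved. This is only an illustration: the function name `word_frequencies` and the sample sentence are made up here and are not taken from any vendor's pipeline.

```python
from collections import Counter
import re


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text."""
    # Hypothetical tokenizer for illustration: runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, standing in for a scraped page.
sample = "the cat sat on the mat and the dog sat too"
freqs = word_frequencies(sample)

# Show the two most common words with their counts.
print(freqs.most_common(2))
```

A table like this stores counts only. It keeps no record of the order the words appeared in, so the original sentence cannot be rebuilt from it.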