Among the many things Doctorow gets wrong in That Post is this:

"It's not 'unethical' to scrape the web in order to create and analyze data-sets. That's just 'a search engine.'"

Apart from the fact that AI companies are particularly malicious in the way they scrape the web, I'd say we accept search engine scraping mostly on the premise that it's done for the benefit of the scraped sites. There's no such principle of mutual benefit in AI scraping: the AI company gets the value of the data scraped and you get bupkis at best, and possibly DDoS'd.

@lrhodes he's right that it's not 'unethical' to scrape the web in order to create and analyze data-sets. he's wrong that that's what LLMs are doing. they scrape the web in order to reproduce the scraped content. by definition.

@Yuvalne

Unless the LLM vendors are lying, which is actually likely, they aren't reproducing anything more than word frequencies across very large samples. If that's true, then scraping for that is entirely within ethical boundaries.

@lrhodes

@screwturn @lrhodes no, that's incorrect. an LLM definitely doesn't just plot out word frequencies, you don't need an LLM to do that. the whole point of the "attention is all you need" paper is to create a method to replicate texts more faithfully.
@lrhodes plus the idea of people being able to establish consent and boundaries for what should be scrape-able and what shouldn't on their web sites has been around since basically the beginning of the web (robots.txt). the only ethical reasons i can think of for ignoring robots.txt would be things like holding corporations and governments to account. just "creating and analyzing data sets" on its own isn't enough justification
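
For anyone who hasn't looked at how robots.txt actually works, here is a minimal sketch of a crawler honoring it, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` and `SomeSearchBot` names, and the `allowed` helper are all made up for illustration; a real crawler would fetch the file from the site instead of hardcoding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example (not from any real site).
# It bans one named bot entirely and keeps everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/blog/post"),
        ("SomeSearchBot", "https://example.com/blog/post"),
        ("SomeSearchBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, allowed(agent, url))
```

Note that nothing here is enforced: the site states its boundary and the crawler chooses whether to check it before fetching, which is why ignoring it is a consent question and not a technical one.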

@lrhodes

You make an interesting point, and this brings up something that has been bugging me.

It almost feels like trying to equate what Aaron Swartz (RIP) was being charged with and what these sloperators are doing.

I hadn't made that connection until you posted this here, and I don't want to give these evil companies any ideas. I think if anyone tries to say those two things are the same, that argument should be rejected immediately and loudly.

I'm not suggesting anyone has done this yet, but I could see that move coming.

@lrhodes

"Among the many things Doctorow gets wrong"

OMFG!
DO YOU MEAN DOCTOROW IS NOT GOD AMONGST MEN WHO ONLY ISSUES ANGELIC DOCTRINE FROM HIS MOUTH?

It takes a lot of courage to even utter a mild critique of the prophet!

#respect

@lrhodes

"I'd say we accept search engine scraping mostly on the premise that it's done for the benefit of the scraped sites"

I would qualify this somewhat by pointing out how, independent of AI, this acceptance ultimately led to Google benefiting from scraping websites at the latter's expense. The value proposition of Google indexing your site is it draws more visitors to your site who may not have known about it otherwise.

@lrhodes This isn't an absolute good - some sites might not want that kind of attention - but it's easy to see why it might appeal to a large number of people. Once Google starts selling ads, though, that value proposition tilts against their favor; those websites become competitors for ad impressions (clicking through a given result means users spend more time there and less on Google).
@lrhodes Fortunately for Google, they already had an effective monopoly on the search engine business by this point, so it was easy to scrape those sites for data to power the features that kept users on Google.

@Video_Game_King @lrhodes Let's not forget that this notion that one obviously wants to "draw visitors" or attract more traffic to one's website is bonkers: the owner of the website has to pay for the corresponding resource usage on the server (additional network traffic, CPU load, ...) and often doesn't get any direct benefit.

IOW, this notion itself is predicated on turning such accesses into money, e.g. via advertisement.

@lrhodes I questioned this too. A search engine refers you to the original author or creator's work. It's like the difference between a quote or reference where you name the originator, and plagiarism. Maybe Doctorow is not aware of that distinction. I can forgive him because he seems aware of a hell of a lot of things the rest of us need to know about.

@lrhodes
Also also, 90% of my objection would go away if they just weren't so *bad* at it

Asking for the same page multiple times a second isn't scraping lol
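
For contrast, a rough sketch of what not hammering a site could look like: a per-host throttle that makes the crawler wait between requests. The `PoliteThrottle` class and the 5-second default are assumptions chosen for illustration, not any standard or anything a particular crawler is known to use.

```python
import time
from urllib.parse import urlsplit


class PoliteThrottle:
    """Hypothetical helper: enforce a minimum gap between requests per host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval  # seconds between hits on one host
        self._clock = clock                # injectable so it can be tested
        self._sleep = sleep
        self._next_free = {}               # host -> earliest allowed request time

    def wait(self, url):
        """Block until url's host may be requested again; return seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_free.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # The request happens at now + delay; the next one must wait a full interval.
        self._next_free[host] = now + delay + self._min_interval
        return delay


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval=1.0)
    for path in ("/a", "/b", "/c"):
        waited = throttle.wait("https://example.com" + path)
        print(f"{path}: waited {waited:.2f}s before fetching")
```

Calling `wait()` before every fetch caps the request rate per site no matter how many pages are queued, which is about the bare minimum for not being mistaken for a DDoS.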