Ok, training an AI model consumes a lot of energy. But surely scraping the data to feed it consumes even more, if you factor in all the load on the scraped servers? Do we have numbers?
@rgs No numbers, but most of my work last week involved setting up mitigations on customers' servers that were being DDoSed by (presumably) AI scrapers.

@rgs I was always under the impression that the data scraping was part of the cost of training.

And a major problem is that the people who own the AIs aren't the ones paying that cost.

@rrwo Maybe the data scraping is counted from the scraper's side, but what about the scrapees? (A couple of years ago, I worked for a large website that was aggressively scraped by all kinds of people with lots of computing power. It was costing a fortune, and blocking the scrapers was an arms race.)

@rgs

That's what I meant.

@rrwo Okay. Because that's never made clear in the estimates.

@rgs @rrwo

I doubt anyone has the data, would love to see it.

GPT-4 training is on the order of 1 GWh.

If we assume that the total volume scraped is on the order of 1 TB (Wikipedia is on the order of 10 GB) and that it is served at a rate of 1 MB/s, then it takes about 1 Ms (roughly 280 hours) to serve all that data. If the server consumed 1 kW while doing this, we are looking at something on the order of 1 MWh (I'm rounding up here from 0.28; I'm also ignoring the energy consumption under normal operation).

Now, if my estimate is 100x off (say, 10x more data and 10x more time to serve), then it could be 100 MWh for a single pass of the crawler. Let's take that as an upper limit.
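The arithmetic above can be sketched in a few lines; all the input figures here are the rough assumptions from this thread (1 TB served at 1 MB/s by a server drawing 1 kW), not measured values:

```python
# Back-of-envelope: energy cost of serving scraped data.
# All three inputs are rough assumptions, not measurements.
data_bytes = 1e12        # ~1 TB total volume scraped (assumption)
rate_bytes_per_s = 1e6   # ~1 MB/s serving rate (assumption)
server_power_w = 1000.0  # ~1 kW server draw while serving (assumption)

serve_time_s = data_bytes / rate_bytes_per_s       # 1e6 s, about 280 hours
energy_mwh = server_power_w * serve_time_s / 3600 / 1e6  # watt-seconds -> MWh

print(f"serving energy ~ {energy_mwh:.2f} MWh")    # ~0.28 MWh, rounded up to 1 MWh above

# Upper limit if the estimate is 100x off (10x more data, 10x more serve time),
# applied to the rounded-up 1 MWh figure:
print(f"upper limit ~ {1 * 100} MWh")
```

Even the 100 MWh upper limit stays an order of magnitude below the ~1 GWh training figure.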

Assuming the training frequency and the scraping frequency are the same, training still takes more energy than scraping.

This is a very crude estimate, but it should give some idea.

Of course, I would hope that most of those scrapers aren't being used to train a GPT-4 competitor, because that would be awful. If it's a smaller model, then the training energy cost might well be lower than the scraping cost.