@rgs I was always under the impression that the data scraping was part of the cost of training.
And a major problem is that the people who own the AIs aren't the ones paying that cost.
That's what I meant.
I doubt anyone has the data, would love to see it.
GPT-4 training is order of 1 GWh.
If we assume that the total volume scraped is on the order of 1 TB (Wikipedia is on the order of 10 GB) and that it is served at a rate of 1 MB/s, then it takes about 1 Ms, or roughly 280 hours, to serve all that data. If the server consumed 1 kW while doing this, then we are looking at on the order of 1 MWh (I'm rounding up here from 0.28 to 1; I'm also ignoring the energy consumption under normal operation).
Now, if my estimate is 100x off (say 10x more data and 10x more time to serve), then it could be 100 MWh for a single visit of the crawler. Let's go with that as an upper limit.
Assuming the training frequency and scraping frequency are the same, training still takes about 10x more energy than scraping even at that upper limit (1 GWh vs. 100 MWh).
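The whole Fermi estimate fits in a few lines of Python; every constant below is one of the order-of-magnitude assumptions from this thread, not a measured value:

```python
# Back-of-envelope: energy to serve scraped data vs. energy to train.
# All constants are rough order-of-magnitude assumptions, not measurements.

DATA_SCRAPED_BYTES = 1e12   # ~1 TB of scraped data (assumed)
SERVE_RATE_BPS = 1e6        # served at ~1 MB/s (assumed)
SERVER_POWER_W = 1e3        # ~1 kW server draw while serving (assumed)
TRAINING_ENERGY_WH = 1e9    # GPT-4 training, order of 1 GWh (assumed)
ERROR_FACTOR = 100          # "what if I'm 100x off" upper limit

serve_time_s = DATA_SCRAPED_BYTES / SERVE_RATE_BPS       # 1e6 s, ~280 h
scrape_energy_wh = SERVER_POWER_W * serve_time_s / 3600  # ~0.28 MWh
upper_limit_wh = ERROR_FACTOR * scrape_energy_wh         # ~28 MWh

print(f"serving time: {serve_time_s / 3600:.0f} h")
print(f"scrape energy: {scrape_energy_wh / 1e6:.2f} MWh")
print(f"100x upper limit: {upper_limit_wh / 1e6:.0f} MWh")
print(f"training vs. scraping upper limit: {TRAINING_ENERGY_WH / upper_limit_wh:.0f}x")
```

Note that the computed 0.28 MWh and the rounded-up 1 MWh differ, which is why the "100x off" case lands at ~28 MWh here rather than the 100 MWh quoted above; either way training dominates.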
This is a very crude estimate, but it should give some idea.
Of course I would hope that most of those scrapers are not used to train a GPT-4 competitor, because that would be awful. If it is a smaller model, then the training energy cost might very well be lower than the scraping cost.