"AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material."

tl;dr: If you use public domain data (i.e. you don't steal from authors and creators) you can train an LLM about as good as what was cutting edge a couple of years ago. The hard part is curating the data, but once the data has been curated, in principle anyone can use it without going through the painful part again.
So the whole "we have to violate copyright and steal intellectual property" argument is (as everybody already knew) total BS.

https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html?src=rss

It turns out you can train AI models without copyrighted material

It's just a pain in the ass.

Engadget
@j_bertolotti I wonder how an LLM would sound if you could only use 100-year-old texts that are in the public domain.
If we stick to public domain: it'll be a bit dated, and will require curation to prevent old common misconceptions from leaking into the corpus. If we also add copyleft data, the result fares better in terms of breadth of knowledge. There is a project attempting to curate precisely that kind of corpus: huggingface.co/datasets/PleIAs…
PleIAs/common_corpus · Datasets at Hugging Face
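The "painful part" mentioned above is the curation step: filtering a raw corpus down to documents whose license actually permits reuse. A minimal sketch of that idea in Python, using made-up records with a hypothetical `license` field (the real Common Corpus schema may differ):

```python
# Hypothetical allowlist of open license tags: public domain plus
# permissive/copyleft Creative Commons licenses.
OPEN_LICENSES = {"public-domain", "cc0", "cc-by", "cc-by-sa"}

def filter_open(records):
    """Keep only records whose license tag is on the allowlist."""
    return [r for r in records if r.get("license", "").lower() in OPEN_LICENSES]

# Toy corpus standing in for raw scraped data.
corpus = [
    {"text": "Moby-Dick; or, The Whale ...", "license": "public-domain"},
    {"text": "A recent bestselling novel ...", "license": "all-rights-reserved"},
    {"text": "A Wikipedia article ...", "license": "CC-BY-SA"},
]

open_corpus = filter_open(corpus)
print(len(open_corpus))  # 2 of the 3 sample records pass the filter
```

In practice the hard work is upstream of this filter: figuring out what the license of each document actually is, which is exactly the labor-intensive curation the thread is talking about.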