"AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material."

tl;dr: If you use public domain data (i.e. you don't steal from authors and creators) you can train an LLM as good as what was cutting edge a couple of years ago. The hard part is curating the data, but once that has been done, in principle everyone can use the result without having to go through the painful part again.
So the whole "we have to violate copyright and steal intellectual property" argument is (as everybody already knew) total BS.

https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html?src=rss

It turns out you can train AI models without copyrighted material

It's just a pain in the ass.

Engadget
@j_bertolotti Um, open-licensed doesn’t mean public domain! Very large parts of their corpus, for example, are 100% copyrighted. All of Hansard and Wikipedia are copyrighted and made available under an open licence, with conditions: for example, any generated text that’s obviously derived from them would be legally required to credit the source. Is their LLM proposing to do that?
@Adzebill @j_bertolotti I'm still waiting for all AI code to be GPL-licensed
@jan_leila @Adzebill @j_bertolotti That would end every company's attempt to force AI on devs 😁