"AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material."

tl;dr: If you use public domain data (i.e. you don't steal from authors and creators) you can train an LLM just as good as what was cutting edge a couple of years ago. What makes it difficult is curating the data, but once the data has been curated, in principle everyone can use it without having to go through the painful part.
So the whole "we have to violate copyright and steal intellectual property" argument is (as everybody already knew) total BS.

https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html?src=rss

@j_bertolotti It'd be interesting if AI companies lobbied to increase the number of works in the public domain by decreasing copyright duration. That's something I'd actually support. Copyright is too long. And it would then be a legal, more ethical industry instead of a pack of VC-funded thieves. Strange bedfellows.
@shanecelis @j_bertolotti
I suspect this idea will get increasing traction, as current copyright principles experience increasing strain.
Curious - what's an acceptably shorter period of time? Does it vary with type of work?
@leafless @shanecelis @j_bertolotti I would say, especially software copyrights should be much shorter.
@martinvermeer
And require source code escrow
@leafless @shanecelis @j_bertolotti

@notsoloud @martinvermeer @leafless @shanecelis @j_bertolotti

Copyright on software can be annoying; I am very glad #CopyLeft, #OpenSource and #AllRightsReversed software exist.

But #SoftwarePatents are a curse upon all existence.

Would you like a cool mini game in your 10-minute loading screen? #NopePatented