| Host | https://zacyu.com |
| Accept-Language | en-US,en,zh,ja |
"AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material."
tl;dr: If you use public domain data (i.e. you don't steal from authors and creators), you can train an LLM just as good as what was cutting edge a couple of years ago. The difficult part is curating the data, but once it has been curated, in principle everyone can use it without having to go through the painful part again.
So the whole "we have to violate copyright and steal intellectual property" argument is (as everybody already knew) total BS.
This is my PhD thesis
I did not ask for this
I did not consent to this
I did not approve of this
I was not compensated for this
I would not have advised this
I do not like this
And worst of all, the number of people who've read my thesis has still not increased.