As far as I understand, #gnu #gpl #agpl and other common #copyLeft licenses don’t prevent OpenAI and friends from using licensed content for training. The reason is that the model itself isn’t a derivative work: the training data is used transformatively, in an abstract format.

This feels unethical. An LLM cannot provide value without training data. It is especially bad because OpenAI, at least, specifically claims to source its data ethically.

In addition, there is no public information about their data collection process. What happens when they encounter content under #ccNC or other restrictive licenses?

I suppose that’s not a problem unique to AI companies, though.