“This machine kills AI.”

@scottfgray It's crazy how people don't understand that AI is a tool, not a "get out of work free" card.

I use ChatGPT to help me with programming when I get stuck, but I don't just copy and paste code from it or tell it "write me the whole program" lol

@charadon @scottfgray And even that is dicey, given the phenomenal amount of illegally obtained and processed work that went into the AI, and given where the code you feed into ChatGPT ends up (in the case of proprietary software you work on). Not to mention the ethical aspects.

The only model I'd ever consider using for anything important is a self-hosted, open-source one that went above and beyond to ensure ethically sourced training data. Something like Mistral-7B but more ethical (I guess).

@Natanox @charadon @scottfgray What do you think is a way to guarantee that training data was collected ethically? Do you believe training on such a more ethical corpus can lead to results competitive with Mistral-7B or LLaMA 3? Is it your opinion that one of these models was trained more ethically?

@feliks @Natanox @charadon @scottfgray

The company displaying the licences of the works they used to train their pile of linear algebra. That is the only proof.

@Archivist @Natanox @charadon @scottfgray I believe this to be hardly feasible. Do you?

@feliks @Natanox @charadon @scottfgray

Hardly feasible? Tusky, a free and open-source app for the Fediverse, lists the 20-ish licenses of the code it uses. Not only is it feasible, but by law as written it is required.

@feliks @Natanox @charadon @scottfgray

Google Classroom lists hundreds of licenses for the code from open-source projects it uses. It is feasible and required. Credit must be given where it is due.

@Archivist @Natanox @charadon @scottfgray You seem to be talking about code while I am trying to talk about image data. There is a significant difference in workload between these modalities: people tend to tag their software with licenses far more often than images are tagged in their metadata. The latter is a massive challenge when processing data properly.

@feliks @Natanox @charadon @scottfgray

Untagged images should always be assumed to be protected. Images with licenses, should they be collated into a model, should remain protected under that same license. The same applies to software.

Also, AI models do not respect software licenses any more than they do artistic licenses. Which is definitely a problem.

@feliks @Natanox @charadon @scottfgray

Derivative work is derivative work; the way it was derived is irrelevant.

@feliks @Natanox @charadon @scottfgray

Any AI-generated image from a model trained on even a single CC-BY-SA work needs, according to the license, to be tagged with the name of the original author and shared under CC-BY-SA. Copyleft exists for images too, and so do intellectual property rights.

@Archivist @Natanox @charadon @scottfgray I see these license breaches as a problem too, same as you. I can only hope this leads to a moment similar to OpenWRT, but I doubt it. On the other hand, I don't see how these systems could have been created without the vast amount of training data. In a competitive environment, neglecting these licenses appears to be a competitive advantage. Might shaping this environment, or removing competition from it, be a valid goal? I'm not sure.

@feliks @Natanox @charadon @scottfgray Like I said before, mugging people gives you a very competitive advantage in society. So does threatening minorities. So does slavery. So does human trafficking. So does organized crime.

A system that can't exist without committing large scale intellectual property fraud should not exist.