Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI

https://sopuli.xyz/post/15590084

https://archive.is/2024.08.05-162750/https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/

Properly following licensing, right?

No, see, because it’s “learning like a human”, and everybody knows that you’re allowed to bypass any licensing for learning. /s

But seriously I don’t know how they make the jump to these conclusions either.

This is a massive strawman argument. No one is saying you shouldn’t have a license to view the content in order to train an AI. Most of the information used to train these models is publicly available and licensed for public viewing.

Just because something is available for public viewing does not mean it’s licensed for anything except personal use.

The strawman here is that since physical people benefit from personal use exceptions in the law, machine learning software should too. But why should they? Since when is a piece of software run by a corporation equivalent to an individual person?

Copyright licensing allows the owner to control how a work is distributed, not how it’s consumed. “Personal use” just means that you can’t turn around and redistribute a work that you’ve obtained. Not that you’re not allowed to consume it in a corporate setting.

Consuming is not the same thing as training.

A program or machine can be a consumer of something, although if you want to be technical you could say the person using the machine is the consumer. In actual computer science we talk about programs consuming things all the time.

In actual computer science you talk about AI all the time as well, but it’s not actually intelligent, is it? It’s just SmarterChild 2.0 and literally has no idea what word it said just before its current one. Words are often used inappropriately. The only thing computers can consume is data by definition, and consuming data is not the same as implementing it in a language model that you intend to profit from. This is data theft.

How intelligent it is or isn’t is irrelevant. We talk about much dumber programs than AI as being consumers of files and data, including things like compilers. Would it not be personal use for you to view a picture in a photo viewer or try to edit it in GIMP?

It’s not data theft at all unless the courts and the law say it is. Ranting on Lemmy won’t change that fact. Theft is a construct of law.

You can try to equate humans to computers all day, and you can even pass laws that say they’re the same thing. That does not make it true. A company using software to profit off data they have not licensed (whether it’s public or not does not matter! That is not how copyright law works!) is theft.

I am not equating humans with computers. These businesses are not selling people’s data when doing AI training (unlike actual data brokers). You can’t say something AI generated is a clone of the original any more than you can say parody is.

I absolutely can. Parody is an art form, which is something that can only be created by human beings. AI is an art laundering service. Not an artist.

The law should reflect that these companies need to be first granted permission to use datasets by the rights holders, and works under Creative Commons licenses need to be given an opportunity to opt out of being crawled for these datasets. Anything else is wrong. Machines are not humans. Creative Commons copyright law was not written with the concept of machines being “consumers”. These companies took advantage of the sudden emergence of these models and the delay of law in holding their hunger for data in check. They need to be held accountable.

There are already anti-AI licenses out there. If you didn’t license your stuff with that in mind, that’s on you. Deep learning models have been around for a lot longer than GPT-3 or anything that’s happened in the current news cycle. They have needed training data for that long too. It was predictable that stuff like this would happen eventually, and if you didn’t notice in time it’s because you haven’t been paying attention.

You don’t get to dictate what’s right and wrong. As far as I am concerned all copyright is wrong and dumb, but the law is what the law is. Obviously not everyone shares my opinion and not everyone shares yours.