As many suspected:
“Midjourney Founder Admits to Using a ‘Hundred Million’ Images Without Consent”
@moultano it's fair to say that analogy is not exact here (Dryhurst talks about this too), though I think the principle is the same on many fronts.
The scale could never be the same, of course.
@moultano
To the end of preventing memorization?
I feel like it's already past human artists in some ways by having a similarity-queryable dataset if people are worried.
And that post by OpenAI on deduplication, and (ironically) the one on Stable Diffusion's regurgitation that notes the ImageNet LDM shows no significant memorization, make me very not-worried about memorization.
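A "similarity-queryable dataset" just means you can take an image's embedding and ask which training images are closest to it. A minimal sketch of such a query, assuming embeddings (e.g. from CLIP) are already computed — the random vectors and the `query_similar` helper here are illustrative stand-ins, not anyone's real tooling:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 "dataset" embeddings and one "query" embedding, all L2-normalized
# so that a dot product equals cosine similarity.
dataset = rng.normal(size=(10_000, 64))
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)
query = dataset[42] + 0.05 * rng.normal(size=64)  # a near-duplicate of item 42
query /= np.linalg.norm(query)

def query_similar(query_vec, db, k=5):
    """Return indices and cosine similarities of the k closest items."""
    sims = db @ query_vec
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

idx, sims = query_similar(query, dataset)
print(idx[0])  # item 42 should rank first
```

At real dataset scale you would use an approximate nearest-neighbor index rather than a brute-force dot product, but the query semantics are the same.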
@TedUnderwood @moultano definitely: https://openai.com/blog/dall-e-2-pre-training-mitigations/
This one is a big deal!!!
https://arxiv.org/abs/2212.03860
And this one is frustrating but ironically reassuring, given that they find the dataset seems to be the primary problem/mitigator.
(Excerpt from the OpenAI post:) "In order to share the magic of DALL·E 2 with a broad audience, we needed to reduce the risks associated with powerful image generation models. To this end, we put various guardrails in place to prevent generated images from violating our content policy. This post focuses on pre-training mitigations"
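The deduplication mitigation boils down to: if two training images have embeddings that are too similar, drop one of them. A rough sketch of that core test — this is not OpenAI's actual pipeline (they cluster embeddings first to avoid an O(n²) comparison over hundreds of millions of images), and the threshold and data here are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(500, 64))
emb[100] = emb[7] + 0.01 * rng.normal(size=64)  # plant a near-duplicate of item 7
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def dedup_indices(emb, threshold=0.95):
    """Keep the first of each near-duplicate pair (cosine sim >= threshold)."""
    sims = emb @ emb.T
    keep, dropped = [], set()
    for i in range(len(emb)):
        if i in dropped:
            continue
        keep.append(i)
        dup = np.where(sims[i] >= threshold)[0]
        dropped.update(j for j in dup if j > i)
    return keep

kept = dedup_indices(emb)
print(len(kept))  # 499: the planted duplicate was dropped
```

The point of doing this before training is that a model can't memorize a heavily duplicated image if the duplicates were never in the training set.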
@TedUnderwood @moultano They LITERALLY CANNOT DETECT REPLICATION WITH IMAGENET!!! (Pardon my screaming.)
But nobody bothers reading the paper 🙃
@lowd @moultano @TedUnderwood Yeah! Not to mention the model card for stable diffusion explicitly states this issue!
But because of this paper, many people think this was some hidden secret. I was having to argue over this with an ML industry person just yesterday.
@TedUnderwood @moultano DALL-E2 usage also reassures me a ton here.
Otoh I'd be down to see people do differential privacy here: it just feels like enough will never happen on the data sourcing side :/
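For context, the core of differentially private training (DP-SGD, per Abadi et al.) is: clip each per-example gradient to a max L2 norm, sum, and add Gaussian noise scaled to that clip norm. A hedged sketch with made-up numbers — a real run would use a library like Opacus and a privacy accountant to track the budget:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Return a differentially private average gradient (DP-SGD core step)."""
    clipped = []
    for g in per_example_grads:
        norm = max(np.linalg.norm(g), 1e-12)  # avoid divide-by-zero
        clipped.append(g * min(1.0, clip_norm / norm))  # clip to L2 <= clip_norm
    total = np.sum(clipped, axis=0)
    # Noise scale is tied to clip_norm: one example can shift the sum by at
    # most clip_norm, so this bounds any single image's influence.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [rng.normal(size=10) for _ in range(256)]
private_grad = dp_gradient_step(grads)
print(private_grad.shape)  # (10,)
```

The appeal for image models is exactly the memorization question above: DP gives a formal guarantee that no single training image can be extracted, rather than relying on dedup alone.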
@danvanmoll
That is a plausible argument that sort of aligns with how the models work (though it anthropomorphizes as well). I think that most generated art will not rise to the level of copyright infringement. At the same time, artists are justified in their anger when their art is used without permission.
IMO we need to separate the implications of input/training and implications of output/generation.
@Riedl Suspected? This has always been out in the open. All the big image datasets contain copyrighted images for which permission was never given, even the datasets that try to avoid it.
All the big AI companies think this is fair use. Even the big owners (Disney etc.) seem to not want to question that.
@Riedl I think it might help if more people understood exactly how generative AI works?
This seems like a pretty good primer...