The most important thing to keep in mind when watching the NYT vs. OpenAI case is that we really want either an absolute annihilation of GenAI as a possibility, OR we want a relatively restricted judgement.

Importantly, what we do not want is a world where only Meta/Google/Microsoft can afford to make these tools, and only outlets like NYT/USGov/Major Publishers can create and influence models.

It's a pretty narrow path, tbh. I'm worried.

Just to make my stance clear: It'd be very sad if we basically decide all data science is illegal. The precedents of "analysis and modeling of a corpus are not fair use now" would be quite bad in a world that's in desperate need of a whole lot of very smart logistics very quickly.

Judgements that favor OAI/MS/G/M/Anth via making a moat will actually *increase* the amount of AI you see in stuff, because well now they paid for it.

Similarly, I don't think I want a world where we get those models AND they're only trained on material that can be held under a single corporate umbrella. We'd immediately take many steps back on our goal of unbiasing the data that informs the models. It'd offer INCREDIBLE power to the governments and their favored media outlets. It'd offer even more power to publishers, who would just make all their clients waive rights, and collect even more profits.

Sadly lost in all this is the idea of small individual artists getting their due for their work or having choices here.

Probably the notion of an independent and non-unionized artist will vanish from western nations. Even before GenAI, we've already seen the total erosion of that as a professional category over time.

@Elucidating not only that but it would basically criminalize everyone who learned a language by consuming media...

https://infosec.space/@kkarhan/112050659129867665

@kkarhan No, no I don't think that's true. Look, we need to acknowledge that this is a question about fair use! Is it fair use to build a mechanical model off someone's published materials, or not.

I think this decision has really far-flung implications (e.g., congratulations now Google has to pay spammers for the right to filter them from gmail, how's that gonna go?)

@kkarhan Now that we've established that LLMs are not actually sentient or about to become sentient, we can see that they're a modeling technique with lots of cool applications.

So, do we suggest the overall value to the world of this modeling permits fair use, or do we say that it does not.

I suspect the courts will maximize government authority, which is to say to maximize megacorp authority.

@Elucidating how about BOTH being bad?

The one as being a waste of computational resources and the other as being assholes who refuse to understand how learning works.

Because otherwise a shitton of people would be lifetime #DebtPeon|s to the #Contentmafia:
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/

@kkarhan On that link, I know a person whose code copyright GitHub Copilot very obviously infringed: when prompted, it'd reliably reproduce their code without license markers or attribution.

They are a world renowned expert. A model reproducing work to its customers without attribution is clearly and obviously a violation of copyright.

Copyright is a fucky wucky law I'd love to see annihilated, but I want to be realistic about what the real outcomes here can be.

@Elucidating Given that for every programming language there are only finitely many ways to do something efficiently and keep the code readable, that'll inherently happen!

For example there are #finite ways you can actually boot an operating system on every architecture - sometimes there may just be a single one.

In fact, I'd be surprised if #Linux and #Windows don't have several lines of identical [assembly and/or C] #code by virtue of how #ix86 & #amd64 work as architectures...

Just like there are only finitely many permutations of car designs in terms of engine placement, drivetrain, steering wheel, etc. given the constraints and requirements regulators in key markets (EU, USA, Japan, "P.R." China, India, ...) mandate for road-legality.

It may sound diminishing, but given any engineering task, equally qualified and trained personnel will provide identical results by virtue of having the same thoughts at hand.

Programming in this regard is like a mixture between scrabble, pipedream and electric installations:
Repeating challenges and very much finite correct solutions.

In this regard, if said model was trained on the same reference literature the programmer used, it'll literally spit out the same code!!!

@kkarhan Being really crystal clear: if an LLM spits out a perfect slice of optimized low level linear algebra code that you can search the text of on github and find something that is NOT public domain?

That's definitely copyright infringement. I must concede that under the letter of those laws and the expectations of those authors: it is so.

Let's not pretend that there isn't causality here. Models had to have their training methodology updated to disincentivize this.

@kkarhan Is copyright good? No. Can I magic it away with enough anarchist rhetoric? No. Is it good for microsoft to re-present academic work someone else did as its own unique output, while charging for it? Obviously not!

@kkarhan Most code, it doesn't matter too much. But some code, it can be quite distinct and unique.

And in fact, if I prompt it correctly I can get Copilot to obviously spit out code I wrote for Erlang quite a while ago. Why? Because I'm literally the only person in the universe who cared to and could write it at the time.

@kkarhan I have long since changed the license to be public domain, as per my ethical view on this issue, fwiw.

@Elucidating The problem with #PublicDomain is that this doesn't work in every jurisdiction.

For example, #Germany doesn't acknowledge Public domain but only lapsed Copyright, as one can't legally disavow authorship.

Otherwise it would be impossible to prosecute #Hatespeech...

So yeah, as shit as it sounds, #0BSD as released by @landley is kinda a "necessary evil" in terms of #licensing...

https://en.wikipedia.org/wiki/BSD_licenses#0-clause_license_(%22BSD_Zero_Clause_License%22)
