Is there any benefit to the author of letting a for profit LLM like OpenAI spider and consume their writing? I can’t think of any.

If that’s true, then should we be rethinking Creative Commons licenses? I’m wondering if the current one we use at Medium is obsolete.

At least with something like Google there was an exchange of value: contribute to their search results and get traffic back.

@coachtony This is something I've been thinking about for a while. I imagine LLM training will create the next generation of paywall and poses a very real risk to the free and open internet. I'd like to see a standard equivalent to robots.txt for training, though I doubt most will respect it.
@mike @coachtony actually, violating the robots.txt is a felony in Germany.
@kkarhan Really? Any source for this?
@mike @coachtony robots.txt already allows you to specify that some pages should not be crawled.
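For example (the paths here are hypothetical), a minimal robots.txt that asks all crawlers to skip certain directories looks like:

```
# Applies to every crawler that honors robots.txt
User-agent: *
Disallow: /drafts/
Disallow: /private/
```

Compliance is voluntary, though; nothing forces a scraper to read the file.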

@coachtony not that I’d do it, but there is one thing to consider. Much of what was fed into the AI was likely made by white, cisgender, heterosexual, non-poor men, which means that’s the lens all of the resulting output will be framed through.

For those of us who don’t fit that category we can get screwed with our pants on or be made more invisible.

So, a typical choice.

@secularshepherdess Ah, forced to feed in writing with diverse viewpoints or face an even more homogenous world? Sounds like a bad choice.
@coachtony given how oppressive homogeneity is to those who don’t want to be like everyone else, it’s a BS choice to be forced to make.
@coachtony The term "Data Dignity" by Jaron Lanier/Glen Weyl comes to mind.
@coachtony What got us Creative Commons and things like "No Discrimination Against Fields of Endeavor" at https://opensource.org/osd/ was not that there is no such thing as good and bad uses, but that it is a fool's errand to try to figure out which is which. I'm not exactly against people wanting to publish under a restrictive license, but I will offer a note of caution that restrictive licenses don't automatically produce the outcome you want either.
@soaproot @coachtony Adding to that, some CC variants require attribution, which disqualifies LLMs that cannot give it.

@hramrach @soaproot @coachtony The problem is, if ML training is fair use, it is legally allowed to "break" all licensing terms. The license may say "you can't use this content at all for anything, ever, nonono, go away!" and it will be perfectly legal to use that content within limits of fair use.

https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/

@coachtony How about contributing our work and knowledge to the education and greater good of all mankind? Sure, if you are thinking about this in a purely selfish manner, it seems like a raw deal, but think of our work shaping the knowledge base of an LLM that helps people all over the world. Works for me! It's not like these AIs are spitting out our writing word for word; we are contributing one or more parameters to a pool of billions.
@jasonjamesweiland You seem hopeful about the future. In the moment though, they are taking the knowledge that you already contributed to the world and remixing it in a way that diminishes it and often makes it wrong.
@coachtony As I said, it is an LLM. Your work is only a minuscule part of it, a bit of data it is trained on. Are you saying that authors don't take the work of all the other authors they read before them and mishmash it into something new? If your work is referenced using a web plugin that quotes search results, you are attributed. Are you saying that you are worried that an LLM that works on billions of parameters is somehow plagiarizing your work? Everything in art is derivative.
@coachtony Haha! The legacy of Coach Tony must be protected at all costs! I'm just messing with you. I see where you are coming from. I just don't think the work of you and I will matter enough in a pool of billions of parameters that we should be all litigious about our data. But, I have been known to be wrong. I am, in fact, an idiot!
@coachtony It does seem very much like the post-Flickr era of open licenses largely became fodder for exploitation in the same way that most of open source software exists largely without any support from the extractive big tech companies which rely on it. Perhaps the requirement should have been equity, not attribution.

@anildash @coachtony does there need to be direct financial benefit here? Not every use of creative material needs to be a transaction. The theory of Creative Commons was to literally make a commons of creative works. Why is it a problem if software is also benefiting?

(I'm motivated by maximizing the current boom in AI research and development. I do worry that these systems benefit large commercial players right now but am hopeful that will change.)

@nelson @coachtony Yeah I’m not sure what the answer to that balance is.
@nelson @anildash well, not financial. But as we know there are many ways to be compensated including status and influence. I don't think there is any meaningful exchange back to the author.
@coachtony @anildash I agree, the individual authors are getting nothing (or in some cases are harmed, as in "make art in the style of <living artist>"). I'm postulating that culture at large will benefit. It's a fraught argument but I think worth consideration.

@nelson @anildash

Yeah, I get it. We are projecting the future and there is no way to know who is right because we aren't going to A/B test this.

I lean more toward "let's have a Butlerian Jihad" but I don't fault you for leaning the other way.

@coachtony @vaurora @nelson @anildash And/or it might lead to a new level of if-you’re-not-ingested-you-don’t-exist. The devil’s bargain being to become part of the babble in exchange for being findable by search engines.
@anildash You know better than I, but don't big companies actually not need this as much? Like, Facebook uses Linux, right – but the reason probably isn't that it's free, it's that it's the best choice. Smaller companies and startups are the ones that rely on it; large ones could afford something else? (I don't know; I don't work in finance for a large company, so maybe I'm wrong.)
@coachtony Are they only scraping openly licensed content?
@ethanwhite @coachtony no. They’re using provisions in law to process even copyrighted data. Much of it is Common Crawl which only checks robots.txt.

@coachtony you may appreciate my thread from yesterday on this topic. i focused on source code + how the license under which it's released impacts this:

https://elk.zone/toot.bldrweb.org/@jbminn/110084548838376306

John Minnihan (@jbminn):

1/ a thread on AI code generators the longer i'm around, the less inclined i am to profess expertise on anything. but i do have lots of experience w/ source code hosting, licenses that apply to it + associated issues. so i'm expert-y i guess. scraping code from a public repo + using it to generate derivative works may be permissible if the original work was licensed to explicitly allow such use. the orig author(s) may be unhappy, but that's how licenses work (generally).
@coachtony Is there any way to prevent it? Not being snarky, but I’m setting up a new site and was genuinely wondering if there’s something I should be adding to robots.txt now.
@jeffwatkins Yeah, robots.txt and paywalls are the most obvious.
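For what it's worth, some crawlers in the training pipeline do publish user-agent tokens you can target. A sketch, assuming the crawlers honor their documented tokens (compliance is voluntary):

```
# Block Common Crawl's crawler, a major source of LLM training data
User-agent: CCBot
Disallow: /

# Block OpenAI's crawler
User-agent: GPTBot
Disallow: /
```

This only helps against crawlers that identify themselves and respect the file; it does nothing about content already collected.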
@coachtony if these LLMs are already getting away with being trained on material that isn't permissively licensed, is there anything a license can even do?
@coachtony like if there were LLMs which were justifying their training set by saying that, oh, it's all CC-licensed so we can use it, then that would be one thing, but GPT-3 is definitely being trained on stuff that isn't licensed *at all*

@coachtony Well, AI can't create copyrightable works, and learning from their database is not a copyrightable or licensable act, since otherwise school textbook publishers would be entitled to compensation (i.e. their income in lieu of a proper license) from every wageworker who learned from their works.

Because that's the de jure equivalent with pirated software...

@coachtony so no, a machine can't create copyright, and trying to make it a license violation to learn and scrape won't fix the issue, as @senficon already pointed out:
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/

@kkarhan @senficon

Thanks, but this paragraph seems blatantly wrong to me.

@kkarhan Wait, why do you think AI can't produce copyrighted works? Aren't they de facto the copyright of the owner of the AI? Or, if the owner says so via TOS, of the director of the AI?

@coachtony @kkarhan A court ruled that they can't, and until that ruling is overturned, that's how it is.

Nonetheless, it ruled that a comic made of AI-generated art is not subject to copyright, even though the comic is a collection of works, and collections are in turn subject to copyright, which is a contradiction 🤷

However, the material that went into the LLM is for the most part copyrighted, and if the AI work is not, then the result is pretty much solely the aggregation of the works that went in. Want to sue Microsoft?

@coachtony My open source code is licensed in such a way that no AI is permitted to casually ingest and regurgitate it.

Attribution is mandatory.

The whole "Wild West" of open ingest cannot last. These cannot and must not be opt-out scraping processes.

As it stands, I literally can not touch these tools in relation to anything at work. I'm not even allowed to copy anything out of Stack Overflow, by Romulus.

@alice This reminds me of the early days of corporate adoption of open source, when there was stricter verification of licenses. Enterprises were terrified of accidentally incorporating code with incompatible licenses. However, I feel like the corporate world has become more lawless, adopting an "I dare you to fight in court" mentality.

@coachtony Heh, amusingly, it's still a strong concern for the B2B multinational I work for. When my prior company (+ my project + me) were being acquired, we sent the codebase off to a pair of third-party auditors, one for pen-testing, the other for license auditing.

They found a line of Stack Overflow code. I traced back where I got it, tracked down the answerer's GitHub, and was happy to discover he had packaged up the solution with a proper license.

Panic avoided. 😜

@alice This is why I can't work for big companies. I just can't stomach that as a definition for panic 😂

@coachtony Wasn’t my panic; I mean, in that instance I had just been lazy. It was a line or two of Python. Execs, though, saw a blocker to that acquisition.

Having been lead developer *and* product owner for 8 years gave me some interesting perspective across the realms of engineering and management. 😝

Loved that entire team. 🥰

@coachtony
I feel this line of thinking is much like open source developers who change their license away from an open one when Amazon or Google start offering it as a service.

Does it feel fair? No! Should you change your license? I don't know, maybe.

But you chose to release it under an open license, and now there is a business model that uses that data; there's no putting that genie back in the bottle.

Now that it's here you do need to do the math if you want to keep participating.

@MadVikingGod I think AI is big enough to rethink my prior decisions.
@coachtony OpenAI would contend (wrongly, IMO) that copyright does not come into play. Therefore, they would not care what license or lack of license you apply.
@coachtony I feel kind of wary of "capitalism to save the world" solutions, but I wonder if there wouldn't be space for a startup to ethically generate images, paying authors according to how heavily their input is weighted in the end result, and ideally giving authors the option to buy into the company (a sort of worker-owned cooperative), giving them actual power over its policies to (hopefully) prevent future abuses.
@coachtony Does it hurt you to let it train on your content?
@ech It hurts if it replaces the thing I wanted to say with something worse and/or means fewer people read my version.
@coachtony
Wow, I never thought about CC licenses. Good point.

@coachtony Would you prefer if Disney trained an AI on only images and text that Disney owns, without paying anything to the authors or getting their permission?

Like, I don't get why a world where Disney has an AI generator and nobody else is allowed to make one is somehow better than a world where everybody can make an AI generator.

@183231bcb Are these the only two options? I don't think option two even accurately reflects the status quo. Other options:

* No LLMs at all. This is the Butlerian Jihad of Dune.

* LLMs pay and attribute their source material.

* LLMs have more limited opt-in coverage.

Etc.

@coachtony As long as corporations can own copyright, LLMs "paying and attributing source material" just means Disney training an LLM on media they own and don't have to pay royalties on. Disney can "opt in" their own media as training data for their own LLM.

Maybe you could try convincing legislators to abolish corporate-owned copyright, but that seems very unlikely to succeed.

@coachtony The problem is bigger than that: LLMs don't follow the licenses, so it doesn't really matter what license you use or what special clauses you add to it. I expect LLM companies don't even make an effort to track what license the material they are spidering is under.

@mcc I wonder if any other content companies, e.g. WordPress, Substack, Stack Overflow, care?

In matters of law there are almost always two alternative paths: relationship or power.

I don't think the platforms are powerless here.

Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content

Getty Images is suing Stability AI, creators of generative AI art model Stable Diffusion. The stock photo company claims Stability AI ‘unlawfully’ scraped millions of images from its site.

The Verge
@coachtony @mcc Objaverse just ingested 800k CC licensed 3D models from Sketchfab in their training set. We now have noAI tags for users and no-scraping language in our TOS, but this happened before those went into effect.
@BartV @coachtony Which CC license?
@mcc @coachtony different types, but all CC-BY. We also let creators add clauses like NC, SA, etc. I'm reading that the attribution requirement might be a blocker for AI use, but I haven't found any lawsuits that prove this.

@BartV @coachtony It seems to me that if a court held a derivative work of your scraped data was unbound by license requirements like attribution unless you add a magic additional "noAI" tag (which there probably wasn't even a standard for at the time the data was uploaded), this would be utterly absurd.

I am also not aware of any lawsuits proving any of these scraped large models are *legal*.

@coachtony is there a copyleft license designed for writing?
@coachtony It doesn’t matter what license you use: copyright terms are ignored wholesale when scraping content for datasets used to train these large language/image models. Most of the work they use is copyrighted, but they claim it falls under fair use, an unproven assertion at the center of multiple lawsuits right now.
@andybaio @coachtony This will be interesting to watch. I’m unexpectedly sympathetic to the idea that copyright owners should have some interest in what comes out of the learning meat grinder, just because the magnification power is so large at the other end. It’ll be like the difference between Public Enemy’s Bomb Squad making wholly new material from sampled parts vs. "Super Freak" repackaged by MC Hammer or Jay-Z. How different is new?