There is no Claude, just other people's code

Michiel Leenaars for @nlnet at @fosdem

@utopiah @nlnet @fosdem just a reminder, the whole training set thing hit devs before ChatGPT, with the initial drop of the closed beta of Copilot, and I was there. I processed my thoughts on it before it became a discussion of any sort.
@tofu @utopiah @nlnet @fosdem Sorry to ‘cold call’ ask, but can I find your thoughts on this somewhere that I can reference in my research?
@kattebel @utopiah @nlnet @fosdem nope, not really, never wrote them down
@tofu @utopiah @nlnet @fosdem No worries. :-)
I’m researching the impact of these genAI tools on participation in FOSS. So I’m always interested in hearing what other people have to say about it. Have a good day.
@kattebel @utopiah @nlnet @fosdem I just mostly came to terms with the fact that it's stuff I put out there to the public and was mostly wondering if it could be used to enforce copyleft somehow, because that's the thing I care about the most
@tofu @utopiah @nlnet @fosdem That sentiment is shared by many. Though most devs I’ve spoken to so far did not even mention the licensing aspect. Generally they say we can’t put that jack back in the box.

@tofu @utopiah @nlnet @fosdem @kattebel I do fully agree with the jack-in-the-box or Pandora's box argument. That armada has sailed and is across the ocean already.

The more interesting parts now are:

  • Getting to reasonable adjudication of what is a copyright-worthy individual work (e.g. the famous Quake III fast inverse square root with its magic number) and what is just a random 4-line switch clause or the 5000th quicksort that anyone who has seen a CS 101 book from 5000 km away could have written. In the past, Copilot famously regurgitated full, recognizable works of others, or very large parts of them (that magic-number function plus its surroundings, for instance). Now, especially given the aforementioned trivial cases, this might not be so much the case anymore.
  • Getting to a point where attribution of works used by the Claudes etc., in amounts that go past the threshold of copyrightability, is possible without resorting to the usually futile nightmare that is after-the-fact snippet detection.

  • Ideally a technical evolution of models enables such attribution automatically and the harnesses that use them (e.g. Claude) enable the same with what they grab from the web while working. Imagine variants of standardized metadata embedded in sites with code snippets for example.

  • Assuming that commonly accepted OSI-approved licenses apply to what's consumed, one could derive & generate the required third-party notices or publication requirements using common, established tooling and allow the human behind the machine to decide if they're willing to accept those or choose different components.

  • Thinking one step further, especially along the "embedded metadata" idea: given that the centralized models already have an established payment link from the attached human, one could think about automated (micro-)payments for the providers of the supply chain materials.

  • This could work especially well with common scenarios such as GPL/commercial dual licensing, where corporates are often more than happy to pay threefiddy and proceed with their product rather than go through the source code publication process. Infrastructure for this could either be funded by a sufficiently small share of the transaction or priced into model access.
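
To make the metadata and notice-generation idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SnippetSource` record, the `third_party_notices` and `needs_human_decision` helpers, and the short copyleft list are hypothetical and do not describe any existing tool or standard. The only borrowed convention is naming licenses by SPDX identifier.

```python
from dataclasses import dataclass


# Hypothetical metadata record a site might embed next to a code snippet,
# or a model harness might attach to material it pulled in while working.
@dataclass(frozen=True)
class SnippetSource:
    url: str
    author: str
    spdx_id: str  # SPDX license identifier, e.g. "MIT" or "GPL-3.0-only"


# Illustrative and deliberately incomplete list of copyleft identifiers.
COPYLEFT = {"GPL-2.0-only", "GPL-3.0-only", "GPL-3.0-or-later", "AGPL-3.0-only"}


def needs_human_decision(source: SnippetSource) -> bool:
    """True if the license carries publication duties the human should weigh."""
    return source.spdx_id in COPYLEFT


def third_party_notices(sources: list[SnippetSource]) -> str:
    """Build a plain-text notice file from the sources that were actually used."""
    lines = ["THIRD-PARTY NOTICES", ""]
    # Deduplicate, then sort so the output is stable across runs.
    for s in sorted(set(sources), key=lambda s: (s.spdx_id, s.url)):
        flag = " [copyleft: review before use]" if needs_human_decision(s) else ""
        lines.append(f"- {s.url} by {s.author}, licensed {s.spdx_id}{flag}")
    return "\n".join(lines)


if __name__ == "__main__":
    used = [
        SnippetSource("https://example.org/sort", "Alice", "MIT"),
        SnippetSource("https://example.org/parser", "Bob", "GPL-3.0-only"),
    ]
    print(third_party_notices(used))
```

The flagged entries are where the human behind the machine would decide whether to accept the obligations, pick a different component, or, in a dual-licensing setup, pay for the commercial option.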

#llm #claude #copilot #ai #foss #fosscompliance #compliance #monetization #micropayments #legalbubble

@jti42 @tofu @utopiah @nlnet @fosdem This may be tangentially related: someone proposed a similar idea for the use of music in podcasts. And of course, this is quite akin to YouTube’s Content ID.
@utopiah @nlnet @fosdem
And there is no meaningful alt text. Only BS.
@utopiah @fosdem @nlnet They stole the code, some of which is GPL-licensed, to repackage it and sell it as a service.