"There are several twists that make this case particularly hard to confidently resolve:"
I really expected one of them to be that LLM output isn't subject to copyright under US law. Since a license is a grant of permissions that would not otherwise exist due to copyright, applying a license to LLM output doesn't make any sense.
No one needs explicit permission to use LLM output.
“No Known Copyright” Our Public Domain Mark enables works that are no longer restricted by copyright to be marked as such in a standard and simple way, making them easily discoverable and available to others. Many cultural heritage institutions including museums, libraries and other curators are knowledgeable about the copyright status of paintings, books and…
@simon On the legal side, I am not an expert. But I understand the concerns about moving to a more permissive license as regards users' freedom.
And my general feeling is, well, generative AI is technically impressive, but it's really making a mess of the planet and of human relations.
I am not entirely stubbornly opposed (:p), otherwise following you would be masochism ;), but I struggle to find benefits in these tools, for us, as a society.
It's clearly not "clean-room" - but as has been pointed out, that may or may not be necessary to relicense. I'd call it a re-implementation, but again it's unclear how that affects licensing - these are very uncharted waters.
But here's a new wrinkle: at least in my current understanding, you can't copyright AI-generated code. Doesn't that imply you can't impose a license on it either? IANAL, but my take is this re-implementation at best yields an unencumbered implementation, but that's also hinky - can I feed copyrighted code into an AI to strip the copyright? Probably not, but arguably that's what has happened here, albeit indirectly.
It seems to me the original license was clearly violated when the AI provider ingested the licensed code as part of the training corpus - the resulting AI data is *clearly* a derivative work. Put that in your pipe and smoke it!
(My gut says the real root issue here is copyright started breaking the day it applied to something other than books, and each media change breaks it more. It needs to be replaced by a better system or removed entirely, but so much money is involved by vested interests it never will be...)

TL;DR: The big tech AI company LLMs have gobbled up all of our data, but the damage they have done to open source and free culture communities is particularly insidious. By taking advantage of those who share freely, they destroy the bargain that made free software spread like wildfire.
@simon Obviously! The source code is: that code required to produce the binary. That code was LGPL. It doesn't matter how many algorithms, nor the nature of the algorithms, it goes through to become those 1s and 0s.
#law #lawfare #computerScience #intellectualProperty #licensing #FOSS #GNU #LGPL #MIT #code #softwareEngineering #LLM #codeWashing
@simon I'm so intrigued by the position (taken by a few of the commenters there) of “this would be ok with a different name”, which seems to imply that there isn't a copyright issue at all (otherwise it would not be fine) but that the license is somehow attached to the package name and not the implementation.
Dan clearly has the right to release software with that name - he's the maintainer - and in the eyes of some he's also allowed to produce a clean-room implementation, but he's not allowed to release that implementation under that name?
I see the position but it seems very narrow.
> Claude itself was very likely trained on chardet as part of its enormous quantity of training data—though we have no way of confirming this for sure
A note on that: it would be easy to paste snippets of code from the original codebase into Claude and ask it to analyze, attribute, and fill in the next few lines. Depending on the answers, this may constitute near-certain confirmation.
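The probe described above can be sketched as a small script: build a prompt asking the model to attribute a snippet and continue it verbatim, then compare the model's continuation against the real next lines. This is an illustrative sketch only; the function names, the prompt wording, and the 0.9 similarity threshold are my own assumptions, not part of any existing tool, and the actual model call is left out.

```python
# Hypothetical sketch of a memorization probe for a code snippet.
# probe_prompt, looks_memorized, and the threshold are illustrative choices.
from difflib import SequenceMatcher


def probe_prompt(snippet: str, n_lines: int = 5) -> str:
    """Build a prompt asking a model to attribute a snippet and continue it."""
    return (
        "Here is a fragment of source code:\n\n"
        f"{snippet}\n\n"
        "1. Which project does this code come from?\n"
        f"2. Continue it with the next {n_lines} lines, verbatim if you can."
    )


def looks_memorized(model_continuation: str, actual_next_lines: str,
                    threshold: float = 0.9) -> bool:
    """A continuation nearly identical to the real source strongly suggests
    the snippet appeared in the training data."""
    ratio = SequenceMatcher(None, model_continuation.strip(),
                            actual_next_lines.strip()).ratio()
    return ratio >= threshold
```

A near-verbatim continuation is the interesting signal here: paraphrased or generic completions score low, while a high similarity ratio on several independent snippets would be hard to explain without the original in the corpus.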
@simon the situation is interesting and the questions are challenging. It gets even more complex and undefined if you create a new AI-based implementation from an existing AI-based implementation:
I created a Rust implementation of chardet based on this particular chardet v7 version. I decided to pick the original LGPL version for this AI-based-on-AI implementation (which is by every measure much, much faster than v7).
@simon
So it seems this talk from FOSDEM became reality almost in an instant:
@martin one of the authors created https://malus.sh, which looks like satire, but actually does the thing it says.