The chardet open source library was relicensed from LGPL to MIT two days ago thanks to a Claude Code-assisted "clean room" rewrite - but original author Mark Pilgrim disputes that the way this was done justifies the change in license - my notes here: https://simonwillison.net/2026/Mar/5/chardet/
Can coding agents relicense open source through a “clean room” implementation of code?

Over the past few months it’s become clear that coding agents are extraordinarily good at building a weird version of a “clean room” implementation of code. The most famous version …

Simon Willison’s Weblog
@simon lots of public conversation logs for the courts to pore over when the time comes
@simon I think I'm of the opinion that it's impossible to clean room something that is open source with an LLM because the exact source is in the training corpus

@dalias @dvshkn @simon Yes, and this phrase is especially clueless:

“explicitly instructed Claude not to base anything on LGPL/GPL-licensed code”

Is there any evidence, or even indication, that putting this into the “instructions” has any effect whatsoever?

@dvshkn @simon A clean room implementation isn't *required* for something to be non-infringing though, it just makes it easier to prove that you didn't infringe.
@dvshkn @simon One could imagine using a model so small it is provable that it is not capable of reproducing significant chunks of the input. Of course, there may be some echoes of the architecture or whatever, but that seems fair in general whether done by hand or with autocomplete. Whether such a model could be made is an open question.
@simon omg what a shit thing to do
@lkundrak from the guy Mark passed the torch to, no less!
@simon @mattdm I guess this raises the question of whether we need better copyleft licenses to explicitly prevent actions like this. Is that even possible?
@simon APIs are copyrightable in the US following Oracle v Google, so by definition the AI output is a derivative work, and the question is whether it constitutes fair use or not?

@simon

"There are several twists that make this case particularly hard to confidently resolve:"

I really expected one of them to be that LLM output isn't subject to copyright under US law. Since a license is a grant of permissions that would not otherwise exist due to copyright, applying a license to LLM output doesn't make any sense.

No one needs explicit permission to use LLM output.

@gordonmessmer @simon I came to comment the same thing. I would go as far as to say that if the argument for the clean room is that all code was written by Claude Code and not a human, that *must* then lead to the conclusion that the entire code base is not licensed MIT but rather a rare occurrence of what Creative Commons calls “No Known Copyright” (https://creativecommons.org/public-domain/pdm/).
Public Domain Mark - Creative Commons

“No Known Copyright” Our Public Domain Mark enables works that are no longer restricted by copyright to be marked as such in a standard and simple way, making them easily discoverable and available to others. Many cultural heritage institutions including museums, libraries and other curators are knowledgeable about the copyright status of paintings, books and…

Creative Commons
@simon Oh, interesting way to do it; if it were just a language translation then I'd say that's like a book translation and would follow the original copyright; but hmm, splitting it through a design document is pretty clever.

@simon On the legal side, I am not an expert. But I understand the concerns about moving to a more permissive license as regards users' freedom.

And my general feeling is, well, generative AI is technically impressive, but it's really making a mess of the planet and of human relations.

I am not entirely, stubbornly opposed (:p), otherwise following you would be masochism ;), but I struggle to find benefits in these tools, for us, as a society.

@simon

It's clearly not "clean-room" - but as has been pointed out, that may or may not be necessary to relicense. I'd call it a re-implementation, but again it's unclear how that affects licensing - these are very uncharted waters.

But here's a new wrinkle: at least in my current understanding, you can't copyright AI-generated code. Doesn't that imply you can't impose a license on it either? IANAL, but my take is this re-implementation at best produces an unencumbered implementation, but that's also hinky - can I feed copyrighted code into an AI to strip the copyright? Probably not, but that is arguably what has happened here, albeit indirectly.

It seems to me the original license was clearly violated when the AI provider ingested the licensed code as part of the training corpus - the resulting AI data is *clearly* a derivative work. Put that in your pipe and smoke it!

(My gut says the real root issue here is copyright started breaking the day it applied to something other than books, and each media change breaks it more. It needs to be replaced by a better system or removed entirely, but so much money is involved by vested interests it never will be...)

AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

TL;DR: The big tech AI company LLMs have gobbled up all of our data, but the damage they have done to open source and free culture communities is particularly insidious. By taking advantage of those who share freely, they destroy the bargain that made free software spread like wildfire.

Youssuff Quips

@simon Obviously! The source code is the code required to produce the binary. That code was LGPL. It doesn't matter how many algorithms, nor the nature of the algorithms, it goes through to become those 1s and 0s.

#law #lawfare #computerScience #intellectualProperty #licensing #FOSS #GNU #LGPL #MIT #code #softwareEngineering #LLM #codeWashing

@simon the argument that this is not derivative leans heavily on the determination from JPlag and its self-proclaimed state-of-the-art status. This exemplifies again how plagiarism-detection tools are falling behind in the red-queen race with AI.

@mnmlst @simon Working around a specific offline-runnable plagiarism tool naively seems trivial: simply train an obfuscation system against its score.

Compare how easy it was to fool earlier ROUGE text summarisation scores:

https://www.dr-hato.se/research/abuserouge.pdf

@simon I'm so intrigued by the position (taken by a few of the commenters there) of “this would be ok with a different name”, which seems to imply that there isn't a copyright issue at all (otherwise it would not be fine) but that the license is somehow attached to the package name and not the implementation.

Dan clearly has the right to release software with that name - he's the maintainer - and in the eyes of some he's also allowed to produce a clean-room implementation, but he's not allowed to release that implementation under the original name?

I see the position but it seems very narrow.

@simon

> Claude itself was very likely trained on chardet as part of its enormous quantity of training data—though we have no way of confirming this for sure

A note on that: It would be easy to paste snippets of code from the original codebase into Claude and ask it to analyze, attribute, and fill in the next few lines. Depending on what the answers are, they may constitute near-certain confirmation.

@simon is your opinion expressed here and in your blog (namely, that copyright laundering through a plagiarism machine is most likely OK) shared by the other members of the Python Software Foundation board of directors?
No right to relicense this project · Issue #327 · chardet/chardet

Hi, I'm Mark Pilgrim. You may remember me from such classics as "Dive Into Python" and "Universal Character Encoding Detector." I am the original author of chardet. First off, I would like to thank...

GitHub

@simon the situation is interesting and the questions are challenging. It gets even more complex and undefined if you create a new AI-based implementation from an existing AI-based implementation:

I created a Rust implementation of chardet based on this particular chardet v7 release. I decided to pick the original LGPL version for this AI-on-AI implementation (which is by all measures much, much faster than v7).

https://github.com/zopyx/chardet-rust

GitHub - zopyx/chardet-rust: Universal character encoding detector for Python — Rust-powered fork of chardet 7.0.


GitHub
FOSDEM 2026 - Let's end open source together with this one simple trick

@martin one of the authors created https://malus.sh, which looks like satire, but actually does the thing it says.

https://www.youtube.com/watch?v=cahSKUYjuTE

@simon

MALUS - Clean Room as a Service | Liberation from Open Source Attribution

@davidak
The talk was so egregious – should have probably expected that o_O
@simon