RE: https://sfba.social/@drahardja/116311524946860153

the recent “clean room implementation” tools really hammered this home for me.

generated images always felt icky because visual. code felt more diffuse and less emotional for me. even though the process is the same.

but now #FOSS projects can be targeted and stripped of author, copyright, and effort with a single command.

not that I expected much from them lately, but it sure would be nice if the #FSF had an opinion on the end of copyleft.

@hbons Phrased differently it's also the end of copyright and that was always the goal of the Free Software movement or not? With copyleft being a necessary hack for the current system.

Next let's make sure the patent system goes down the same path.

What we're losing here is mostly attribution. (And gaining various other problems but you talked about copyleft specifically)

@slomo @hbons "the end of copyright" only benefits small entities in the mind of MIT techno libertarians of the '80s. Like all libertarians, they fundamentally don't understand the systems they deeply rely on. For instance: if you remove copyright, the people that will benefit are going to be the ones with the most power and resources, as it was before copyright was introduced.
@slomo @hbons And I say "misunderstand" in a very loose sense, here: the MIT techno libertarians wanted to replace the old boot with their own, and were starting to get the money and resources to do so at the time, so they knew perfectly well what they were doing…
@ebassi @hbons I'm not sure that's very different now. Who is most benefiting from the current copyright system if not publishers and e.g. Disney? While in theory it should benefit the actual authors that doesn't seem to be the effect in practice.
@slomo @hbons sure, it's bad; do we need to just roll over and die, then? Removing *all* the guardrails is not going to usher in a better alternative.

@ebassi @slomo @hbons Indeed, the only thing that is going to usher in a better alternative, is to work out what that better alternative is, campaign for it, and implement it.

(As always.)

@pwithnall @ebassi @hbons And whatever the alternative is, the situation has changed in a way that can't be turned back.
@slomo @pwithnall @hbons on this, I do not agree, and it's one of the reasons why I find discussing this stuff with a lot of people pointless: you have already given up, so there's no point in convincing you
@ebassi @pwithnall @hbons What's the alternative you're advocating for then?
@slomo @pwithnall @hbons the alternative is working to do harm reduction in projects that have already adopted permissive contribution guidelines; to introduce less permissive contribution guidelines in projects that haven't; to figure out licenses that strengthen the enforcement of licensing terms; to contribute to legal funds for license enforcement. In short, to put up a bit of resistance, instead of just folding like a lawn chair, and saying: "this is how it is"
@ebassi @slomo @pwithnall @hbons That's too vague for me to understand. Let's focus on one thing at a time. What do you mean by enforcement of licensing terms here?
@nirbheek @slomo @pwithnall @hbons we're talking about genAI-based "clean room" reimplementations; those should not be allowed, unless you can demonstrate that the training data set is actually clean. This should be encoded in the licensing terms and copyright law.

@ebassi @slomo @pwithnall @hbons "should not be allowed" is a moral argument or a legal argument?

If it's a legal argument, it's not clear whether these are actually derived from the original code if the LLM did not access the original codebase during development.

It would be good to get that clarified through some legal processes. I think it would be good for SFC to focus on that.

@nirbheek @slomo @pwithnall @hbons it's both a legal argument, in terms of using the power of courts to enforce it; and a political one for free and open source projects, because FLOSS is still a political movement.

The legal argument is based on court discovery, just like discovery is how it works in cases of using tainted human knowledge in supposedly clean room reimplementations.

@ebassi @slomo @pwithnall @hbons I do not think we are in a position to make legal arguments. Lawyers must cringe when we do that, we are woefully misinformed on the subject. Which is why I said that SFC should do it.

As for it being a political argument, we can talk about it sometime, but I disagree with it on a very fundamental basis. I'll write a blog post one of these days.

@ebassi @nirbheek @pwithnall @hbons Whether something is a "clean room" implementation or not is also not that easy. Maybe the LLM saw other implementations of the same thing during training, but is that different from you having read some other implementation of something some time before in your life and then writing your own? In either case it's not like the LLM or you can recite* any of the originals but you have an abstract model of it and anything else you ever learned that you work from.

* Give it a try: let a recent LLM write you the FreeBSD implementation of /bin/yes that was surely in its training data and is small enough, and then compare to the original. Or maybe more interesting: let an LLM write Rust/GStreamer code, and you'll clearly see that it learned from code I have written. Just like most humans did if you look over github/etc. Unlike what some humans do, you don't see 1:1 copy&paste of whole little helper functions though.

Very different to that is the case when you (or an LLM) actively look, compare, copy another implementation during development. (Which is also probably more common than we'd like to pretend based on all the code I've seen over the years)

I'm sure we're going to have lots of interesting philosophical discussions between lawyers and courts in the future, ideally with outcomes that don't backfire at us.
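
The /bin/yes experiment from the footnote can be roughed out like this. It is only a minimal sketch using Python's standard `difflib`: the helper names `similarity` and `longest_shared_run` are made up for illustration, the two sample strings are placeholders rather than real FreeBSD or LLM output, and a textual score like this says nothing about whether something is legally a derived work.

```python
import difflib


def _lines(source: str) -> list[str]:
    """Split source into stripped, non-empty lines so whitespace-only changes are ignored."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def similarity(original: str, generated: str) -> float:
    """Return a 0..1 line-level similarity ratio between two source texts."""
    matcher = difflib.SequenceMatcher(None, _lines(original), _lines(generated), autojunk=False)
    return matcher.ratio()


def longest_shared_run(original: str, generated: str) -> int:
    """Return the length, in lines, of the longest block that appears in both texts."""
    a, b = _lines(original), _lines(generated)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


if __name__ == "__main__":
    # Placeholder inputs (hypothetical): paste the real original and the LLM output here.
    original_src = "int main(void)\n{\n    for (;;)\n        puts(\"y\");\n}\n"
    generated_src = "int main(void)\n{\n    while (1)\n        puts(\"y\");\n}\n"
    print("similarity:", similarity(original_src, generated_src))
    print("longest shared run (lines):", longest_shared_run(original_src, generated_src))
```

A long shared run would hint at the 1:1 copy&paste case, while a high ratio with only short shared runs looks more like "same idea, written again".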

@slomo @ebassi @nirbheek @hbons Just because humans are bad at license and copyright attribution on a small scale does not mean that LLMs should be allowed to get away with bad license and copyright attribution on a vast scale.

1/3

@slomo @ebassi @nirbheek @hbons As a thought experiment: if there was an LLM which had been trained purely on (say) LGPL-2.1+ code, had low environmental impact, was not funded by VCs who are counting down the time until they turn on the monetisation switch, and which ran local-only and didn’t exfiltrate your stuff to the cloud, and someone used it to rewrite my project, I think I would still be massively pissed off.

Why? Because it’s a social problem.

2/3

@slomo @ebassi @nirbheek @hbons Why rewrite my code rather than contributing to it? Why relicense someone’s project from GPL to something more permissive? What’s the motive?

Both from the point of view of the people who have released FOSS code which has been put in training datasets, and from the point of view of people who receive unexpected LLM contributions, LLMs enable this kind of anti-community/anti-social behaviour at a huge scale.

That’s my current thought-in-progress, anyway.

3/3

@pwithnall @ebassi @nirbheek @hbons Thanks for writing that down!

That covers many different aspects, and in general I agree on many of those, but if I explained in detail I would end up writing a little book, so I'll leave it at this for now until I can write it more concisely 🙂

Just one thing about the motivation of rewrites and relicensing. We had these discussions pre-LLM too, it's just "easier" now to do a rewrite (but then also, who's going to maintain that? Surely not claude with its goldfish memory). Or produce a stream of garbage MRs/etc that nobody can review in finite time (and outsource maintenance of that to us poor humans).

@slomo @pwithnall @ebassi @hbons I would like to add a few things to what has been said.

1) There are some strong assumptions being made about how good LLMs are at replicating the functionality of a project's codebase. Having used LLMs a lot, my model of how they work is that they operate on code as if it's an n-dimensional parameter space, and they tweak parameters till the feedback loop is satisfied (such as tests pass)[α]. The resulting code is weird and unmaintainable[β].

2) ... unless the user prompting the LLM is an expert in the domain or is a maintainer of the project, and at that point it's really the user projecting their intent by being heavily involved in the creation of the new codebase. This is what happened to chardet.

3) in the case of chardet, I find the case indistinguishable from the maintainer of a codebase rewriting it from scratch and changing the license. It reminded me of how Wim wrote large parts of GStreamer (LGPL), and then went and wrote PipeWire which is very similar to GStreamer and could potentially replace it, and it's MIT licensed.

4) this goes back to the best explanation I have for LLMs: they are amplifiers for the author's intent *and* their capabilities.

5) IMO instead of worrying about what random people can do with LLMs (they cannot achieve much), we should reflect on why *authors* seem to care so little about copyleft nowadays. Open-source "won" but copyleft is in decline and nearing irrelevance. And that has nothing to do with LLMs.

(5/6)

α. with the caveat that the parameterisation is likely to take known-good patterns/paths-on-a-parameter-curve learned from the training data.
β. you only start seeing this once the codebase becomes large enough, say >2000 LoC, and it's easy for inexperienced devs to miss it. LLMs are not a turn-key solution, and cannot be used to make such a solution.

@slomo @pwithnall @ebassi @hbons

6) even in a hypothetical future in which LLMs are able to one-shot a rewrite from LGPL to MIT/BSD, we have to remember to not over-focus on code. The value of a project is in the community and maintainers around it that continue to evolve it as per the changing needs of the world.

There are companies out there that hire contractors to write SaaS, sell seats to businesses, then fire all the contractors. Do you think this is a good way to build a product or run a project? Why do we think that this way of doing things is a threat to us if it's done with LLMs this time around?

The whole idea is fundamentally flawed.

@nirbheek Yes, agreed wholeheartedly. In the case of chardet I think there’s too much focus on the LLM aspect, and not enough focus on the question of why the maintainer wants to change the license (and what other avenues they could have explored, and maybe did, I haven’t checked, to achieve that before using the big hammer of rewriting everything).

Even if LLMs need expert steering right now, I could see that changing, so arguments based on that will become irrelevant fast.

@pwithnall @nirbheek for chardet, the maintainer explained the reasoning (and whole process) in https://dan-blanchard.github.io/blog/chardet-rewrite-controversy/

OOC what's your opinion about the Rust rewrite of the GNU coreutils? Also under a more liberal license and it's probably impossible that not a single one of the authors ever looked at the GNU implementation

Everything Claude Saw: A Transparent Account of the Chardet v7 Rewrite

Exactly what Claude accessed from the old chardet codebase during the v7 rewrite, with evidence from the raw conversation transcripts.

Dan Blanchard
@slomo @pwithnall @nirbheek I have been personally skeptical about uutils and I'm also a little unhappy about it being permissively licensed too. I've been told that rewrites into new programming languages are not in themselves considered creative enough to be independent rather than derivative, but I don't think that particular theory has been tested yet. Even so, I'm not saying uutils isn't a new creative implementation. It's definitely novel. But I am not sure it isn't derivative.

@slomo @nirbheek I don’t know enough about the rewrite of coreutils to comment on it.

Generally though, I am against more liberal (non-copyleft) licenses, as they are not reciprocal enough.

Pioneering the Future of Code Preservation and AI with StarCoder2 - Software Heritage

Software Heritage’s mission is to collect, preserve, and make the entire body of software source code easily available, especially emphasizing Free and Open Source Software (FOSS) as a digital commons...

Software Heritage
Commonly allowing code to be stolen and used by the powerful is not progress, it is technical feudalism.
#LLM #copyleft