Lemmy may be heading down the path of LLMs

https://leminal.space/post/33312956

Sadly, it seems like Lemmy is going to integrate LLM code going forward: https://github.com/LemmyNet/lemmy/issues/6385 If you comment on the issue, please try to make sure it’s a productive and thoughtful comment and not pure hate brigading. Consider upvoting the issue to show community interest.

Edit: perhaps I should also mention this one here as a similar discussion: https://github.com/sashiko-dev/sashiko/issues/31 This one concerns the Linux kernel. I hope you’ll forgive me this slight tangent, but more eyes could benefit this one too.

Code written with the help of an LLM and openly reviewed is different from what happened with Lutris, where the developer decided to obfuscate their use of AI-generated code.

The approach you suggest, a total ban, is one I can agree with in principle and think is noble. But it could lead to people accusing each other of using AI code whether or not it actually happened, or to others simply hiding it and submitting anyway without the reviewers knowing, which is counter-productive.

I’ve followed Lemmy development for 3 years now; the devs’ approach is slow and steady, to a fault in some people’s views. I think it’s a better use of open source resources if we encourage candor and honesty. If the repo gets spammed with AI-generated PRs, then AI use will probably be blanket banned, but contributors accurately documenting and reporting their use of AI will help direct reviewers’ attention to ensuring the code is not slop quality or full of hallucinations.

In my opinion, this argument is exactly the same as saying “we can’t enforce people not stealing GPL-licensed code and copy-pasting it into our project, so we might as well allow it and ask them to disclose it.”

You can argue that AI is actually useful (which, by the way, seems to be what they did), and that would more fairly justify the policy in my opinion. I don’t think your argument does.

My argument is that a total ban on AI use is more comparable to saying “Code from any other coding project is not allowed”. It will start unproductive arguments over boilerplate, struct definitions and other commonly used code.

The broadness and vagueness of “no AI whatsoever” or “no code from any other projects whatsoever” will be more confusing than saying, “if you do copy any code from another project, let us know where from.” Then the PR can be evaluated, and rejected if it’s nonfree or just poor quality, rather than incentivizing people to pass off other people’s code as their own, risking bigger consequences for the whole project. People can be honest about getting inspiration from Stack Overflow, a reference book, or another project, if they are allowed to be.

I’m not saying AI should be blanket allowed; the submitter needs to understand the code well enough to revise it for errors themselves if the devs point something out. They can’t just say “I asked AI and it’s confident that the code does this and is bug free.”

Then the PR can be evaluated, rejected if it’s nonfree or just poor quality

I don’t get the difficulty of rejecting “if it’s nonfree or just poor quality or known LLM code”.

I don’t think it’s a vague criterion at all. And for many projects, if you tell them it’s from a Stack Overflow post, they will reject it as well unless you can show it’s not a direct copy. I don’t see the difference. Now, whether you think LLMs are worth the trouble to use is a different discussion, but your argument doesn’t convince me. Many bans aren’t easy to enforce; that doesn’t mean they’re bad ideas.

There is also a responsibility and liability question here. If something turns out to be a copyright issue and the contributor skirted a known rule, the moral judgement looks different than if the maintainers had known and included it anyway. (I can’t comment on the legal outcomes since I’m not a lawyer.)

To be specific, the jump you are making is likening LLM output to non-free code. While on the surface that makes sense, it’s much closer to writing something based on copied code. In the US at least, there’s clear legal precedent that LLM output is not copyrightable.

Blanket AI bans are enforceable; I’m not arguing against that. It’s just that I don’t think one is worth instituting, that it’s not a good fit for this project. My argument is that a Lemmy development policy of “please mark which parts of your code are AI-generated and how you used LLMs, and we will evaluate accordingly” is better than “if you indicate anywhere that your code is AI/LLM-generated, we will automatically reject it.”

Beyond memorization: Text generators may plagiarize beyond ‘copy and paste’ (Penn State News)

Language models, possibly including ChatGPT, paraphrase and reuse ideas from training data without citing the source, raising plagiarism concerns.

I don’t mean in any way to imply that your opinion isn’t sound, but simply that I don’t agree with it here, in the context of whether the Lemmy devs should accept PRs with any reported LLM usage.