"I'm concerned about LLM code in #curl and would like to suggest a code ban"

https://github.com/curl/curl/discussions/20972

I'm concerned about LLM code in curl and would like to suggest a code ban (please note this doesn't concern LLM-based code reviews) · curl curl · Discussion #20972

I'm concerned about generative AI LLM code in curl, including AI auto completion use in editors, and I'm wondering whether the project should adopt a policy to ban it. Please note this doesn't invo...

@bagder ethical rules cannot be enforced a priori, they can be breached, but they exist to state what a scientific field expects from its participants.
It would make sense, on the basis of the initial post, for a collaborative programming community to state as an ethical rule that no code-generating software whatsoever is used in submitted code.
(People will argue about word completion.)
@bagder Very good state of the art report about LLM plagiarism. Yeah, that's a difficult problem.
@bagder idk kinda seems like trusting users not to plagiarize would be similar to trusting users not to use a plagiarizing machine, if you wanted to ban llm usage.
@charon but what is "llm usage" ?
@bagder @charon Is it ok to use language models to improve man pages, documentation, the web presence and so on? It's not even only about code.
@djh @bagder @charon I think that’s a separate question. Would it fall under the LLM ban? Yes. Is it ok? For some, the ethical concerns from energy/water consumption + plagiarism to train the models would say that it’s not. But in the end that’s what the policy should decide.

@argentum @bagder @charon agree!

I wanted to add the point of view of language models applied to other artifacts than strictly code (what the issue is about).

Also agree that it's a multifaceted space and refusing to use LLMs on ethical grounds is one valid option.

@bagder Nothing to do with LLMs specifically, but with machine-generated code. If a tool auto-completes a function or logic that has not yet been started, it should be considered "machine generated". Typing "i" and having an if/else sequence suggested is not that, but if hitting a completion key continually will write a line of code, then that is machine generated. Whole-hog templates don't belong in an existing project either. If a tool writes a token without human selection, it isn't original work.

@bagder what do you mean by that?

I thought you gave a nice example in the linked issue.

If someone plagiarizes 24 lines and changes 3, do you think it's still plagiarism?

@bagder i feel like “pretty please don’t use LLMs” is not too bad a policy imo:

  • it shows the project values human contributors, which should help attract human contributors
  • it implies that nobody will be there to help you untangle LLM spaghetti and you can’t hide behind your AI’s output
  • it makes it more likely that a knowing LLM user will pick another project to contribute to
  • IANAL, but it may help the project legally defend itself if plagiarised code is submitted and merged, since it shows an attempt at avoiding plagiarised code
  • though, maybe being more explicit about the goals rather than “pretty please don’t use LLMs” may be better!

    @bagder This is a good question and you raised an important point. Ultimately I agree with Ell1e's suggestion.

    Perhaps it's not wholly enforceable, or wholly definable in easy-to-colour-in lines.

    But i think it is necessary.
    If i were to hazard an answer to your question, I would say LLM code is code generated by a large language model, even if modified.

    There might be concerns about when a contribution from an LLM is meaningful enough to matter, but I think the problem is larger than the output 1/2

    @bagder LLMs are empowered by the theft of others’ work and empower fascism.

    If that is an emotional argument, sure, but it is one the open source community cannot ignore anymore. A programmer should be able to write code without an LLM.

    auto-completes are not quite the same, for we have snippets and such. 2/2

    @bagder
    how about
    "If you send a PR, it's your responsibility to ensure no plagiarism. By the way, here are all the reasons to think that it's impossible to do when using an LLM."
    ?

    I guess it's more CYA-ish than you'd like, but I think it'd discourage use of LLMs without having to define what use of LLMs is.

    @bagder
    Missed opportunity to reply with a big wall of slop from a LLM and a gif of Data from Star Trek reading very fast.  🖖🏻
    @bagder There are so many open questions that a flat-out LLM ban seems bad, as you say in your post.
    But maybe something like "if your contribution turns out to be LLM slop and caused problems, you're banned from contributing to the project."
    The person should have the responsibility for the code, not the tool.
    Although I'd probably ban known LLM agents like Claude to directly make contributions.
    @bagder Of possible relevance: US court rulings contend that AI-generated works aren’t copyrightable here, possibly including works with any AI-generated content.
    @jmwolf @bagder
    And yet there are several U.S. federal and state lawsuits pending that allege copyright violations in LLM training data, including pornographic content.

    @jmwolf @bagder the second part cannot be derived from the first. You can, to my best knowledge, take public domain code and incorporate it into your copyrighted work.

    IANAL and so on

    @bagder feeling sad that you called curl contributors “online randos”. 😄
    One possible gate could be asking for a detailed description via a PR template from new/unknown contributors. People who routinely use an LLM to code would typically(?) also use it to fill in the PR template, which would probably be easier to detect/identify. Again, lots of assumptions, but I agree it’s harder to detect code coming from an LLM that has undergone some manual edits.

    @bagder > I don't think anyone questions that LLMs can produce a lot of junk.

    Oh, but they do! I promise you there are people who think LLMs are infallible. The creator of OpenClaw claimed to make 600 unreviewed commits a day and that none of them were slop. Which is a level of bananas delusion on par with AI psychosis (another group of people who don’t think LLMs produce junk). Also, did you know there are people “dating” their chatbot?

    > Before this, we had people copying bad code suggestions from stackoverflow and the ever present "just confused" people. There are many ways to get bad code.

    With the very important difference that LLMs produce bad code *at scale*. But you know this. It’s only because of LLMs that you had to scrap the bug bounty.

    @bagder to respect not weighing in on a project I have not participated in development on before, here's what I would have replied on the issue:

    I think there are generally three issues raised with contributions (notice I did not say code) that involve the use of an LLM:

  • Quality: LLMs can produce a large volume of output and with an error rate above 0.000%, eventually it will introduce an error that may not have been introduced otherwise.

  • Violating the law or license: LLMs were trained on large datasets which were almost certainly illegally obtained, but which may also contain illegal material that the LLM might reproduce. Here, we define illegal material as anything that, if added to curl in sufficient amount, would make having or using the curl codebase in a way consistent with its license illegal.

  • Enforceability of the license: LLM-produced material is new and it's unclear what the global consensus will end up being regarding its copyrightability and whether licenses for LLM-produced material can even be enforced.

    Although point 1 is often the most talked about, it is the simplest to deal with. LLMs can introduce errors, but they can also spot them. The code review process is meant to find these errors, and if, as time goes on, it proves deficient, then the review process is amended. A balance for quality will eventually be found.

    Point 2 is also alarming, but it mostly follows point 1: the code review process should involve a component of trying to identify re-use of prior art or plagiarism. Good-faith efforts go a long way for open source projects, especially ones as well-run as curl. However, if the project does not require attestation from the submitter of a PR that they did not use illegal content, or content they do not have the rights to contribute, then the project takes that risk on itself. I think asking contributors to attest that their submission is original and that they can legally contribute it to the project is a simple way to reduce the risk of legal issues.
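    As a rough sketch of what such an attestation could look like, here is a hypothetical pull-request template checklist. The file path and wording are my assumptions for illustration, not curl's actual policy:

    ```markdown
    <!-- .github/PULL_REQUEST_TEMPLATE.md (hypothetical example) -->

    ## Contribution attestation

    By opening this pull request I attest that:

    - [ ] This contribution is my original work, or I have the right to
          submit it under the project's license.
    - [ ] No part of this contribution was produced by a generative AI /
          LLM tool, or any such use is disclosed below.
    - [ ] I am not knowingly including material copied from a source whose
          license is incompatible with this project.

    LLM/AI tool disclosure (if any): _none_
    ```

    This mirrors the spirit of the Developer Certificate of Origin (DCO), where a `Signed-off-by:` trailer on each commit serves as a lightweight legal attestation that the contributor has the right to submit the work.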

    Point 3 is the big question mark. Some jurisdictions have created a legal environment where it's not clear LLM output can be copyrighted which has implications as to whether LLM output can even be licensed or if one can enforce any kind of software license for a project with LLM output. This landscape might change drastically and/or rapidly and makes including LLM output a risk, one that I don't see a sufficient mitigation for except to bar submissions that used LLMs and ask contributors to attest that LLMs did not produce the output or meaningfully contribute. I would say using an LLM for reference material or syntax help might be safe, but even with this example, there's no certainty that this would be safe usage.

    “Torrenting from a corporate laptop doesn’t feel right”: Meta emails unsealed

    Meta's alleged torrenting and seeding of pirated books complicates copyright case.

    Ars Technica
    @bagder ell1e sounds a bit like an LLM