Seeing more programmers take the stance of "the plagiarism machine works, so I guess we have to accept it".

Contrast that with art and creative industries, where theft and plagiarism aren't new phenomena. Unions keep these industries alive. And it's laughable to imagine telling an artist that they have to accept a plagiarist in their communities.

Easy to say "let's all be nice to each other" when you're well paid and haven't lost your job yet. Some tough lessons will be learned soon.

You know how you fight corporations? With unions. Picket lines. New contracts blackballing plagiarists. Older workers walking out to support their (financially insecure) younger peers. That's what past generations did to maintain the world we enjoy today.

You do not fight by saying "plagiarists are people too" while refusing to picket or walk out.

When you take this stance, you're trying to sound nice, but you're naive -- you're collecting your paycheck and condemning the next generation.

And as an aside: who cares if the plagiarism machines are getting better? I should hope a plagiarist can produce a final product.

It's not a grand observation that theft and plagiarism will help you accomplish a task faster. It's not new information that stealing things is cheaper than making them.

I do not care if Claude produces perfect code. It doesn't, and it won't, but even if it did I would not use it. Because I'm not a fucking plagiarist. And you shouldn't be either.

The action item you can take away from this rant: call tools like Claude what they are: plagiarism machines.

That's not a dig -- that's a more accurate description than "artificial intelligence".

When your coworker argues that it's merely automation, be the nerd who corrects them with "automated plagiarism".

Normalize describing gen AI accurately.

I strongly believe if more people understood how it works they would not use it.

@protowlf Do you have a good source (ironically) that sets out the reasoning behind why we should understand what's going on inside one of these models while code is coming out as "plagiarism"?

They absolutely will spit out whole chunks of code that one can point to and say "that was copied from such-and-such without following the license".

But they also don't *always* do that. Are they *always* plagiarizing, or only sometimes?

@protowlf I looked and found https://lawreview.uchicago.edu/online-archive/plagiarism-copyright-and-ai but that is set in the academic and legal sphere, where a whole idea or concept being spit out without the citation to its originator constitutes "plagiarism".

I assume I'm not guilty of plagiarism in the computing field every time I apply a facade or visitor pattern without citing the Gang of Four.

Plagiarism, Copyright, and AI | The University of Chicago Law Review

Critics of generative AI often describe it as a “plagiarism machine.” They may be right, though not in the sense they mean. With rare exceptions, generative AI doesn’t just copy someone else’s creative expression, producing outputs that infringe copyright. But it does get its ideas from somewhere. And it’s quite bad at identifying the source of those ideas. That means that students (and professors, and lawyers, and journalists) who use AI to produce their work generally aren’t engaged in copyright infringement. But they are often passing someone else’s work off as their own, whether or not they know it. While plagiarism is a problem in academic work generally, AI makes it much worse because authors who use AI may be unknowingly taking the ideas and words of someone else. Disclosing that the authors used AI isn’t a sufficient solution to the problem because the people whose ideas are being used don’t get credit for those ideas. Whether or not a declaration that “AI came up with my ideas” is plagiarism, failing to make a good-faith effort to find the underlying sources is a bad academic practice. We argue that AI plagiarism isn’t—and shouldn’t be—illegal. But it is still a problem in many contexts, particularly academic work, where proper credit is an essential part of the ecosystem. We suggest best practices to align academic and other writing with good scholarly norms in the AI environment.

@protowlf I can see this making sense by a sort of "conservation of thought" argument: if those big matrices are definitionally devoid of the spark of original thought, then anything that comes out is definitionally derivative, copied from somewhere even if the sources are so scattered and numerous that they are permanently unidentifiable.

@Forbearance @protowlf The training data is full of open-source code that has software licenses attached to it (GPL, MIT, etc.). These licenses have terms for what you can or cannot do with the source of software, or things you must do to be in compliance.

For example, the MIT license (one of the most "generous" popular licenses) requires people making use of the code to distribute a copy of the license alongside their software. That's similar to giving credit in the art world.

@Forbearance @protowlf

While large companies and their lawyers will argue otherwise, slurping up all this code and ignoring the licenses it comes with is effectively stealing it.

@agersant @protowlf

The MIT license grants the right to "deal in the Software without restriction", though, provided that "[t]he above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

Is that happening when you throw a lot of MIT-licensed code and their license documents in a blender and come out with a bunch of matrices? If the code is encoded in the matrices, the license probably is too.

@agersant @protowlf

When you bring in the GPL, you get the question of whether all those model weights are "software", or whether they in some sense form a single program with all the actual code around them.

@agersant @protowlf

And of course the voracious AI trainers scoop up a lot of code and other things that they *wouldn't* be allowed to just redistribute as-is, things like freeware or non-free software, and those all get thrown in their pot too. If you can't publish a .zip of it that someone can extract with WinZip, you probably can't publish a giant matrix that someone can extract it from by asking nicely.

@agersant @protowlf

Traditionally, doing things like making tables of word trigrams and other stuff broadly under the header of "statistical analysis" has been unregulated by copyright law, and has not required any sort of license, on the theory that it's "just" math and not really anything like copying.

But this way of thinking long predates the current popularity of math that steals your ideas, and might not be adequate to the present moment.
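To make the "trigram table" idea concrete, here's a minimal Python sketch (my illustration with a made-up toy corpus, not something from the thread). The point worth noticing: the table stores only word-adjacency statistics, yet those statistics can regurgitate fragments of the source verbatim.

```python
from collections import defaultdict

def build_trigrams(text):
    """Record which words follow each pair of adjacent words."""
    words = text.split()
    table = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        table[(a, b)].append(c)
    return table

# Toy "training corpus" -- a stand-in for scraped text.
corpus = "the cat sat on the mat and the cat ran"
table = build_trigrams(corpus)

# "Just statistics" -- but the continuations come straight from the source.
print(table[("the", "cat")])  # ['sat', 'ran']
```

Chaining lookups from such a table (pick a pair, sample a continuation, slide forward) reproduces runs of the original text, which is the tension the copyright framing has to grapple with.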

@Forbearance @protowlf my thoughts exactly 💯

@agersant @Forbearance

What agersant is saying about licenses is a pretty huge piece of this.

To answer your question directly ("are they always plagiarizing, or only sometimes?"), the short answer is yes: fundamentally, a machine learning model can only output what was put into it.

But there's a layer of abstraction with LLMs, where it isn't just raw blocks of words being processed, it's relations between words.

@agersant @Forbearance

This is a simplification, but IMO it hits the fundamental point: when a code LLM "invents" something new, it only appears new because the final words didn't happen to line up with the word relations it necessarily plagiarized to produce its output.

But that raises another question: if an LLM is "only" scraping my code to adjust weights in a model, is that still stealing? I mean, it didn't take my code blocks, it just took the relations between words.

@agersant @Forbearance

And my response to that is best illustrated with an analogy to image file types:

If you download my jpg and save it as a png, you now have a 100% completely different file from mine. It's all new data!

But obviously you still took my image. The meaning of the data is what matters.

LLMs do not have a consciousness, and their output has to be a function of their inputs. When my code was scraped, its meaning was stored in a roundabout format, but ultimately it was stored.
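A tiny Python sketch of that same point, substituting text encodings for image formats (my substitution; the posts use jpg/png): the same content stored two ways is byte-for-byte "all new data", but decoding recovers the identical meaning.

```python
# Same content, two encodings: every byte differs,
# but the meaning of the data is unchanged.
message = "my original work"
utf8_bytes = message.encode("utf-8")
utf16_bytes = message.encode("utf-16")

print(utf8_bytes != utf16_bytes)                # True: "100% different file"
print(utf16_bytes.decode("utf-16") == message)  # True: still my work
```

By the same logic, re-representing scraped code as model weights changes the format, not whether the underlying work was taken.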

@agersant @Forbearance

Sorry, I edited these posts to hell and back after posting them. Apologies if you read them once and now they are different. I'm done 😅

That's my attempt at explaining my understanding of the topic. If you want a better, more detailed explanation of how LLMs work from someone smarter than me, I would point you toward this 3Blue1Brown video: https://www.youtube.com/watch?v=wjZofJX0v4M

Transformers, the tech behind LLMs | Deep Learning Chapter 5

@Forbearance @protowlf Someone studied how well coding agents solve simple programming problems in esoteric languages like Whitespace and Brainfuck. The agents all fail at things like "compute n factorial for n < 10". This seems to me a pretty strong indication that the coding agents don't have any significant "reasoning" ability; they mainly work by regurgitating semantically-similar-shaped things from their training data.
https://arxiv.org/abs/2603.09678
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.

@protowlf The sad thing is that a lot of programmers don't see programming as creative work. They don't care about copyright, licenses, or actual skill, because the hype man told them that "AI is the future".

@protowlf And the customers don't care either.
Take Steam, for example: if you use AI where the user can see it (textures, models, story, etc.), you have to disclose it on your store page. If you vibe-code the whole thing, you don't have to disclose anything.

People don't view code as art, they view it as a necessary evil.

@protowlf People in tech are going to learn the harsh lesson that people in so many other fields have learned: rolling over and showing your belly to the investor class and their new toys isn't going to get you a belly rub for being such a good boy. All you're going to get is more work piled on top of you, or a swift kick to propel you out the door because you can't defend yourself.