Yesterday Cory Doctorow argued that refusal to use LLMs was mere "neoliberal purity culture". I think his argument is a strawman, doesn't align with his own actions, and delegitimizes important political actions we need to take in order to build a better cyberphysical world.

EDIT: Discussions under this are fine, but I do not want this to turn into an ad hominem attack on Cory. Be fucking respectful

https://tante.cc/2026/02/20/acting-ethical-in-an-imperfect-world/

Acting ethically in an imperfect world

Life is complicated. Regardless of what your beliefs or politics or ethics are, the way that we set up our society and economy will often force you to act against them: You might not want to fly somewhere but your employer will not accept another mode of transportation, you want to eat vegan but are […]

Smashing Frames

@tante

I really like and admire @pluralistic and have utmost respect for him, and that's why I'm totally baffled about why he is claiming "fruit of the poisoned tree" arguments are the cause of LLM scepticism.

The objections to LLMs aren't about origins but about what they are doing right now: destroying the planet, stealing labour, giving power over knowledge to LLM owners etc.

The objections are nothing to do with LLMs' origins, they're entirely about LLMs' effects in the here and now.

@FediThing @tante @pluralistic Some people - in fact quite a lot; if my reading is correct - do indeed argue that LLMs can *never* be ethically used because they are “trained on stolen work”.

@ianbetteridge @FediThing @tante

Performing mathematical analysis on large corpora of published work is not "stealing."

@pluralistic @ianbetteridge @FediThing @tante

It's still a profit-loss damage, curable by an income transfer, if the illegally acquired data was used to create that profit. A dataset's prominence should determine its percentage of the profits, where prominence is a function of data size but also of its causal weight in inference. The primary literature should not be dilutable with free intellectual property.

I don't know if any of this is actual case law and I'm not a lawyer.

@drdrowland @ianbetteridge @FediThing @tante

You're talking about ways of using models, not the creation of models. It's possible to make a model that does illegal things. But training a model is not illegal.

@pluralistic @ianbetteridge @FediThing @tante It really depends a bit on the details, doesn't it. If I copy a CD, I also perform some mathematical analysis on it, error checking etc. Maybe I even make a non-exact copy by passing it through some filter to make it sound better. But it's totally different from listening to a bunch of Beatles songs and then getting inspired to write my own songs in a similar style.
@pluralistic @FediThing @tante I agree - but that is the (incorrect imho) argument they make.

@ianbetteridge @pluralistic @FediThing @tante

Again, great respect to Cory, he's been deadeye sharp in Pluralistic but...

Once some authors cut an admittedly problematic deal with AI companies to settle their losses, I think the ship has sailed on whether it's theft or not.

O'course much more damaging is the whole 'burning the world for profit and trying to elect conservative-loving AI bros to help them keep doing it' argument.

The tech may be good. The cost it extracts? Nope.

@pluralistic @ianbetteridge @FediThing @tante “Mathematical analysis” is doing a lot of work here. It could mean gathering meaningless statistics. Or it could mean capturing the qualities (deviations from the average) that make a particular work of art (or author) special, creative, surprising—for use in simulacra.

I think that's harmful, to the culture as a whole, if not to the artworks and artists getting regurgitated.

@gleick @ianbetteridge @FediThing @tante

Let's stipulate to that (I don't agree, as it happens, but that's OK). It's still not a copyright infringement to enumerate and analyze the elements of a copyrighted work.

For the record, I think AI art is bad and neither consume nor make it.

@pluralistic @ianbetteridge @FediThing @tante I'm not claiming that's copyright infringement. Even if one respects the general framework of copyright, which I know you don’t, it seems hopeless to apply it to this AI mess.

But there is a kind of theft here. Not that it's actionable or measurable. But it’s nontrivial. It's related to questions of impersonation. It's an assault on individuality. Whatever your reasons for thinking AI art is bad (I have some sense), it's related to that, too.

@pluralistic @ianbetteridge @FediThing @tante Some authors have taken the view that they deserve some compensation for the use of their books in training the LLMs. Do the transaction and we're hunky-dory. That's not my view. I don't care about compensation; I just don't want my prose regurgitated in the LLMs, for reasons I'm not yet able to express properly. I feel I should have been asked, and I feel violated.

@gleick @pluralistic @ianbetteridge @FediThing @tante I think the sense of “theft” that creators feel is directly caused by the fact that the AI industry (as it stands today) is a Ponzi scheme which is fundamentally built on remixing creators’ works and devaluing human labor. I have a feeling that most creators will not feel the same kind of outrage if an educational institution created the same technology for academic use, e.g. to generate insights into online culture and psychology.

In short, the GRIFT (i.e. the particular application of the technology) is the source of the feeling of theft, not the technology itself. I think the tech itself has value when used ethically.

FWIW I agree with Cory here that copyright is the *wrong* framework to use for criticizing AI, because for every case where copyright helps the individual creator, there are hundreds of cases where it helps incumbent megacorporations more.

#ai

https://www.humancode.us/2024/05/15/copyright-ai.html

Copyright will not save us from AI

humancode.us

@drahardja @pluralistic @tante @FediThing @gleick @ianbetteridge

I think there's a couple of aspects to the "theft":
* the theft of material: they're trained on copyrighted material
* the theft of jobs: AI is being used to replace artists/writers/coders; it's the same thing that upset the Luddites
* the theft of style: not only does AI "learn" from the works of others, it can emulate them. On demand. Some artists have very unique, personal styles that are suddenly not their own anymore.

@mcv @pluralistic @drahardja @tante @FediThing @gleick I mean, *I* was trained on copyrighted material. So were you. So is everyone. I even regurgitate phrases I’ve read, usually unknowingly.

@ianbetteridge @pluralistic @drahardja @FediThing @gleick @tante

That is true, and probably the strongest argument in defence of LLMs. But LLMs have more explicitly encoded the material they're trained on, and are better able to reproduce it than we are. Still, I think copyright is by far the weakest of the three types of theft. And I think the duplication of specific, personal styles is probably the most personal and invasive. In a way it's kind of the same thing, and yet it feels very different to me.

The theft of jobs causes the most damage, but is also kind of unavoidable with many technological advances.

@mcv @pluralistic @tante @FediThing @gleick @ianbetteridge Again, the concept of a technology that is “trained” by analyzing copyrighted material is not inherently bad. It’s the way that it is *developed and used* that could be morally questionable.

I bet most creators don’t mind people using AI technology to analyze and remix their works for academic or historical research, or even for search. What they mind is their use to power a Ponzi scheme that destroys human worth.

@mcv @pluralistic @drahardja @tante @FediThing @gleick The problem with the "theft of style" argument – and I understand it – is that if you apply the same rules and standards to humans, then a lot of the so-called creative industries – and individual creators – would be sweating their way through court cases.

Having been threatened by Piet Mondrian's estate over a magazine cover which looked a bit Mondrian-esque, I know how style can be protected, too :) And again, whether someone creates the offending work by AI or Photoshop should make no difference.

(They were actually very nice about it, and it was more a "please don't do that again" than "your court date is next week", but still not the most fun letter I've read.)

@gleick @pluralistic @ianbetteridge @FediThing @tante It's the regurgitation part. People read your work all the time, and are inspired by it to create their own works. I'm sure you're fine with that. But when they read your work and proceed to regurgitate chunks of it and claim it as their own? Because that's what the LLMs are doing all too often, and the reasons to object to LLMs doing it are the same as the ones to people doing it.

@gleick

I agree with you, I feel the same way and I have not been published.

I’m really enjoying The Three Ages of Water. I slowed down my read to really enjoy your writing and the amazing breadth of knowledge in it.

@schmubba Thank you, and I liked it, too, but I didn't write it. That was @petergleick.
@gleick @petergleick
Boy do I feel like an idiot 🙃
Thanks for writing a great book Peter. I enjoy your posts also James.
@pluralistic @gleick @ianbetteridge @FediThing @tante there have been documented cases of LLMs regurgitating stuff from their training set verbatim, which clearly IS copyright infringement; and that means some parts of the training set are encoded in the weights of the model, which looks like publishing a copyrighted work to me. If publishing a JPEG of an image without holding the copyright to it would be infringing, isn't publishing a model that can recreate something also infringing?

@gleick @ianbetteridge @FediThing @tante

BUT I'm also still a fan of @pluralistic in general, although I disagree with him on some points (such as this); we have more in common than divides us, and I see too many people totally reject somebody over one thing. Sure, if that one thing is nazism, sexism, selfishness, etc - they can go straight in the bin. Something I hold a hope of arguing them around on, however, isn't cause for cancelling :-)

@pluralistic @ianbetteridge @FediThing @tante If that “mathematical analysis” regurgitates near verbatim works created by other people, it certainly is committing IP theft, and LLMs will happily do that. The “mathematical analysis” is effectively a form of lossy compression on its training data which a prompt can later extract.

@bjn @ianbetteridge @FediThing @tante

Once again, you're talking about *using* a model, not training a model.

Also "IP theft" isn't a thing. Perhaps you mean copyright infringement?

@pluralistic @ianbetteridge @FediThing @tante I’ll give you pedant points for copyright infringement, which is what most people mean by “IP theft”. As for training/using, the difference is somewhat moot. The models are trained to be used, and if trained on copyrighted data without a license, you’ve encoded that data into the model which might then regurgitate it thus facilitating copyright infringement.

@bjn @ianbetteridge @FediThing @tante it is a bedrock of copyright law that devices 'capable of sustaining a substantial non-infringing use' are lawful. Decided in 1984 (SCOTUS/Betamax) and repeatedly upheld.

It is categorically untrue that merely because a model's output can infringe copyright that the model is therefore illegal.

There's not much that's truly settled in American limitations and exceptions, but this is.

@bjn @ianbetteridge @FediThing @tante 'facilitating copyright infringement' just isn't a thing.
@bjn @ianbetteridge @FediThing @tante and as befits UK fair dealing (and related limitations and exceptions), we've had opinions from IPREG affirming that training a model doesn't infringe.
@pluralistic @ianbetteridge @FediThing @tante Then the laws are not fit for purpose. The whole point of copyright is to encourage people to produce works by being sure they get the benefit of those works. If my works can be encoded into a bunch of matrix weights and reproduced without attribution, let alone financial recompense, then why should I bother? Google is doing its best to effectively steal the bread out of creators' mouths with its AI summaries. It may be legal, but it stinks.
@bjn @ianbetteridge @FediThing @tante by all means say 'i don't like this technology' but don't conflate that with 'therefore it is illegal'
@pluralistic @ianbetteridge @FediThing @tante Well, apart from Anthropic having to pay $1.5B for copyright infringement, it's all above board 🙄. It's not a matter of liking the technology or not; machine learning is capable of cool and useful things. However, how LLMs are being used and pushed is both immoral and culturally destructive. I'm surprised you are buying into it.

@bjn @pluralistic @ianbetteridge @FediThing @[email protected] I don’t like what Cory wrote, and I don’t think he’s “buying into it” either.

His explanation is like the one in treaties that prohibit the use of biological weapons, but not their research, development and storage.

Maybe a better question is “Can we protect the creation of cultural artifacts by copyright law?”

Not by these standards, it seems.

@wtrmt @bjn @ianbetteridge @FediThing @pluralistic As a personal example for using technology one prefers not to use, although I do everything else on Linux, I use a piece of proprietary Windows software for my FLOSS translation work, because it means that I can produce a massively bigger amount of UI translations in higher quality than I could produce with Linux tooling. So, I understand that part of the reasoning.
🧵
@wtrmt @bjn @ianbetteridge @FediThing The local LLM is doing things for @pluralistic that hunspell can't do. One way to fix that could be to publish new articles to pluralistic.net only and wait until Gregory and 9o6 are done nitpicking, then publish to the other channels 1 day later?
@wtrmt @bjn @ianbetteridge @FediThing @pluralistic There's definitely a distinction between what is legal and what is moral, and how we see those two things also depend on our culture and can evolve over time.

@wtrmt @bjn @ianbetteridge @FediThing @pluralistic Of course, the massive ingestion of other people's work isn't the only problem, and @pluralistic is already aware of this - the Reverse Centaur problem is mentioned in his article. We have unemployment, deskilling and pollution of our information space caused by the usage, but even more critically, accelerated environmental destruction caused by the training that will still be with us for centuries.

/🧵

Sparked by a discussion elsewhere on phone predictive texting, I think having predictive texting available on a PC in combination with a spell checker might fit Cory's needs even better than an LLM. This way, you can spot your mistakes immediately while you type.

@wtrmt @bjn @ianbetteridge @FediThing @pluralistic

I have this functionality available when translating with MemoQ and it saves so much time, especially as I translate software which has a lot of recurring phrasing. It will pop up a selection that I can choose from via mouse click or keyboard navigation, or I can just ignore the suggestion.

Wouldn't it be great to have this available in @libreoffice ?

@wtrmt @bjn @ianbetteridge @FediThing @pluralistic

@gunchleoc @bjn @ianbetteridge @FediThing @pluralistic I studied illustration at college. I wouldn't recommend that any kid major in that now, no matter how good they are. There are no entry or mid level jobs for them.

How much time till Miyazaki and Ghibli are made irrelevant by the sheer volume of slop, soon in movie length?

Big actors can protect their image, but how about entry level actors and extras?

Out with all of them.

@gunchleoc @bjn @ianbetteridge @FediThing @pluralistic the same tools that are useful to you are used by big corporations to eliminate jobs.

Mercado Libre —a huge online retailer in LATAM, owned by eBay— fired its entire team of User Experience Writers last month and replaced them with an LLM. The more than 120 UXWs didn't see it coming.

Will the LLM do a better job? Nope, and those designers will not find a job doing that again.

@wtrmt In the translation business, they squeeze the rates by pre-translating by LLM and turning everybody into proofreaders on text that looks right but is often slightly off.

This is not what I was talking about - I was talking about traditional Translation Memories. They get trained on your own, personal work or your team's work only. The translator is still doing the work, but I no longer get RSI from lots of manual copy/paste.

@bjn @ianbetteridge @FediThing @pluralistic

@gunchleoc @bjn @ianbetteridge @FediThing @pluralistic yes, LLMs can be used in many ways, and the impact on their application is wide ranging: my sister in law is a freelance certified legal translator, and she no longer has a job.

Who needs legal translators, movie extras, set decorators, designers, illustrators, technical writers, architects, middle managers, artists, writers?

All of them will become burger flippers for all we care.

@gunchleoc @bjn @ianbetteridge @FediThing @pluralistic this is impacting people all over the world, in many creative and technical fields. In other countries it does look like a new wave of colonialism, that now comes to eliminate your work and culture.

In the streets of my neighborhood in Santiago I see slightly different AI slop images promoting things on the sidewalks. Those used to be made by hand, on chalkboards, 2 years ago.

@wtrmt LLMs used for Legal translation? 😱

That's asking for real trouble.

@bjn @ianbetteridge @FediThing @pluralistic

@gunchleoc @bjn @ianbetteridge @FediThing @pluralistic poring over that boring legalese? who cares! It’s way cheaper! Instantaneous!

Oops! The document was badly translated and the visa was rejected. Who's responsible? Not the LLM.

@bjn @pluralistic @ianbetteridge @FediThing Americans laugh at the legal efforts in France to preserve trades, and at the same time they have been trying to bring back manufacturing industries that required decades to build and that now are in China and other countries.

Those industries need people with knowledge and creativity that the US neither has nor cares for.

Now ChatGPT has come for the service economy: they don't care for that either.

@bjn
> Then the laws are not fit for purpose. The whole point of copyright is to encourage people to produce works

No, it exists to protect Netflix from you, and it's a perfect fit for that purpose.

@pluralistic @ianbetteridge @FediThing @tante

@pluralistic
> IPREG affirming that training a model doesn't infringe.

What, we now take the party line seriously?

@bjn @ianbetteridge @FediThing @tante

@pluralistic
> untrue that merely because a model's output can infringe copyright that the model is therefore illegal.

Mhmmm, naaah: overfitting and memorization are very much a thing, especially in the case of LLMs, where they've completely given up on controlling data leaks and where memorization has been demonstrated rather unambiguously, e.g. with the suitesparse example...

Not to imply that "illegal" is bad, of course, or that copyright is justifiable.

@bjn @ianbetteridge @FediThing @tante

@nobody @pluralistic @bjn @FediThing @tante Memorisation is very definitely a thing for humans too – ask the ghost of George Harrison, who unconsciously regurgitated "He's So Fine" as "My Sweet Lord".

And notably – he got sued for it, and lost, *despite* everyone's acceptance that it wasn't deliberate.

If an LLM regurgitates substantive parts of a work, meeting the legal bar of what would land a human in court, there should be no legal difference – the human who prompted that creation could be sued.

@ianbetteridge
Oh yes, absolutely. With all the obvious _dissimilarities_ between human brains and LLMs that the singularity-cultists choose to ignore, I think many here on the fedi deliberately underestimate how remarkably similar to computers we have actually just confirmed ourselves to be, and how comprehensively this damages the premises of copyright and IP law.
@pluralistic @bjn @FediThing @tante

@ianbetteridge @nobody @pluralistic @FediThing @tante

So it turns out LLMs now happily regurgitate great chunks of copyrighted works, with Anthropic's model generating near-verbatim the entirety of Harry Potter and the Philosopher's Stone. Not what most people would call "fair use", and possibly not what the courts will call it either, some time soon.

https://arstechnica.com/ai/2026/02/ais-can-generate-near-verbatim-copies-of-novels-from-training-data/

AIs can generate near-verbatim copies of novels from training data

LLMs memorize more training data than previously thought.

Ars Technica

@pluralistic @bjn @ianbetteridge @FediThing @tante

I’d argue that it’s a bit more nuanced. Training and inference are two separate stages with their own rules.

For non-profit, academic research, excerpts are allowed to be collected, but not the whole work. You still can’t circumvent DRM either.

It might be argued that Llama is non-profit, but lifting whole works to train on still isn’t allowed.

@pluralistic @bjn @ianbetteridge @FediThing @tante Whisper from OpenAI was trained illegally on YouTube data. Google knew it, but didn’t want to risk a ruling that might create a “training” carve-out that would prevent them also training on YouTube content.

You could argue that the training itself was legally grey, but the dataset creation certainly wasn’t. OpenAI explicitly did not have a license for that data.

@secretbatcave @pluralistic @bjn @FediThing @tante This is why a lot of this is the companies eating themselves. The vast amount of economic value which is "destroyed" by AI rests with enormous corporations, not individual creative people.

@ianbetteridge @pluralistic @bjn @FediThing @tante inference is another bag.

To use the Betamax argument: recording for your own use is fair game, *distributing* and/or selling it afterwards isn’t.

Knowingly allowing copyright infringement is risky.

*but*

The key issue here is that the people who would be kicking up a stink (publishers of music, news and TV) are all keen to use AI to reduce costs.

Plus Google’s anti-copyright campaigning has taken hold.

@ianbetteridge @pluralistic @bjn @FediThing @tante So unlike Napster, where millions/billions were spent to create and enforce new copyright mechanisms, nothing comparable has been spent here.
@bjn @pluralistic @ianbetteridge @FediThing @tante I would think it’s plagiarism more than copyright infringement
@ianbetteridge @FediThing @tante @pluralistic do you think OpenAI is legally and ethically in the right to use published works that are CC-BY-NC, for example?