I was disappointed to read Cory Doctorow's post where he got weirdly defensive about his LLM use and started arguing with an imaginary foe.

@tante has a very thoughtful reply here:

https://tante.cc/2026/02/20/acting-ethical-in-an-imperfect-world/
A few further comments, 🧵>>

It was particularly disappointing to see Doctorow misconstrue (and thus, if he is believed, undermine) the work that many of us are doing to shine a light on how the ideology of "AI", and the specific ways in which LLMs and other "AI" products are created, do real harm.
>>

I also want to point out (again) the ways in which lumping together all uses of language models (like the lumping of disparate technologies into "AI") obscures the issues at hand.

Language modeling is a useful component of many technologies that can be built without extractive, exploitative means. Take the automatic transcription system built by and for the Māori people -- there's a te reo Māori language model at its core.
>>

And the transformer architecture represented an important step forward in language modeling, one that brought improvements to applications like spell checking (Doctorow's use case).
>>
What we argued in Stochastic Parrots, however, was that you can get those benefits of the transformer architecture without amassing datasets too large to collect with care (meaning consentfully, intentionally, and with the ability to document what's in the data).
>>
@emilymbender "too large to collect with care" resonates strongly.

@emilymbender

"Datasets too large to collect with care" is a great line.

Is there anyone actually doing this?

I know Mozilla is trying this (Common Voice), and they're incredibly unpopular right now. Could we have community support for such an organization?

@gatesvp @emilymbender

You can get community support for just about anything:

  • if it's really in the best interest of the community
  • if it does not intentionally harm others, and adequately compensates those who are harmed by it
  • if you do the necessary work to discuss these things
  • and if you are not primarily motivated by greed, and have suitable defenses against exploitation by those who are.

I'll leave you to ruminate on the failures of Mozilla, a nominally non-profit organization.

@gatesvp @emilymbender As the other reply says, it's all about informed consent and participation (in other words, the exact opposite of what we currently have with LLMs) -- but also, one does not necessarily need enormous data sets to do useful things! Rumors of the demise of "obsolete" technologies like WFSTs (weighted finite-state transducers) have been greatly exaggerated 🙂
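To make the small-data point concrete, here is a spell-correction sketch that needs only a tiny, hand-curated word list. This is a toy stand-in using Python's stdlib difflib, not an actual WFST pipeline, and the lexicon is made up for illustration:

```python
import difflib

# A tiny, hand-curated lexicon stands in for a dataset small enough
# to collect with care. (Toy example; not an actual WFST.)
LEXICON = ["amassing", "transformer", "language", "model", "transcription"]

def suggest(word, lexicon=LEXICON, n=3):
    """Return up to n close matches from the lexicon for a possibly
    misspelled word, ranked by similarity."""
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=0.6)

print(suggest("ammassing"))  # ['amassing']
```

The point isn't the algorithm -- it's that every word in the lexicon can be chosen, documented, and consented to.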

@emilymbender

> without amassing datasets too large to collect with care

I appreciate your post, and I'm a huge proponent for moving past the transformer.

However, how is worrying about this not choosing "cooperate" every single time in the Prisoner's Dilemma when you know Meta & Co will always choose "defect"?

This is worrying to me, because nefarious actors have no incentive to care about these issues (and actually have plenty of incentives to be hostile to these ideas). Also, they have institutional backing to do so.

@budududuroiu In what sense is this choosing 'cooperate' and also where do you see a prisoner's dilemma here?

@emilymbender

The framing is that there is a Commons of human knowledge accessible to virtually everyone (based on how internet protocols currently operate). There's a sort of social contract within this Commons: participants feel ok publishing permissively and bearing the compute costs of serving their work, so that everyone can access the Commons freely.

To 'cooperate' is to abide by this unwritten social contract: respecting /robots.txt, respecting Creative Commons license terms.
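For what "respecting /robots.txt" looks like mechanically, Python's standard library already covers it. A minimal sketch -- the rules and agent name here are made up:

```python
from urllib import robotparser

# Parse a (made-up) robots.txt directly instead of fetching one.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def may_fetch(url, agent="polite-bot"):
    """A cooperating crawler fetches only what robots.txt permits."""
    return rp.can_fetch(agent, url)

print(may_fetch("https://example.org/articles/1"))    # True
print(may_fetch("https://example.org/private/data"))  # False
```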

To 'defect' is to take advantage of this permissiveness to massively scrape the Commons, ignoring the social contract, offloading the costs to the providers in the Commons.

The current SOTA architecture is transformer-based, which requires massive amounts of data to train effectively. By cooperating -- that is, by not training a 'free as in freedom' LLM -- we're 1) losing the benefits of the Commons (as they either get sloppified or people take more information private because of increased compute costs), and 2) not getting to build an artifact based on the knowledge of the Commons that can be contributed back to it (an open LLM).
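The cooperate/defect tension described above can be written down as a standard one-shot prisoner's-dilemma payoff matrix. The numbers below are purely illustrative assumptions, chosen only to satisfy the usual T > R > P > S ordering; nothing in the thread fixes them:

```python
# Row player's payoff for (my move, their move); numbers are assumptions.
PAYOFF = {
    ("cooperate", "cooperate"): 3,  # R: a healthy Commons for both
    ("cooperate", "defect"):    0,  # S: you respect robots.txt, they scrape
    ("defect",    "cooperate"): 5,  # T: free-ride on others' restraint
    ("defect",    "defect"):    1,  # P: the Commons degrades for everyone
}

def best_response(their_move):
    """The move that maximises my payoff against a fixed opponent move."""
    return max(("cooperate", "defect"), key=lambda m: PAYOFF[(m, their_move)])

# Defection strictly dominates -- it is the best response either way --
# which is the 'nefarious actors have no incentive to care' point.
print(best_response("cooperate"), best_response("defect"))  # defect defect
```

Under these (assumed) payoffs, unilateral cooperation is costly unless something outside the matrix -- norms, licensing, enforcement -- changes the game.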

If the DeepSeek moment managed to wipe $600bn off Nvidia's market cap, commoditising LLM training would be the death knell of the AI slop hype race: who would pay thousands in tokens to OAI and Anthropic when you can use a GPL-licensed LLM (or whatever suitable license we come up with)?