Anthropic destroyed millions of print books to build its AI models
Company hired Google's book-scanning chief to cut up and digitize "all the books in the world."
https://arstechnica.com/ai/2025/06/anthropic-destroyed-millions-of-print-books-to-build-its-ai-models/?utm_brand=arstechnica&utm_social-type=owned&utm_source=mastodon&utm_medium=social
@arstechnica I am totally blind. When I scan print books, I often ruin them because I have to either press down on them if I use a flatbed scanner, or hold them open if I use a document scanner. Making books available online (provided they're in accessible formats), means that more won't have to be destroyed and those of us who must rely on screen readers and ocr won't have to spend hours scanning just so we can read the books.

@dandylover1 @arstechnica Have you contacted people who scan books on a large scale, like Carl Malamud's projects?

They have tools that seem to be able to do scanning without destroying the books. And they don't have a lot of money. They are friendly folks who are probably more than happy to share ideas, plans, and know sources for such machines.

@karlauerbach @dandylover1 @arstechnica Specifically, it's usually not one, but two scanners, and instead of scanners they're normal off the shelf digital cameras. Books simply rest comfortably on two plates of glass that meet at an angle, with each camera pointed at its own page, which means there is no need to destroy, or even carefully unbind anything. You just open the book as far as it was designed to be opened in the first place.
@TheRealPomax @karlauerbach @arstechnica I would love something like that. I have a Pearl camera, which has a wonderful stand and guide to help place the book, but it still just lies flat on the table, so if it's prone to closing or doesn't open all the way, I still have to hold it to ensure that everything scans properly.
@dandylover1 @karlauerbach @arstechnica there's a few "how to make one yourself" tutorials out there, but I'll be honest, if I had to make one myself I'd find a much handier friend to make one for me and then pay them in food and/or drinks =D
@karlauerbach @arstechnica No. Most of what I read is in the public domain, and the only books I scan are ones I own, so usually, a little warping to the spine is okay. It only annoys me if the book is an antique. However, this is truly a wonderful source, and I may ask if they can scan Male and Female Costume by Beau Brummell, so that it can be made accessible for all!

@dandylover1 @arstechnica sure but you are one person who only has access to a flatbed scanner.

Industrial scanners exist that can hold a book open at an angle and scan the page while in the book without damaging them. A company with billions in funding can afford that.

@indiealexh @arstechnica I agree about that. Usually, even I use my Pearl camera, which is far less damaging. But if these scanners are available to more than just libraries, museums, and other such institutions, they should definitely use them! Why destroy books if it's not necessary to do so?

@dandylover1 @arstechnica Contrary to quasi-religious belief, it’s absolutely a-ok to destroy a paperback that has many prints in the process of media transformation for accessibility.
It may even save some books from vanishing. I still have tons of obscure, old media that’ll never be made available electronically that I want to transform to save it.
Telling a blind person about destruction-free book scanning of paperbacks is misguided.

(Still: make sure to feed your authors and poets!)

@chris @arstechnica Yes. There is a huge difference between scanning a modern copy of an old book and scanning the original! What drives me crazy is when modern paperbacks are copied from old ones, and instead of actually retyping or scanning and correcting them so that the new one is clean and legible, they just take a picture of it, so that the new hard copy has the same handwriting, fading, discolouration, ripped pages, etc. as the original! It basically makes the book useless to me as an alternative to an online scanned copy in similar condition because my software would scan the printed copy as badly as the pdf!
@dandylover1 @arstechnica That’s a really bizarre, low-effort way of reprinting a book. I can understand that creating a real facsimile has value, but most current books are just a medium to transport the content.
Some are beautiful works of art or craft of course, but most are not. (Which does not lessen the value of their content.)
@dandylover1 @arstechnica Yeah, but that's not what they did.
@arstechnica Even the ones that RFK Jr reads for health and science info. And there's the problem. AI has no discernment.
@arstechnica
Doubtful that even the nazis burned this many books.
@NMBA @arstechnica don't downplay what they did. The intention to censor is far worse than the intention to teach an AI. Still, unless Anthropic makes those scans available to archives for free, I'm mad.

@arstechnica It's not necessary to do this. Machines are available that can scan books without damaging them.

As used for these, for example: https://www.cambridge.org/core/publications/collections/cambridge-library-collection

The machines aren't cheap, mind, and need careful handling.

@TimWardCam @arstechnica I came here to say the same, I'm sure I saw a triangle shaped scanner somewhere once. As an added bonus, I bet it's even faster since you don't have to rip the pages off and you can do two pages at once

@arstechnica
God damn, fuck these people.

(Do not talk to me about how they did what conservators do in preserving content digitally. These people are shite.)

@mtconleyuk @arstechnica and conservators typically avoid damaging the original media or restore the original media.
@arstechnica curious to know what kind of books they fed it, to know what kind of BS will be produced.
@arstechnica Techbros are barbarians. They are neither educated nor intellectual. Musk is not an outlier. He may be the most pungent among them, but in the end, it's just leaves from the same rotten tree ...

@arstechnica I once read this in a science fiction book and thought this was a ridiculous idea by the author.

WTF??

@arstechnica

More proof that LLM people are arseholes of the highest order.

@arstechnica Have you read Vernor Vinge's novel "Rainbows End"?

In that book libraries are digitized by shredding the books into pieces and scanning the pieces while they fall through the air immediately after the shredding.

Materials that are overlooked and not scanned fall into a pre-history category and effectively fall out of human knowledge.

https://en.wikipedia.org/wiki/Rainbows_End_(Vinge_novel)

@karlauerbach

Here to fanboy-gush about this book. Soooo good.

@arstechnica They are in such a hurry, they must destroy data to get to it.

@arstechnica

Having seen many books rot away and be thrown out.
Especially the "useless" ones.

I think the cost of one book being transferred from a physical medium into a digital one is a small price to pay.

And I say that as a book lover.

@n_dimension I have too and I agree with you -- but are these digital books actually made available to anyone?

@arstechnica

@arstechnica @Gargron Cutting up ONE copy of a book in order to scan it is not destroying literature and knowledge, y’all.

It’s OK. There are lots and lots of copies of all the books. It’s not 1580 and books aren’t transcribed by hand.

On the plus side, at least the author made a couple of bucks before getting their work stolen.

@arstechnica @Gargron I can’t even bring myself to write in the margin of a book … in pencil.

@arstechnica

I'm torn two fold by this

As someone whose job it once was to digitize books WITHOUT destroying them, I KNOW it's possible to do this non destructively.

But also that job would have been SO much easier if I could have cut the spine off. And TBH, not every book is a one of a kind precious cultural artifact. We can easily sacrifice a copy of nearly every book that's ever been written especially if it means we can archive the contents and make it accessible to more people.

Though whether or not Google's company here is doing the "make it accessible" part for anything other than AI is not known to me.

Ya gotta get that shit into the computer somehow. You gonna type it in?
@arstechnica Vernor Vinge predicted this would happen in the book "Rainbows End". He even got the year right too. Neato
@arstechnica It's like Savonarola all over again.
@arstechnica books bought in bulk, probably no rare book was destroyed, we create and destroy (hopefully recycle into new books) millions of books every year already.
@arstechnica This surely breaches copyright in many ways: modern day enclosure writ large!