Harvard wants you to think their 242-billion-token dataset is the new "Library of Alexandria" 📚, but it's really just a glorified spreadsheet with more footnotes than a law textbook. 🙄 Thank the Simons Foundation for funding this academic snoozefest, where "usability" means getting lost in a maze of search bars and navigation menus. 😂
https://arxiv.org/abs/2506.08300 #HarvardDataset #LibraryOfAlexandria #AcademicSnoozeFest #SimonsFoundation #DataUsability #HackerNews #ngated
Harvard and Google Release AI Training Dataset with Public Domain Books, Raising Copyright Questions: Self-Publishing News with Dan Holloway https://selfpublishingadvice.org/ai-training-dataset/ #copyrightconcernsinAI #GoogleAIcontribution #OpenAIandMicrosoft #publicdomainbooks #Harvarddataset #AItraining #News

Harvard announces a dataset of 1 million public domain books for AI training, aiming to address inequities in access to high-quality data.

The Self-Publishing Advice Center