Mastodawn

An Open Training Set For AI Goes Global

https://fed.brid.gy/r/https://www.techdirt.com/2026/03/24/an-open-training-set-for-ai-goes-global/

As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality …

Techdirt

Walled Culture Feb 25

Common Corpus, an open training set for AI, goes global – and so should support for it

As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]

#aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube https://walledculture.org/common-corpus-an-open-training-set-for-ai-goes-global-and-so-should-support-for-it/

Le site de Korben [Unofficial] Dec 24

How do AIs feed on pirated books?

https://web.brid.gy/r/https://korben.info/ia-entrainement-donnees-piratees-books3-common-cor.html

Carlos Solís May 29, 2025

The very first order of work would be to rely on free cultural works, consensually released, as the source of training data, with projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications of straining fair use to the detriment of the commons.

Carlos Solís Feb 14, 2025

Ah, and I was about to download #PleIAs myself to test it. I don't mind the AGPL share-alike restriction; the problem is that the non-commercially licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social

RE: https://bsky.app/profile/did:plc:627gjfohrkofk73ict4hmb6p/post/3lcfu67ppds2n

poritzj Mar 21, 2024

@xolotl @creativecommons
OK, as an end run around the legal problems of LLMs' training corpora, it's a start - but there are jurisdictions (with a strong authors' rights tradition) in which even PD works are legally owed a form of attribution ("PD is basically CC BY" in those places).
So the #CommonCorpus isn't a global legal solution.

Nate Angell Mar 21, 2024

happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also built from openly licensed works (eg, shared with @creativecommons licenses — which are open, but still copyrighted and not PD) https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Common Corpus - a PleIAs Collection

The largest public domain dataset for training LLMs.