Common Corpus, an open training set for AI, goes global – and so should support for it

As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]

#aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube https://walledculture.org/common-corpus-an-open-training-set-for-ai-goes-global-and-so-should-support-for-it/
The very first order of work would be to rely on free cultural works, consensually released, as the source of training data - projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications of straining the fair use policy to the detriment of the commons.
Ah, and I was about to download #PleIAs myself to test it. I don't mind the AGPL share-alike restriction; the problem is that the non-commercially licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social

RE: https://bsky.app/profile/did:plc:627gjfohrkofk73ict4hmb6p/post/3lcfu67ppds2n
@xolotl @creativecommons
OK, as an end run around the legal problems of LLMs' training corpora, it's a start - but there are jurisdictions (with a strong authors' rights tradition) in which even PD works are legally owed a form of attribution ("PD is basically CC BY" in those places).
So the #CommonCorpus isn't a global legal solution.
Happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also built from openly licensed works (e.g., shared with @creativecommons licenses, which are open but still copyrighted and not PD) https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613
Common Corpus - a PleIAs Collection

The largest public domain dataset for training LLMs.