Really happy to see a new #copyleft -based #LLM, and this one seems to be more general-purpose than earlier attempts such as #PleIAs. The #Comma model is trained on #CommonPile, a new training pile with 8 TB of public domain and openly licensed data. huggingface.co/papers/2506.052…

Paper page - The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Can AI be powerful even without copyright infringement? With "Common Pile v0.1", EleutherAI shows what ethical training on 8 TB of free and licensed sources can look like. Is that enough to compete with the industry's big players? Click through and judge for yourself. #EleutherAI #CommonPile #KI 👇
https://www.all-ai.de/news/news24/ki-training-free
Clean, smart, strong: this is how AI training works today

Comma v0.1 shows what is possible with legal data. Is this the end of the copyright debate in AI?

All-AI.de

TechCrunch: EleutherAI releases massive AI training dataset of licensed and open domain text. “The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes in size, the Common Pile v0.1 was used to train two new AI models from EleutherAI, […]

https://rbfirehose.com/2025/06/07/techcrunch-eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/

ResearchBuzz: Firehose | Individual posts from ResearchBuzz

Researchers have created a dataset for training and evaluating language models without intellectual property infringement!

The Common Pile includes diverse text sources such as web content, books, research papers (from PubMed and arXiv), and online discussions. Note that arXiv preprints are not peer-reviewed. 🤔 The dataset is designed to be high-quality and reproducible for NLP & LLM research.

Resource:
https://github.com/r-three/common-pile/blob/main/paper.pdf

#LLM #LargeLanguageModels #Dataset #AIresearch #CommonPile