Paper page - The Common Pile v0.1
TechCrunch: EleutherAI releases massive AI training dataset of licensed and open domain text. “The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes in size, the Common Pile v0.1 was used to train two new AI models from EleutherAI, […]”
Researchers have created a dataset for training and evaluating language models without intellectual property infringement!
The Common Pile includes diverse text sources such as web content, books, research papers (from PubMed and arXiv; note that arXiv preprints are not peer-reviewed 🤔), and online discussions. It's designed to be high-quality and reproducible for NLP & LLM research.
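If you want to poke at the data yourself, here's a minimal sketch of streaming one Common Pile subset with the Hugging Face `datasets` library. The subset ID ("common-pile/arxiv_papers") and the "text" column name are assumptions; check the common-pile organization on the Hub for the actual dataset names and schema.

```python
# Minimal sketch: stream a Common Pile subset from the Hugging Face Hub.
# "common-pile/arxiv_papers" is an assumed dataset ID and "text" an assumed
# column name; verify both on the Hub before relying on this.
from itertools import islice

from datasets import load_dataset

# Stream instead of downloading: the full collection is ~8 TB.
ds = load_dataset(
    "common-pile/arxiv_papers",  # hypothetical subset ID
    split="train",
    streaming=True,
)

# Peek at the first few records without pulling the whole subset.
for example in islice(ds, 3):
    print(example.get("text", "")[:200])
```

Streaming mode fetches records lazily over HTTP, so you can inspect a multi-terabyte corpus from a laptop without provisioning storage for it.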
Resource:
https://github.com/r-three/common-pile/blob/main/paper.pdf