Adobe in trouble – a lawsuit over the use of pirated books in AI

Can "ethical" AI be built on other people's books? Adobe is now painfully finding out how much a shortcut through someone else's library costs.

Read more:
https://pressmind.org/adobe-w-tarapatach-pozew-o-wykorzystanie-pirackich-ksiazek-w-ai/

#PressMindLabs #adobe #books3 #pozewzbiorowy #redpajama #slimlm

The #RedPajama #LLM is so painfully close to being truly #OpenSource. Just a few tweaks needed:
- Dropping CommonCrawl/C4 entirely
- Fixing the Gutenberg crawler to stick to public domain books
- Filtering arXiv to return only CC-BY(-SA) papers (see the filtering sketch below)
huggingface.co/datasets/togeth…
togethercomputer/RedPajama-Data-1T · Datasets at Hugging Face

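A minimal sketch of the kind of arXiv license filter the post is asking for, using Hugging Face datasets. Note that the `meta`/`license` field names are my assumption, not the dataset's documented schema, so treat this as an illustration of the approach rather than working tooling:

```python
import json
from datasets import load_dataset

ALLOWED = (
    "creativecommons.org/licenses/by/",
    "creativecommons.org/licenses/by-sa/",
)

def is_cc_by(example):
    # Assumption: a license URL is recoverable from the record's "meta"
    # field; the published schema may not carry one at all, in which
    # case the crawler itself would need to capture it.
    meta = example.get("meta") or {}
    if isinstance(meta, str):
        meta = json.loads(meta)
    license_url = meta.get("license", "")
    return any(marker in license_url for marker in ALLOWED)

# Stream the arXiv slice so the 1.2T-token corpus never hits disk at once.
arxiv = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv",
    split="train", streaming=True, trust_remote_code=True,
)
cc_by_only = arxiv.filter(is_cc_by)
```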

🌘 RedPajama-Data-v2: an open dataset with 30 trillion tokens for training large language models - Together AI
➤ The RedPajama-Data-v2 dataset: 30 trillion tokens of open data for training large language models
https://together.ai/blog/redpajama-data-v2
RedPajama-Data-v2 is a dataset of 30 trillion tokens drawn from 84 CommonCrawl dumps covering 5 languages, shipped with 40+ pre-computed data quality annotations that can be used for further filtering and weighting. It is the largest public dataset released specifically for LLM training to date.
+ This dataset is very useful for training language models, providing a large amount of high-quality data.
+ This is a great resource, extremely valuable for anyone researching and developing language models.
#Datasets #LanguageModels #RedPajama
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

Releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.

Together AI
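
A rough sketch of how those pre-computed quality annotations could be used for filtering. The "sample" config and the JSON-encoded `quality_signals` field follow my reading of the dataset card; verify the exact names on Hugging Face:

```python
import json
from datasets import load_dataset

# Stream the small "sample" config rather than the full 30T-token set.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2", name="sample",
    split="train", streaming=True, trust_remote_code=True,
)

def low_perplexity(example, threshold=300.0):
    # quality_signals is (per the dataset card) a JSON string mapping
    # signal names to [start, end, score] spans; ccnet_perplexity is
    # one of the 40+ annotations. Names are from memory - verify them.
    signals = json.loads(example["quality_signals"])
    score = signals["ccnet_perplexity"][0][2]
    return score is not None and score < threshold

filtered = ds.filter(low_perplexity)
```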

Releasing 3B and 7B #RedPajama-#INCITE family of models including base, instruction-tuned & chat models — #TOGETHER

"The biggest takeaway is the demonstration that performant #LLMs can be built quickly by the open-source community. This work builds on top of our 1.2 trillion token RedPajama dataset, EleutherAI’s #Pythia training code, #FlashAttention from #Stanford and #Together, the #HELM benchmarks from Stanford #CRFM and generous support from #MILA, #EleutherAI & #LAION for compute time on the #Summit #supercomputer within the INCITE program award 'Scalable Foundation Models for Transferable Generalist AI'. We believe these kind of open collaborations, at larger scales, will be behind the best #AI systems of the future. "

https://www.together.xyz/blog/redpajama-models-v1
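
Trying one of the released checkpoints should only take a few lines with transformers. A minimal sketch, assuming the RedPajama-INCITE-Chat-3B-v1 model id, its documented `<human>:`/`<bot>:` chat format, and an available GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# The chat checkpoints were tuned on a "<human>: ... <bot>:" turn format.
prompt = "<human>: What is the RedPajama project?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7
)

# Strip the prompt tokens and print only the model's reply.
reply = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(reply, skip_special_tokens=True))
```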

@MoKaKi How nice it would be if open-source language models or German companies could also be taken into account, or even prioritized #AlephAlpha #DeepL #RedPajama

LLaMA rebuilt: RedPajama – the first decentralized open-source AI with an open dataset

The RedPajama project has reproduced LLaMA's training dataset of over 1.2 trillion tokens and is making it available as open source.

heise online

Positive that open-source LLMs and AI like StableLM and RedPajama are gaining traction. They are really important as alternatives to the completely closed and non-transparent solutions from Microsoft, Google, and OpenAI.
https://github.com/stability-AI/stableLM/
https://www.together.xyz/blog/redpajama
#AI #LLM #StableLM #RedPajama #Opensource

@survey I'm excited about Large Language Models and open source. This isn't the best example, but #RedPajama: https://news.ycombinator.com/item?id=35600860
RedPajama: Reproduction of LLaMA with friendly license | Hacker News

NEW #LLaMA Rebuilt From Scratch - FULL #OpenSource

https://www.youtube.com/watch?v=uF86vcwM6Js

A video for everyone who is too lazy to read the announcement themselves (like me lol).

https://www.together.xyz/blog/redpajama

#AI #LLM #GPT4 #Together #RedPajama

Very interesting claims from #RedPajama. It seems they are about to build a competitive LLM from scratch, with everything needed to train these models fully reproducibly from open training data. If true, this is highly relevant for FAIR research on and with LLMs.

"The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components:

- Pre-training data, which needs to be both high quality and have broad coverage
- Base models, which are trained at scale on this data
- Instruction tuning data and models, which improve the base model to make it usable and safe

Today, we are releasing the first component, pre-training data."

Source: www.together.xyz/blog/redpajam…
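
For anyone who wants to poke at the released pre-training data, a minimal sketch using the smaller sample mirror on Hugging Face; the dataset id and the `text`/`meta` fields follow the published dataset card as I remember it, so verify them before relying on this:

```python
from datasets import load_dataset

# The full 1.2T-token corpus is huge; the published sample mirror is
# enough to inspect the schema and the per-document provenance.
sample = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train", trust_remote_code=True,
)

doc = sample[0]
print(doc["meta"])        # provenance of the document (source slice, URL, ...)
print(doc["text"][:500])  # first 500 characters of the document
```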

RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens — TOGETHER

RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.

TOGETHER