Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

"The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2
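As a minimal sketch of what running it involves (mine, not code from the post): the databricks/dolly-v2-12b model card on Hugging Face suggests loading it with the transformers `pipeline` API. The model name, arguments, and hardware assumptions here are based on that card; the 12B weights need a large GPU, and smaller dolly-v2-7b / dolly-v2-3b variants exist.

```python
# Minimal sketch of loading Dolly 2.0 with Hugging Face transformers,
# based on the databricks/dolly-v2-12b model card (an assumption, not
# code from the post). Assumes torch, transformers and accelerate are
# installed and a GPU with enough memory is available.
import torch
from transformers import pipeline

MODEL = "databricks/dolly-v2-12b"  # smaller variants: dolly-v2-7b, dolly-v2-3b


def load_dolly():
    # trust_remote_code=True is needed because the model repo ships its own
    # instruction-following text-generation pipeline.
    return pipeline(
        model=MODEL,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )


# Usage (downloads ~24GB of weights, so it is not called here):
# generate = load_dolly()
# print(generate("Explain the difference between fission and fusion."))
```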

@simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
@jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf