Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

"The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2
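As a minimal sketch of what running it involves (mine, not code from the post): the databricks/dolly-v2-12b model card on Hugging Face suggests loading it with the transformers `pipeline` API. The model name, arguments, and hardware assumptions here are based on that card; the 12B weights need a large GPU, and smaller dolly-v2-7b / dolly-v2-3b variants exist.

```python
# Minimal sketch of loading Dolly 2.0 with Hugging Face transformers,
# based on the databricks/dolly-v2-12b model card (an assumption, not
# code from the post). Assumes torch, transformers and accelerate are
# installed and a GPU with enough memory is available.
import torch
from transformers import pipeline

MODEL = "databricks/dolly-v2-12b"  # smaller variants: dolly-v2-7b, dolly-v2-3b


def load_dolly():
    # trust_remote_code=True is needed because the model repo ships its own
    # instruction-following text-generation pipeline.
    return pipeline(
        model=MODEL,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )


# Usage (downloads ~24GB of weights, so it is not called here):
# generate = load_dolly()
# print(generate("Explain the difference between fission and fusion."))
```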

@simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
@jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf