Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

"The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

One of the most exciting things about Dolly 2.0 is the fine-tuning instruction set, which was hand-built by more than 5,000 Databricks employees and released under a Creative Commons license.

Here's that training set in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrickslabs/dolly/blob/master/data/databricks-dolly-15k.jsonl#/data/databricks-dolly-15k?_facet=category
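The training set is a JSONL file where each record has `instruction`, `context`, `response` and `category` fields — the same `category` column the Datasette Lite link facets on. Here's a minimal sketch of exploring it in plain Python; the records below are invented stand-ins in the real schema (the actual file has ~15,000 human-written rows):

```python
import json
from collections import Counter

# Hypothetical sample rows in the databricks-dolly-15k schema
# (instruction / context / response / category) -- illustrative only.
sample_jsonl = """\
{"instruction": "Who wrote Moby-Dick?", "context": "", "response": "Herman Melville.", "category": "open_qa"}
{"instruction": "Summarize this paragraph.", "context": "Dolly 2.0 is an instruction-tuned LLM.", "response": "Dolly 2.0 follows instructions.", "category": "summarization"}
{"instruction": "List three primary colors.", "context": "", "response": "Red, yellow, blue.", "category": "brainstorming"}
"""

records = [json.loads(line) for line in sample_jsonl.splitlines()]

# Facet by category, like the Datasette Lite view does
by_category = Counter(r["category"] for r in records)
print(by_category.most_common())
```

Swap `sample_jsonl` for the raw `databricks-dolly-15k.jsonl` file to get the real category counts.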

@simon @film_girl Typo in the training set. Is that column the model's response? I wonder where it pulled the misspelled name from.
@ironicsans @film_girl no, none of the data in there is generated by a model - it was all manually entered by Databricks staff members
@simon This is a dataset for use by other product builders. Dolly does not have a user interface like Bing or ChatGPT. Is that right? I love that it's open source! Way more trustworthy output this way.
@Defiance no UI yet but you can call a Python function with a prompt and get back a response, so plugging it into a UI should be pretty easy
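To make "call a Python function with a prompt" concrete, here's a sketch based on the pattern shown in the databricks/dolly repo and the Hugging Face model card — treat the template wording and pipeline details as my approximation rather than gospel. The model call is wrapped in a function because the 12B weights are a very large download:

```python
# Dolly was fine-tuned on prompts in roughly this instruction format
# (approximated from the databricks/dolly repo).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a plain instruction in Dolly's instruction-tuning format."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

def generate(instruction: str) -> str:
    """Load the model (slow, large download) and return its response.

    With trust_remote_code=True the model card's custom pipeline
    applies the instruction template itself, so we pass the raw
    instruction rather than a pre-wrapped prompt.
    """
    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    return generate_text(instruction)[0]["generated_text"]
```

Plugging `generate()` into a web UI or chat loop from there is straightforward.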
@simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
@jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf
@simon Databricks was started by the team that developed Spark, and they have done a lot of work on GPU optimization. You may have better luck firing up a Databricks instance on AWS or Azure.
@simon very cool. I’ve tried a couple of ways of accessing it in Azure so far with no success, but I’ve never tried Hugging Face before this
@simon I ran the same code in a Notebook at ml.azure.com with the smallest compute instance, no GPU, and the result came back after 2,618 seconds, about 45 minutes. “Buzz Aldrin, 1969”
@simon It surprises me, that for something as performance critical as LLMs people use an inefficient language like #Python, where everything, especially GPU access, goes through multiple abstraction layers.
@fell IDK how familiar you are with Python or ML libraries in Python, but for "real" applications (not learning the basics) all of the actual computation of the model is pushed down to native code. Python remains useful as the glue language, as it always has done
@fell Forgive me if this sounds rude, but I don't see the point in learning and using a "glue" language when you could simply use C/C++ straight away. It allows compiler optimisations throughout the entire program, direct access to operating system features like memory mapping, and just less wasted instruction cycles overall. ML is the most intense computing application I can think of, and I really don't get why Python of all things became the de facto standard.
@fell Ah, if you mean why Python is in the position it's in: I think it's mostly not technical, and more cultural and to some extent historical accident. A few motivated ML folks also liked Python and built libraries that were easy to play around with, which others in the community picked up on and built on.
@simon @fell this has been bothering me, too, but I assume Python is serving as a control language for optimized machine code (or should we say “shaders” for GPU) and not where efficiency matters.
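The "glue language" point in the thread above is easy to demonstrate: the Python-level loop below runs instruction by instruction in the interpreter, while `numpy.dot` dispatches the same arithmetic to compiled native code (BLAS under the hood). A minimal sketch:

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
ys = list(range(n))

# Dot product in pure Python: every multiply and add goes through
# the interpreter.
t0 = time.perf_counter()
dot_py = sum(x * y for x, y in zip(xs, ys))
t_py = time.perf_counter() - t0

# Same computation pushed down to native code via numpy.
ax = np.array(xs, dtype=np.int64)
ay = np.array(ys, dtype=np.int64)
t0 = time.perf_counter()
dot_np = int(np.dot(ax, ay))
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py:.3f}s, numpy: {t_np:.3f}s")
```

Both produce the same number; the timings show where the real work happens. PyTorch and friends take this much further, dispatching whole model graphs to GPU kernels.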
@simon I found this Google Colab setup that answers the first-man-on-the-moon prompt in about a second, but only with the short answer. Using the 8GB version. No setup required! https://colab.research.google.com/drive/1A8Prplbjr16hy9eGfWd3-r34FOuccB2c?usp=sharing#scrollTo=qQIBoZHdGF4I

“A really big deal”—Dolly is a free, open source, ChatGPT-style AI model

Dolly 2.0 could spark a new wave of fully open source LLMs similar to ChatGPT.

Ars Technica