Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

"The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

One of the most exciting things about Dolly 2.0 is the fine-tuning instruction set, which was hand-built by more than 5,000 Databricks employees and released under a Creative Commons license.

Here's that training set in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrickslabs/dolly/blob/master/data/databricks-dolly-15k.jsonl#/data/databricks-dolly-15k?_facet=category
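The training set is a JSONL file where each record has `instruction`, `context`, `response` and `category` fields — the same `category` column the Datasette Lite link facets on. Here's a minimal sketch of exploring it in plain Python; the records below are invented stand-ins in the real schema (the actual file has ~15,000 human-written rows):

```python
import json
from collections import Counter

# Hypothetical sample rows in the databricks-dolly-15k schema
# (instruction / context / response / category) -- illustrative only.
sample_jsonl = """\
{"instruction": "Who wrote Moby-Dick?", "context": "", "response": "Herman Melville.", "category": "open_qa"}
{"instruction": "Summarize this paragraph.", "context": "Dolly 2.0 is an instruction-tuned LLM.", "response": "Dolly 2.0 follows instructions.", "category": "summarization"}
{"instruction": "List three primary colors.", "context": "", "response": "Red, yellow, blue.", "category": "brainstorming"}
"""

records = [json.loads(line) for line in sample_jsonl.splitlines()]

# Facet by category, like the Datasette Lite view does
by_category = Counter(r["category"] for r in records)
print(by_category.most_common())
```

Swap `sample_jsonl` for the raw `databricks-dolly-15k.jsonl` file to get the real category counts.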

@simon @film_girl Typo in the training set. Is that column the model's response? I wonder where it pulled the misspelled name from.
@ironicsans @film_girl no, none of the data in there is generated by a model - it was all manually entered by Databricks staff members
@simon This is a dataset for use by other product builders. Dolly does not have a user interface like Bing or ChatGPT. Is that right? I love that it's open source! Way more trustworthy output this way.
@Defiance no UI yet but you can call a Python function with a prompt and get back a response, so plugging it into a UI should be pretty easy
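To make "call a Python function with a prompt" concrete, here's a sketch based on the pattern shown in the databricks/dolly repo and the Hugging Face model card — treat the template wording and pipeline details as my approximation rather than gospel. The model call is wrapped in a function because the 12B weights are a very large download:

```python
# Dolly was fine-tuned on prompts in roughly this instruction format
# (approximated from the databricks/dolly repo).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a plain instruction in Dolly's instruction-tuning format."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

def generate(instruction: str) -> str:
    """Load the model (slow, large download) and return its response.

    With trust_remote_code=True the model card's custom pipeline
    applies the instruction template itself, so we pass the raw
    instruction rather than a pre-wrapped prompt.
    """
    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    return generate_text(instruction)[0]["generated_text"]
```

Plugging `generate()` into a web UI or chat loop from there is straightforward.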
@simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
@jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf
@simon Databricks was started by the team that developed Spark, and they have done a lot of work on GPU optimization. You may have better luck firing up a Databricks instance on AWS or Azure.
@simon very cool. I’ve tried a couple of ways of accessing it in Azure so far with no success, but I’ve never tried Hugging Face before this
@simon I ran the same code in a Notebook at ml.azure.com with the smallest compute instance, no GPU, and the result came back after 2,618 seconds, about 45 minutes. “Buzz Aldrin, 1969”
@simon It surprises me, that for something as performance critical as LLMs people use an inefficient language like #Python, where everything, especially GPU access, goes through multiple abstraction layers.
@fell IDK how familiar you are with Python or ML libraries in Python, but for "real" applications (not learning the basics) all of the actual computation of the model is pushed down to native code. Python remains useful as the glue language, as it always has done
@fell Forgive me if this sounds rude, but I don't see the point in learning and using a "glue" language when you could simply use C/C++ straight away. It allows compiler optimisations throughout the entire program, direct access to operating system features like memory mapping, and just less wasted instruction cycles overall. ML is the most intense computing application I can think of, and I really don't get why Python of all things became the de facto standard.
@fell Ah, if you mean why Python is in the position it's in: I think it's mostly not technical, and more cultural and to some extent historical accident. A few motivated ML folks also liked Python and built libraries that were easy to play around with, which others in the community picked up on and built on.
@simon @fell this has been bothering me, too, but I assume Python is serving as a control language for optimized machine code (or should we say “shaders” for GPU) and not where efficiency matters.
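The "glue language" point in the thread above is easy to demonstrate: the Python-level loop below runs instruction by instruction in the interpreter, while `numpy.dot` dispatches the same arithmetic to compiled native code (BLAS under the hood). A minimal sketch:

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
ys = list(range(n))

# Dot product in pure Python: every multiply and add goes through
# the interpreter.
t0 = time.perf_counter()
dot_py = sum(x * y for x, y in zip(xs, ys))
t_py = time.perf_counter() - t0

# Same computation pushed down to native code via numpy.
ax = np.array(xs, dtype=np.int64)
ay = np.array(ys, dtype=np.int64)
t0 = time.perf_counter()
dot_np = int(np.dot(ax, ay))
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py:.3f}s, numpy: {t_np:.3f}s")
```

Both produce the same number; the timings show where the real work happens. PyTorch and friends take this much further, dispatching whole model graphs to GPU kernels.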
@simon I found this Google Colab setup that answers the first-man-on-the-moon prompt in about a second, but only with the short answer. Using the 8GB version. No setup required! https://colab.research.google.com/drive/1A8Prplbjr16hy9eGfWd3-r34FOuccB2c?usp=sharing#scrollTo=qQIBoZHdGF4I

“A really big deal”—Dolly is a free, open source, ChatGPT-style AI model

Dolly 2.0 could spark a new wave of fully open source LLMs similar to ChatGPT.

Ars Technica