A new paper from Meta about self-alignment of #LLM #chatbots. #SelfAlignment refers to any training process where a model is used to fine-tune its own responses towards some aspect of desirability. In this instance the goal is simply to make the answers generally better by incorporating more information scraped from the web. They gather a large number of text passages from the web and then ask the model what question each passage would be a good answer to. This yields a lot of synthetic question-answer pairs, many of them very bad examples. They further self-curate by asking the model itself to score these pairs and removing the bad ones. Then they fine-tune and get a model which has incorporated more knowledge, supported by itself and a lot of unstructured web data.

Some details:

🧐 They still need human-annotated high-quality examples as a basis. This is analogous to taking examples from a better LLM chatbot like GPT-4, except that GPT-4 is cheaper than humans and could in principle also use distillation from embeddings (if such were available, which they aren't) in addition to plain example answers. But when fine-tuning the largest models there are no larger models to distill from, except possibly humans, and all these supervised learning approaches only approach human level asymptotically.

🧐 It is beneficial to tag the synthetic question-answer pairs with a different system prompt during training, possibly because this lets the model learn the somewhat different distributions of the two kinds of data. They found that at inference time it can help to concatenate *both* system prompts. This looks like score hacking to me, analogous to trying out millions of prompt strings and choosing the one with the highest score.
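The backtranslate-then-curate loop described above can be sketched roughly like this. Note this is my own minimal sketch, not the paper's code: `generate` is a placeholder stub standing in for a real LLM call, and the prompt strings and the 1-5 scoring threshold are illustrative assumptions.

```python
# Sketch of the self-augmentation + self-curation loop.
# `generate` is a stub standing in for a real LLM call (e.g. a fine-tuned
# Llama), so the control flow can run end to end.

def generate(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    if prompt.startswith("Score"):
        return "4"  # pretend the model rates the pair 4 out of 5
    return "What is instruction backtranslation?"

def backtranslate(web_texts):
    """Self-augmentation: predict an instruction for each scraped answer."""
    return [(generate(f"Generate a question that this text answers:\n{t}"), t)
            for t in web_texts]

def self_curate(pairs, threshold=4):
    """Self-curation: keep only pairs the model itself scores highly."""
    kept = []
    for question, answer in pairs:
        score = int(generate(
            f"Score (1-5) how well this answers the question:\n"
            f"Q: {question}\nA: {answer}"))
        if score >= threshold:
            kept.append((question, answer))
    return kept

pairs = backtranslate(
    ["Instruction backtranslation turns web text into training data."])
curated = self_curate(pairs)
print(len(curated))  # with this stub, the single pair passes the filter
```

The curated pairs would then feed a standard supervised fine-tuning step; the curation step is what keeps the mostly-noisy synthetic data from dragging the model down.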
🧐 The answers were ranked by win rate over the text-davinci-003 model as judged by GPT-4. On that metric the new Humpback model beats other open-source models that haven't been distillation-trained, but loses to pretty much all the great models like Vicuna 33B, WizardLM 13B, Claude, Claude 2, ChatGPT and GPT-4. Against human judges Humpback wins against LIMA, Claude, text-davinci-003, Guanaco and Falcon-Instruct.

🧐 The largest proprietary language models still reign sovereign, and smaller open-source models distillation-trained from GPT-4 are lagging behind.

https://arxiv.org/abs/2308.06259