ChatGPT https://openai.com/blog/chatgpt/ uses two fine-tuning passes:
1. Supervised fine-tuning on labeler demonstrations of proper responses
2. Labelers rank multiple model outputs, and this ranking data is used to train a reward model for reinforcement-learning fine-tuning
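The second pass above can be sketched in miniature. The snippet below fits a Bradley-Terry-style reward model from pairwise rankings: the preferred response should score higher than the rejected one. The feature vectors and the tiny linear model are illustrative stand-ins; the real system fine-tunes a full language model with a scalar reward head, and ChatGPT's exact implementation is not public.

```python
# Toy reward-model training from pairwise preferences (Bradley-Terry loss).
# Feature vectors stand in for learned response embeddings.
import math

def score(w, x):
    """Scalar reward for a response with feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, steps=200):
    """pairs: list of (preferred_features, rejected_features)."""
    w = [0.0] * dim
    for _ in range(steps):
        for better, worse in pairs:
            # Minimize -log sigmoid(score(better) - score(worse)).
            margin = score(w, better) - score(w, worse)
            p = 1.0 / (1.0 + math.exp(-margin))
            g = 1.0 - p  # gradient scale that widens the margin
            for i in range(dim):
                w[i] += lr * g * (better[i] - worse[i])
    return w

# Toy data: each pair is (features of preferred response, features of rejected one).
pairs = [([1.0, 0.2], [0.1, 0.3]), ([0.9, 0.5], [0.2, 0.4])]
w = train_reward_model(pairs, dim=2)
```

The trained scores can then serve as the reward signal for the RL fine-tuning pass.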
Introducing ChatGPT

We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.

It's taken a while, but reinforcement-learning-based fine-tuning of language models is finally starting to take off.

My group did our first RL-based "alignment" of a language model in 2019: https://arxiv.org/abs/2001.08764. We've done a lot of it since. Glad to see our hunch was right, even if we're massively outclassed when it comes to resources for training and human labelers.

Reducing Non-Normative Text Generation from Language Models

Large-scale, transformer-based language models such as GPT-2 are pretrained on diverse corpora scraped from the internet. Consequently, they are prone to generating non-normative text (i.e. in violation of social norms). We introduce a technique for fine-tuning GPT-2, using a policy gradient reinforcement learning technique and a normative text classifier to produce reward and punishment values. We evaluate our technique on five data sets using automated and human participant experiments. The normative text classifier is 81-90% accurate when compared to gold-standard human judgments of normative and non-normative generated text. Our normative fine-tuning technique is able to reduce non-normative text by 27-61%, depending on the data set.
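The fine-tuning loop the abstract describes can be sketched as a REINFORCE update in which a classifier supplies reward and punishment values. In this toy version the "language model" is a single softmax over four tokens and the classifier is a hard-coded set; the actual work fine-tunes GPT-2 and classifies whole generated passages.

```python
# Toy policy-gradient (REINFORCE) fine-tuning with a classifier-based reward:
# non-normative tokens get punishment (-1), normative tokens get reward (+1).
import math
import random

random.seed(0)
VOCAB = ["kind", "helpful", "insult", "threat"]
NON_NORMATIVE = {"insult", "threat"}  # stand-in for the normative classifier

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [0.0] * len(VOCAB)
lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    tok = sample(probs)
    reward = -1.0 if VOCAB[tok] in NON_NORMATIVE else 1.0
    # REINFORCE: grad of log pi(tok) is one_hot(tok) - probs.
    for i in range(len(logits)):
        grad = (1.0 if i == tok else 0.0) - probs[i]
        logits[i] += lr * reward * grad

probs = softmax(logits)
bad_mass = sum(p for t, p in zip(VOCAB, probs) if t in NON_NORMATIVE)
```

After training, `bad_mass` (the probability assigned to non-normative tokens) shrinks toward zero, mirroring the 27-61% reduction reported at full scale.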


The spiritual successor of our RL-based language model alignment is currently at #NeurIPS: https://arxiv.org/abs/2205.13636

For those who want to get into RL-based fine-tuning of LMs, two frameworks have recently been released:
1. RL4LM: https://github.com/allenai/RL4LMs by @rajammanabrolu
2. TRLX: https://github.com/CarperAI/trlx by Louis Castricato et al. at EleutherAI

Both are by former members of my research team.

Quark: Controllable Text Generation with Reinforced Unlearning

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining nearby the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.
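Quark's step (ii), sorting samples into reward quantiles and tagging each with a quantile token, can be sketched as below. The `<rk_q>` token names are illustrative, not the paper's exact vocabulary.

```python
# Sketch of Quark's quantile step: sort sampled texts by reward and prepend
# a quantile token; training then applies a standard LM loss to each tagged
# sample (plus a KL penalty to the original model, omitted here).
def quantize_rewards(samples, rewards, n_quantiles=5):
    """Return (quantile_token, sample) pairs, bucketed by reward (ascending)."""
    order = sorted(range(len(samples)), key=lambda i: rewards[i])
    per_bucket = len(samples) / n_quantiles
    tagged = []
    for rank, i in enumerate(order):
        q = min(int(rank / per_bucket), n_quantiles - 1)
        tagged.append((f"<rk_{q}>", samples[i]))  # highest q = highest reward
    return tagged

tagged = quantize_rewards(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.3],
                          n_quantiles=2)
```

At generation time the model is conditioned on the top quantile token (here `<rk_1>`), so it produces text resembling its highest-reward samples.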
