Users interacting with natural language systems can provide feedback in realtime, and this feedback is a super duper learning signal! So: deploy, train, repeat!

https://arxiv.org/abs/2212.09710

Last PhD paper w/@alsuhr/[email protected] ... 🧵

Continual Learning for Instruction Following from Realtime Feedback

We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.

Language is acquired by hearing/reading what others say, forming hypotheses, acting on them, and seeing how others/the world react. When systems get instructions from human users, they can do the same, and learn from the feedback humans give.
This feedback is a complex learning signal, but the gist of our technical approach is super simple: this is a contextual bandit scenario! We account for human response patterns, and use a simple mapping of feedback to immediate rewards.
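The gist above — binary user feedback mapped to an immediate scalar reward that drives a bandit-style policy update — can be sketched roughly like this. This is a toy linear policy in numpy under my own assumptions (the function names, the clipping rule, and the linear parameterization are illustrative, not the paper's actual implementation):

```python
import numpy as np

def feedback_to_reward(positive: int, negative: int) -> float:
    """Map raw binary user feedback (counts of positive/negative signals
    during an instruction execution) to an immediate scalar reward.
    Illustrative rule: net feedback, clipped to [-1, 1]."""
    return float(np.clip(positive - negative, -1, 1))

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def bandit_update(W, context, action, reward, lr=0.1):
    """One REINFORCE-style contextual-bandit step on a linear policy:
    ascend the gradient of reward * log pi(action | context) w.r.t. W."""
    probs = softmax(W @ context)
    grad_logits = -probs
    grad_logits[action] += 1.0          # d log pi(action) / d logits
    W += lr * reward * np.outer(grad_logits, context)
    return W
```

Positive feedback raises the probability of the executed action in that context; negative feedback lowers it, with no value function or rollout needed — the reward is consumed immediately, which is what makes the bandit framing so simple.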
This simple approach is very effective, and surprisingly robust. The feedback signal is roughly equivalent to supervised demonstrations. You can even start with much weaker models, at a cost to initial user experience, but quickly improve dramatically!