Reducing Computational Cost During Robot Navigation and Human–Robot Interaction with a Human-Inspired Reinforcement Learning Architecture - International Journal of Social Robotics

We present a new neuro-inspired reinforcement learning architecture for robot online learning and decision-making during both social and non-social scenarios. The goal is to take inspiration from the way humans dynamically and autonomously adapt their behavior according to variations in their own performance while minimizing cognitive effort. Following computational neuroscience principles, the architecture combines model-based (MB) and model-free (MF) reinforcement learning (RL). The main novelty here consists in arbitrating with a meta-controller which selects the current learning strategy according to a trade-off between efficiency and computational cost. The MB strategy, which builds a model of the long-term effects of actions and uses this model to decide through dynamic programming, enables flexible adaptation to task changes at the expense of high computation costs. The MF strategy is less flexible but also 1000 times less costly, and learns by observation of MB decisions. We test the architecture in three experiments: a navigation task in a real environment with task changes (wall configuration changes, goal location changes); a simulated object manipulation task under human teaching signals; and a simulated human–robot cooperation task to tidy up objects on a table. We show that our human-inspired strategy coordination method enables the robot to maintain an optimal performance in terms of reward and computational cost compared to an MB expert alone, which achieves the best performance but has the highest computational cost. We also show that the method makes it possible to cope with sudden changes in the environment, goal changes or changes in the behavior of the human partner during interaction tasks. The robots that performed these experiments, whether real or virtual, all used the same set of parameters, thus showing the generality of the method.

For the last 15 or so years, Mehdi Khamassi and I have been playing with combinations of multiple decision-making modules.

We did that in a neuroscience context (how can we explain behavior & neural data using this framework?) and in a robotics context (how can we make robots smarter?).

This time, it's robotics:
inspired by behavioral neuroscience, we combine two types of reinforcement learning (RL) algorithms:
* the slow-to-learn but cheap-to-compute Q-learning (model-free family of RL),
* the fast-to-learn but computationally expensive Value Iteration (model-based family of RL),

and we use a meta-controller to select, each time a decision has to be made, the best RL module.

And when we say "the best", we mean that we try to get the best of both worlds: learning fast at minimal computational cost!
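To make the contrast concrete, here is a minimal tabular sketch of the two experts. This is not the code used in the paper; the state/action sizes, hyperparameters, and function names are purely illustrative.

```python
import numpy as np

S, A = 25, 4            # e.g. a small 5x5 grid-world
gamma, alpha = 0.9, 0.1

# --- Model-free expert: tabular Q-learning (cheap per-decision update) ---
Q_mf = np.zeros((S, A))

def mf_update(s, a, r, s_next):
    """One Q-learning backup: O(A) work after each transition."""
    td_target = r + gamma * Q_mf[s_next].max()
    Q_mf[s, a] += alpha * (td_target - Q_mf[s, a])

# --- Model-based expert: learned model + Value Iteration (costly planning) ---
T = np.ones((S, A, S)) / S   # transition model, estimated from experience
R = np.zeros((S, A))         # reward model, estimated from experience

def mb_plan(n_sweeps=50):
    """Full value-iteration sweeps over the learned model: O(S^2 * A) per sweep."""
    Q_mb = np.zeros((S, A))
    for _ in range(n_sweeps):
        V = Q_mb.max(axis=1)          # state values under the current Q
        Q_mb = R + gamma * (T @ V)    # Bellman backup; (S, A, S) @ (S,) -> (S, A)
    return Q_mb
```

The asymmetry is the whole point: the model-free update touches one row of a table, while the model-based planner sweeps the entire learned model every time it is asked to decide.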

How does it work (hand-waving version)?

In a given situation, our meta-controller looks at the past performance of each module:
* how certain was the module about the action it proposed?
* how much time did it need to make the decision?

(technical parenthesis: here we do not compute uncertainty explicitly, but use an easy-to-compute proxy: the entropy of the action values.
It is far from perfect, but it is good enough, and it is cheap: we don't want to replace heavy decision-making computations with heavy computations just to decide how best to save computation.)
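In code, the cheap proxy could look like the sketch below. This is only an illustration of the idea, not the paper's exact arbitration criterion; the softmax temperature `beta` and the time-cost weight `kappa` are made-up names and values.

```python
import numpy as np

def action_entropy(q_values, beta=5.0):
    """Entropy of a softmax over one expert's Q-values for the current state.
    Low entropy means the expert is confident about which action is best."""
    p = np.exp(beta * (q_values - q_values.max()))  # numerically stable softmax
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def choose_expert(q_mf_s, q_mb_s, mf_time, mb_time, kappa=0.1):
    """Hypothetical arbitration: prefer the expert that is confident (low entropy)
    and cheap (short measured decision time)."""
    score_mf = -action_entropy(q_mf_s) - kappa * mf_time
    score_mb = -action_entropy(q_mb_s) - kappa * mb_time
    return "MF" if score_mf >= score_mb else "MB"
```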

The results?

It works pretty well in a robotic navigation task, where we basically get the performance of the Value Iteration algorithm (the good but costly one), for only 30% of its computational cost!

OK, but why publish that in a "Social Robotics" journal?

Well, in human-robot interaction tasks, the architecture performs pretty well too: it can take advantage of human inputs to learn faster, BUT it is also resilient to the human providing erroneous feedback, or omitting to provide feedback.

And we're quite happy with that, because:
* in a robotics context, doing millions of trials before being operational is not an option (it wears out the robot); learning fast, with a minimal number of trials, is required,
* in an autonomous mobile robotics context, the robot has limited energy resources: the less computation we use, the longer it can operate. And we don't want to rely on computations "in the cloud", because we want to avoid situations where the cloud is not reachable...