In-context learning in #transformers is one of those mysterious #ML phenomena that needs more attention (no pun intended) from #neuroscientists.

In-context learning is a phenomenon in large language models where the model "learns" a task just by observing a few input-output examples in its prompt, without updating any parameters.
"Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question-answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning became a popular topic of study and discussion." (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
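To make the "prompt engineering" idea concrete, here's a minimal sketch of how a few-shot prompt is assembled: input-output demonstrations are concatenated and the model is asked to complete the last line, with no weight updates anywhere. The translation pairs below are just illustrative.

```python
# A few-shot "prompt" for in-context learning: the model sees input-output
# demonstrations and must complete the final line -- no weights change.
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
query = "plush giraffe"

prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += f"\nEnglish: {query}\nFrench:"
print(prompt)
```

Feeding this string to a large model typically yields the translation of the query, even though the model was never fine-tuned on translation pairs in this format.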

Interestingly, two recent works (H/T @roydanroy) showed that in-context learning (at least under certain conditions) matches the solutions found by gradient descent:
1) Transformers learn in-context by gradient descent: https://arxiv.org/abs/2212.07677
2) What learning algorithm is in-context learning? Investigations with linear models: https://arxiv.org/abs/2211.15661
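A toy sketch of the comparison these papers set up (this is only the gradient-descent reference learner, not the transformer; the task and numbers are illustrative): run gradient descent on the squared loss over the in-context (x, y) pairs, then compare its prediction on a held-out query with the model's in-context prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy in-context regression task: the "prompt" is a set of (x_i, y_i) pairs
# from a hidden linear function; the query is a new x whose y must be predicted.
d, n = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_query = rng.normal(size=d)

# Reference learner: plain gradient descent on the squared loss over the context.
w = np.zeros(d)
lr = 0.05
for _ in range(2000):
    w -= lr * (X.T @ (X @ w - y)) / n

pred_gd = x_query @ w
pred_true = x_query @ w_true
print(abs(pred_gd - pred_true))  # ≈ 0: GD recovers the hidden linear map
```

The papers' claim is (roughly) that a trained transformer's in-context prediction on `x_query` tracks `pred_gd` as the number of context examples or layers varies.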

In #neuroscience, synaptic plasticity is generally thought to be the mechanism underlying many of the behavioral improvements that are loosely referred to as learning.

Could in-context #learning be an alternative mechanism underlying at least some behavioral improvements? Given the suggested similarities between #hippocampus representation learning and transformers (https://arxiv.org/abs/2112.04035), it'd be interesting to explore the implications of in-context learning for our understanding of #memory formation in the hippocampus. #NeuroAI

@ShahabBakht The same phenomenon was observed in RNNs (on a smaller scale, of course) back in 2001 by Hochreiter et al. (https://link.springer.com/chapter/10.1007/3-540-44668-0_13) and by others more recently.

We wrote about how this phenomenon could be an alternative to synaptic plasticity for rapid learning in biology: https://www.biorxiv.org/content/10.1101/2021.01.25.428153v1
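A minimal illustration of the core idea (my own toy, not from the cited paper): with fixed weights, a recurrent state can accumulate task information online, so all the "learning" lives in activations rather than in synaptic changes. Here the state just estimates a hidden task variable from noisy observations.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_task = 3.7                     # hidden task parameter for this "episode"

# Fixed-weight state update: h accumulates [running sum, count].
# No parameter is ever modified -- adaptation happens purely in the state.
h = np.zeros(2)
for t in range(200):
    obs = mu_task + rng.normal()  # noisy observation of the task variable
    h = h + np.array([obs, 1.0])  # fixed "recurrent" update

estimate = h[0] / h[1]            # readout from state; weights untouched
print(estimate)  # ≈ 3.7
```

The analogy: a meta-trained RNN (or a transformer at inference) does something like this implicitly, which is why fast behavioral improvement needn't require synaptic plasticity.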

Others have shown links to biological data in the context of RL: https://www.nature.com/articles/s41593-018-0147-8

A lot of that probably applies directly in the context of Transformers as well?

@anandsubramoney

Very cool. Thanks for sharing!
If I understand correctly, in the papers you shared, the models were explicitly optimized for learning to learn (or meta-learning), right? In-context learning in LLMs seems to be an emergent behavior without explicit meta-learning objectives.

@ShahabBakht You're right, the RNNs are indeed explicitly optimised for fast (in-context) learning.

But my intuition is that (a) once the training's been done, the dynamics in both cases are similar, and (b) if RNNs were somehow magically scaled up to transformer level, one might see the same emergent property without explicit meta-training.

It's probably not too hard to verify (a); but (b) is a bit harder to check.

@anandsubramoney

Agree with both. Interesting ideas to check.