In-context learning in #transformers is one of those mysterious #ML phenomena that needs more attention (no pun intended) from #neuroscientists.

In-context learning is a phenomenon in large language models where the model "learns" a task just by observing some input-output examples, without updating any parameters.
"Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question-answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning became a popular topic of study and discussion." (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)

Interestingly, two recent works (H/T @roydanroy) showed that in-context learning (at least under certain conditions) matches solutions found by gradient descent:
1) Transformers learn in-context by gradient descent: https://arxiv.org/abs/2212.07677
2) What learning algorithm is in-context learning? Investigations with linear models: https://arxiv.org/abs/2211.15661
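A minimal numerical sketch of the equivalence these papers discuss (my toy example, not either paper's actual construction): for a linear task, one gradient-descent step from zero weights on the in-context examples gives exactly the same prediction as an unnormalised linear-attention readout over those examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8                      # input dim, number of in-context examples
w_true = rng.normal(size=d)      # hypothetical ground-truth linear task
X = rng.normal(size=(n, d))     # in-context inputs x_i
y = X @ w_true                  # in-context targets y_i
x_query = rng.normal(size=d)    # held-out test input

# (1) One GD step on the squared loss, starting from W = 0 with lr = 1:
#     the gradient at W = 0 is -sum_i y_i x_i^T, so the updated weights
#     are simply sum_i y_i x_i^T.
w_gd = y @ X                    # shape (d,)
pred_gd = w_gd @ x_query

# (2) Unnormalised linear attention over the demonstrations:
#     keys = x_i, values = y_i, query = x_query.
attn_scores = X @ x_query       # <x_i, x_query> for each demonstration
pred_attn = y @ attn_scores     # sum_i y_i <x_i, x_query>

# Term by term these are the same sum, so the two predictions coincide.
assert np.allclose(pred_gd, pred_attn)
```

The equality here is exact because the task is linear and we take a single step from zero; the papers above work out when and how trained transformers actually implement something like this.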

In #neuroscience, synaptic plasticity is generally thought to be the mechanism underlying many of the behavioral improvements that are loosely referred to as learning.

Could in-context #learning be an alternative mechanism underlying at least some behavioral improvements? Given the suggested similarities between representation learning in the #hippocampus and in transformers (https://arxiv.org/abs/2112.04035), it would be interesting to explore the implications of in-context learning for our understanding of #memory formation in the hippocampus. #NeuroAI

@ShahabBakht @roydanroy Fascinating! Could this be used to prompt a task beyond the context window size?

@lowrank_adrian @roydanroy

If the induction-head theory (as suggested here: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) is true, then I imagine it basically underlies any prompt-based few-shot learning in LLMs.

@ShahabBakht @lowrank_adrian @roydanroy What's the justification for calling this 'learning'? Isn't this all about biasing the output of the model by modifying the input? As far as I understand, the actual learning is occurring on the side of the agent fine-tuning the input, which at the moment is a human but in principle could be another model in actual learning mode.

@barbosa @lowrank_adrian @roydanroy

It’s beyond prompt engineering. It’s learning in the sense that you give the model a few input-output pairs that define a task, then test it on a new input that wasn’t included in those pairs. See the two examples in the image.
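For concreteness, a hedged sketch (mine, not the examples in the image) of what such a few-shot prompt looks like: demonstration pairs define the task, and the model is asked to complete a held-out query.

```python
# Hypothetical few-shot prompt: the input-output pairs define the task
# (here, English -> French translation); the query has no answer attached.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

prompt = "\n".join(f"{x} => {y}" for x, y in demonstrations)
prompt += f"\n{query} =>"
print(prompt)
# The model's continuation after "peppermint =>" is its answer for the
# unseen input -- produced with no parameter updates at all.
```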

@ShahabBakht @lowrank_adrian @roydanroy Maybe it's my neuroscientist bias, but for me to consider this learning, the input at testing time would have to be just the green text. Obviously that wouldn't work, because this doesn't lead to any change whatsoever in the model, which IMO is a requirement to call anything learning. If not, then how is learning defined?

@barbosa @lowrank_adrian @roydanroy

Yes. If you define learning as changes in synaptic weights, then this can’t be called learning. But if you define learning as a positive change in behaviour, then this could be called learning.

@ShahabBakht @barbosa @lowrank_adrian @roydanroy rapid learning clearly doesn’t require synaptic changes. Motor adaptation is a prime example of this.

@TrackingActions @ShahabBakht @barbosa @lowrank_adrian @roydanroy Exactly! We tried to illustrate how this one-shot learning mechanism could be implemented in nets of spiking neurons [1].

An RNN with hundreds of spiking neurons could "learn" in one shot the position of the platform in a Morris water maze and run straight to it on the second trial. This happens even though the synaptic weights are frozen! Instead of using ΔW, it memorizes the platform location in the network state (e.g. activity-silent adaptation or attractors).
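A much-simplified, non-spiking sketch of that idea (my toy illustration, not the model in [1]): a rate RNN with frozen random weights still carries cue information forward in its persistent activity, so different cues leave different network states with zero weight change.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 32
W_in = rng.normal(scale=0.5, size=(d_h, d_in))                 # frozen input weights
W_rec = rng.normal(scale=1.5 / np.sqrt(d_h), size=(d_h, d_h))  # frozen recurrent weights

def run(cue, n_steps=10):
    """Present a cue at t=0, then let the frozen RNN run with no further input."""
    h = np.tanh(W_in @ cue)
    for _ in range(n_steps):
        h = np.tanh(W_rec @ h)  # no weight update anywhere
    return h

cue_a = np.array([1.0, 0.0, 0.0])  # e.g. "platform in the north"
cue_b = np.array([0.0, 1.0, 0.0])  # e.g. "platform in the east"

h_a, h_b = run(cue_a), run(cue_b)
# The memory of which cue was shown lives in the activity, not in any ΔW.
assert not np.allclose(h_a, h_b)
```

In the actual paper this state-based memory emerges from learning-to-learn over many tasks, rather than from hand-picked random weights as here.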

This ability emerged in the network much like in a transformer: it had seen many similar tasks before and optimized its weights to solve them via gradient descent.

The recent plasticity data in CA1 makes our model look too simplistic though... Finding how multiple areas interact in this context would also be very important imo.

[1] Figure 3 in this paper: https://proceedings.neurips.cc/paper/2018/hash/c203d8a151612acf12457e4d67635a95-Abstract.html

Long short-term memory and Learning-to-learn in networks of spiking neurons

@BellecGuillaume @TrackingActions @barbosa @lowrank_adrian @roydanroy

This is also another related paper: https://arxiv.org/abs/2212.10559

“The results prove that [in-context learning] behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level”

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at https://aka.ms/icl.