In-context learning in #transformers is one of those mysterious #ML phenomena that needs more attention (no pun intended) from #neuroscientists.

In-context learning is a phenomenon in large language models where the model "learns" a task just by observing some input-output examples, without updating any parameters.
"Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question-answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning became a popular topic of study and discussion." (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)

Interestingly, two recent works (H/T @roydanroy) showed that in-context learning (at least under certain conditions) matches solutions found by gradient descent:
1) Transformers learn in-context by gradient descent: https://arxiv.org/abs/2212.07677
2) What learning algorithm is in-context learning? Investigations with linear models: https://arxiv.org/abs/2211.15661
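The simplest case of that equivalence is easy to check numerically. A minimal numpy sketch (my own toy construction, not code from either paper): for linear regression, one gradient-descent step from zero weights makes exactly the same test prediction as one layer of unnormalized linear self-attention that treats the in-context examples as key-value pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 8, 4, 0.1          # number of in-context examples, input dim, GD step size

X = rng.normal(size=(n, d))    # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                 # in-context targets y_i (scalar regression)
x_q = rng.normal(size=d)       # query input

# One gradient step on L(w) = 0.5 * sum_i (y_i - w.x_i)^2, starting from w = 0:
# w_1 = eta * sum_i y_i x_i
w_gd = eta * (y @ X)
pred_gd = w_gd @ x_q

# Unnormalized linear self-attention with keys = x_i, values = y_i, query = x_q:
# output = eta * sum_i y_i (x_i . x_q)
pred_attn = eta * (y @ (X @ x_q))

print(np.allclose(pred_gd, pred_attn))   # True: same prediction
```

The papers go well beyond this (multiple layers, trained weights, nonlinear cases), but the one-step identity is the core of the construction.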

In #neuroscience, synaptic plasticity is generally thought to be the mechanism underlying many of the behavioral improvements that are loosely referred to as learning.

Could in-context #learning be an alternative mechanism underlying at least some behavioral improvements? Given the suggested similarities between representation learning in the #hippocampus and transformers (https://arxiv.org/abs/2112.04035), it'd be interesting to explore the implications of in-context learning for our understanding of #memory formation in the hippocampus. #NeuroAI

@roydanroy

Sorry for cross posting, but this 🐦​ 🧵​ should be added to this 🐘​🧵​: https://twitter.com/arankomatsuzaki/status/1622666312219598864?s=20&t=OHK_NTK6lOsczegsfnvYTQ

And this paper: "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention" https://arxiv.org/abs/2202.05798

"it expresses the dual form of a linear layer trained by gradient descent as a key-value system storing training patterns as key-value pairs, which computes the output from a test query using attention over the key-value memory"
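That dual form can be verified directly. A minimal numpy sketch (my own, assuming plain SGD on a squared loss; not code from the paper): the trained layer's test output equals its initial output plus unnormalized attention over the stored (input, scaled-error) pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, steps, eta = 5, 3, 20, 0.05

W0 = 0.1 * rng.normal(size=(d_out, d_in))   # initial weights
W = W0.copy()
keys, values = [], []                        # training patterns as key-value pairs

for _ in range(steps):
    x = rng.normal(size=d_in)                # training input
    t = rng.normal(size=d_out)               # training target
    err = W @ x - t                          # gradient of 0.5*||Wx - t||^2 w.r.t. Wx
    W -= eta * np.outer(err, x)              # SGD update: a rank-1 outer product
    keys.append(x)                           # key   = the training input
    values.append(-eta * err)                # value = the scaled error signal

x_q = rng.normal(size=d_in)                  # unseen test query

# Dual form: trained-layer output = initial output + attention over stored pairs
scores = np.array([k @ x_q for k in keys])   # key-query dot products
dual_out = W0 @ x_q + sum(s * v for s, v in zip(scores, values))

print(np.allclose(W @ x_q, dual_out))        # True: the two views coincide
```

This works because every SGD update is a rank-1 outer product, so the final weight matrix is just the initial one plus a sum over training patterns.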

Aran Komatsuzaki on Twitter

“Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N) Ref: https://t.co/BXQvCV60pa https://t.co/i5lte2kuMW”

@ShahabBakht @roydanroy Fascinating! Could this be used to prompt a task beyond the context window size?

@lowrank_adrian @roydanroy

If the induction-head theory (as suggested here: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) is true, then I imagine it basically underlies any prompt-based few-shot learning in LLMs.

@ShahabBakht @lowrank_adrian @roydanroy what's the justification to call this 'learning'? isn't this all about biasing the output of the models by modifying the input? as far as I understand, the actual learning is occurring on the side of the agent fine-tuning the input, which at the moment is a human but in principle could be another model in actual learning mode

@barbosa @lowrank_adrian @roydanroy

It’s beyond prompt engineering. It’s learning in the sense that you give a few pairs of input-output samples that define a task to the model and test it on a new input that wasn’t included in the training pairs. See the two examples in the image.
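For readers who can't see the image, here is a toy prompt of the same shape (a purely illustrative sketch with a made-up word-reversal task, not the examples from the image):

```python
# A toy in-context task specification: the input-output pairs define a
# made-up rule (reverse the word), and the final line is a held-out query
# the model must complete without any parameter update.
examples = [("apple", "elppa"), ("stone", "enots"), ("cloud", "duolc")]
prompt = "\n".join(f"input: {x} -> output: {y}" for x, y in examples)
prompt += "\ninput: river -> output:"   # expected completion: "revir"
print(prompt)
```

The model is never told the rule; it has to infer it from the three demonstrations and apply it to an input that appeared in none of them.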

@ShahabBakht @lowrank_adrian @roydanroy maybe it's my neuroscientist bias, but for me to consider this learning the input would have to be, at testing time, just the green text. obviously that wouldn't work, because this doesn't lead to any change whatsoever in model, which IMO is a requirement to call anything learning. if not, then how is learning defined?

@barbosa @lowrank_adrian @roydanroy

Yes. If you define learning as changes in synaptic weights, then this can’t be called learning. But if you define learning as a positive change in behaviour, then this could be called learning.

@ShahabBakht @lowrank_adrian @roydanroy but there is no change in behavior beyond the input, which sustains the system in the 'right' state to output the expected answer. right? the contextual input puts the network in a state it previously learned. how is this different from a very high dimensional context-dependent task (eg @SussilloDavid RNNs)? the ultimate test for actual learning would be to remove the contextual input, the same as we would do for animals.

@barbosa @lowrank_adrian @roydanroy @SussilloDavid

Here the difference is that the model hasn’t seen the exact same rule-based task during training (ie the prediction pretraining). It’d be similar to giving someone some instructions and showing some examples for a new task, and they manage to do it right away. Would you call that learning?

@ShahabBakht @lowrank_adrian @roydanroy @SussilloDavid

In humans I would be tempted to call it learning because of all we know about them. Because of all we know about transformers, I am not confident it is reasonable to call it learning without butchering neuroscience/psychology semantics. In both the human and model case the input and behavior changed. To clarify what is driving the apparent behavioral change I would remove the input, and I would conclude that the human learned but the model didn't.

@ShahabBakht @lowrank_adrian @roydanroy @SussilloDavid

re: exact rule-based training, indeed LLM training is different than RNNs, but after training both can be seen as dynamical systems with specific embedded attractor landscapes. Quite literally like Hopfield networks. In both cases the input puts the network in one of these states. Finally, not to be pedantic, but in the RNN case you can also train them with noisy inputs, so technically not the same training and test set.

@barbosa @ShahabBakht @roydanroy @SussilloDavid I think the notions here come from the few-shot learning / meta-learning terminology (see the famous paper "Language Models are Few-Shot Learners"). There is a qualitative difference in persistence, but consider this situation: you teach your friend how to play a new game with examples, yet two weeks later he forgot. What would you call that?
@barbosa @ShahabBakht @roydanroy @SussilloDavid On the other hand, when the train data contains TBs of text, hard to be convinced there were no similar examples inside
@lowrank_adrian @ShahabBakht @roydanroy @SussilloDavid agreed, but whether it's overfitting is another issue. they definitely learn the TBs of examples during training. the question is whether the model is learning from the examples during testing. i think calling that learning based on superficial similarities with how animals learn is misleading

@lowrank_adrian @ShahabBakht @roydanroy @SussilloDavid yeah, those notions are also misleading. using terms from neuroscience/psych loosely is a common sin in ML ;)

your example: it tells that my friend forgot something that they learned (assuming you properly checked back then). forgetting (and learning!) occurs at different timescales, that's not an issue here

@barbosa What about an RNN that within a single sequence is presented a few examples of a task and then performs the same task, would that be termed learning? What is "learning" in the first place?

@lowrank_adrian this example is precisely the LLM example we've been discussing, so no

the definition of learning being used loosely here was defined for animals (eg "learning by examples"), but this is misleading in this case because the examples/input are still on. nobody called pattern completion by Hopfield networks learning, yet that is likely what is happening here. IMO a definition of learning that doesn't include some structural change leads to confusion. happy to be convinced otherwise!

@lowrank_adrian actually, if the examples were not on, but stored in the dynamics, it would still not be learning. the attractor used for storage would need to be learned previously.
@barbosa @lowrank_adrian @ShahabBakht Fascinating conversation. I tend to agree with @barbosa IMHO an agent learns a task/concept/phenomenon at a certain level of generality. If it is general enough, then you can prime the agent using contextualized examples to perform the task/interpret the concept/report the phenomenon with a specific flavor. If the agent is not trained generally enough, no amount of priming can elicit the desired outputs. 🤔
@adel @barbosa @ShahabBakht I see your point to both of you! Maybe it could be termed "few-shot understanding" then? ;)
@lowrank_adrian @adel @ShahabBakht understanding?! how about few shot intelligence? or few shot sentience?
@barbosa @adel @ShahabBakht I don't think I see the relationship with sentience, but "understanding" could cover the fact of generating novel behavior that solves a never-seen-before task without learning a new connectivity.
@lowrank_adrian @adel @ShahabBakht sorry, i thought you were trolling and trolled back. IMO few-shot understanding is even more controversial than few-shot learning. these models can definitely learn a lot of stuff, it's unclear to me how much they understand

@barbosa @lowrank_adrian @roydanroy @SussilloDavid

If your definition of learning presumes structural changes, then this, by definition, can’t be called learning. I’d say though what they call in-context learning might be the closest to the eureka moment or the moment of insight, which is also sometimes referred to as abrupt learning: eg https://www.cell.com/current-biology/fulltext/S0960-9822(06)00217-X

Or this one: https://www.cell.com/current-biology/fulltext/S0960-9822(22)00598-X

@ShahabBakht @lowrank_adrian @roydanroy @SussilloDavid do you dare formulate a definition of learning? i know it's hard, but I am really uncomfortable including this in a definition of learning.

Thank you for the papers!

@barbosa @lowrank_adrian @roydanroy @SussilloDavid

Tough question. To me, acquiring new skills and knowledge is learning. This includes perceiving stimuli that I couldn't perceive before. With this definition, the examples in the papers I posted could also be considered learning.
But I guess we're getting into semantics here.

@ShahabBakht @barbosa @lowrank_adrian @roydanroy rapid learning clearly doesn’t require synaptic changes. Motor adaptation is a prime example of this.

@TrackingActions @ShahabBakht @lowrank_adrian @roydanroy

I don't know enough about rapid learning, but will definitely dig into it!

However, let's say you train a network to learn a ring attractor by training on angles, and then during testing you use that ring attractor to store a color. Did the network 0-shot learn to store colors?

You can think of more general cases in which an input modifies a learned dynamical landscape and it suddenly performs a novel computation.

Is this learning?

@TrackingActions @ShahabBakht @barbosa @lowrank_adrian @roydanroy Exactly! We tried to illustrate how this one-shot learning mechanism could be implemented in nets of spiking neurons [1].

An RNN with hundreds of spiking neurons could "learn" in one shot the position of the platform in a Morris water maze and run straight to it on the second trial. This happens although the synaptic weights are frozen! Instead of using ∆W, it memorizes the platform location in the network state (e.g. activity-silent adaptation or attractors).

This emerged in the network like in a transformer: it had seen many similar tasks before and optimized its weights to solve them via gradient descent.

The recent plasticity data in CA1 makes our model look too simplistic though... Finding how multiple areas interact in this context would also be very important imo.

[1] Figure 3 in this paper: https://proceedings.neurips.cc/paper/2018/hash/c203d8a151612acf12457e4d67635a95-Abstract.html

Long short-term memory and Learning-to-learn in networks of spiking neurons

@BellecGuillaume @TrackingActions @barbosa @lowrank_adrian @roydanroy

This is also another related paper: https://arxiv.org/abs/2212.10559

“The results prove that [in-context learning] behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level”

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at \url{https://aka.ms/icl}.

@ShahabBakht @roydanroy Wow so that could be @neuralturing ‘s idea of gradient descent in the brain, just in yet another packaging?.. (from Hopfield networks to modern energy based models now to this?)

@ampanmdagaba @roydanroy

Yes. This figure from von Oswald et al explains the hypothesis quite well.

@ShahabBakht The exact phenomenon was observed in RNNs (on a smaller scale, of course) back in 2001 by Hochreiter et al. (https://link.springer.com/chapter/10.1007/3-540-44668-0_13) and others more recently.

We wrote about how this phenomenon could be an alternative to synaptic plasticity for rapid learning in biology: https://www.biorxiv.org/content/10.1101/2021.01.25.428153v1

Others have shown links to biological data in the context of RL: https://www.nature.com/articles/s41593-018-0147-8

A lot of that probably applies directly in the context of Transformers as well?

Learning to Learn Using Gradient Descent

This paper introduces the application of gradient descent methods to meta-learning. The concept of “meta-learning”, i.e. of a system that improves or discovers a learning algorithm, has been of interest in machine learning for decades because of its...


@anandsubramoney

Very cool. Thanks for sharing!
If I understand correctly, in the papers you shared, the models were explicitly optimized for learning to learn (or meta-learning), right? In-context learning in LLMs seems to be an emergent behavior without explicit meta-learning objectives.

@ShahabBakht You're right, the RNNs are indeed explicitly optimised for fast (in-context) learning.

But my intuition is that (a) once the training's been done, the dynamics in both cases are similar, and (b) if RNNs were somehow magically scaled up to transformer level, one might see the same emergent property without explicit meta-training.

It's probably not too hard to verify (a); but (b) is a bit harder to check.

@anandsubramoney

Agree with both. Interesting ideas to check.

@ShahabBakht @roydanroy

Shahab, naive question, how does it work (very roughly) if no parameters are updated?

@PessoaBrain @ShahabBakht @roydanroy

My guess is that for transformer models, they are changing their attention based on the context, so the "attention weights" take on the role of the "synaptic weights", and given the depth of those architectures, this allows such complex learning on the fly. Would be curious how it works for RNNs. I suspect that the state of the RNN plays a similar role in storing constraints, but this looks harder to reverse engineer.

@PessoaBrain @introspection @roydanroy

Yes, I have the same intuition as @introspection. But mechanistically it’s an active topic of study. One of the papers I mentioned above (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) suggests “induction heads” as the mechanism underlying this phenomenon: “a circuit whose function is to look back over the sequence for previous instances of the current token (call it A), find the token that came after it last time (call it B), and then predict that the same completion will occur again (e.g. forming the sequence [A][B] … [A] → [B]).”
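The induction-head pattern is simple enough to sketch as literal code (a toy lookup on exact token matches, my own illustration; the real circuit operates on learned representations inside attention heads):

```python
def induction_head_prediction(tokens):
    """Toy version of the induction-head pattern [A][B] ... [A] -> [B]:
    scan back for the previous occurrence of the current (last) token
    and predict the token that came after it last time."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # walk the prefix backwards
        if tokens[i] == current:
            return tokens[i + 1]               # predict last time's successor
    return None                                # no earlier occurrence

print(induction_head_prediction(["A", "B", "C", "A"]))  # -> B
```

In a transformer this match-and-copy behavior is implemented by attention (one head attends from the current token back to the earlier [A], another copies forward the token after it) rather than by an explicit loop.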

@introspection @PessoaBrain @roydanroy

I’m also curious to see if the same behavior emerges in RNNs. Scale might be an important factor though.

@ShahabBakht @introspection @PessoaBrain @roydanroy I’m noodling on some ideas related to this and it’s way less mysterious to me now. I don’t think any real learning is taking place; it’s more like a retrieval of an existing template. Hopefully I’ll have a write-up on this soon.

@ShahabBakht @roydanroy

Jane Wang did some work related to this, actually:

https://www.nature.com/articles/s41593-018-0147-8

Prefrontal cortex as a meta-reinforcement learning system - Nature Neuroscience

Humans and other mammals are prodigious learners, partly because they also ‘learn how to learn’. Wang and colleagues present a new theory showing how learning to learn may arise from interactions between prefrontal cortex and the dopamine system.


@tyrell_turing @roydanroy

Very cool. The “emerged prefrontal-based learning algorithm” that they talk about is probably the closest to the in-context learning of LLMs.

Also, this shows that RNNs (LSTMs in this case) can also show the same behavior and it’s not specific to transformers. @introspection If I remember correctly, you were also curious about this.