In-context learning in #transformers is one of those mysterious #ML phenomena that needs more attention (no pun intended) from #neuroscientists.

In-context learning is a phenomenon in large language models where the model "learns" a task just by observing some input-output examples, without updating any parameters.
"Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question-answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning became a popular topic of study and discussion." (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
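As a concrete illustration (the prompt format and the word→color task here are made up for illustration, not taken from the linked article), a few-shot prompt just concatenates input-output pairs followed by a new query, with no weight update anywhere:

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) example pairs plus a new query into a prompt string."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")  # the model is asked to complete this line
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("sea", "blue"), ("grass", "green"), ("snow", "white")], "fire"
)
# A pretrained language model completing this prompt would plausibly answer
# "red" -- a task it was never explicitly trained on, specified only in-context.
print(prompt)
```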

Interestingly, two recent works (H/T @roydanroy) showed that in-context learning (at least under certain conditions) matches solutions found by gradient descent:
1) Transformers learn in-context by gradient descent: https://arxiv.org/abs/2212.07677
2) What learning algorithm is in-context learning? Investigations with linear models: https://arxiv.org/abs/2211.15661
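For intuition about what those papers claim, here is a toy sketch of the linear-regression setting they study (this is my own minimal numerical illustration, not the papers' transformer experiments): running plain gradient descent on the in-context examples converges to the same prediction as the least-squares solution, which is the learning algorithm the trained transformer's forward pass is argued to implement implicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))   # in-context example inputs
y = X @ w_true                # in-context example targets (noise-free)
x_query = rng.normal(size=d)  # held-out query point

# Gradient descent on the squared loss over the in-context examples,
# starting from w = 0 -- the algorithm the papers argue ICL mimics.
w = np.zeros(d)
lr = 0.01
for _ in range(5000):
    w -= lr * (X.T @ (X @ w - y)) / n

pred_gd = x_query @ w
pred_ls = x_query @ np.linalg.lstsq(X, y, rcond=None)[0]
print(pred_gd, pred_ls)  # the two predictions coincide
```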

In #neuroscience, synaptic plasticity is generally thought to be the mechanism underlying many of the behavioral improvements that are loosely referred to as learning.

Could in-context #learning be an alternative mechanism underlying at least some behavioral improvements? Given the suggested similarities between #hippocampus representation learning and transformers (https://arxiv.org/abs/2112.04035), it'd be interesting to explore the implications of in-context learning for our understanding of #memory formation in the hippocampus. #NeuroAI

@ShahabBakht @roydanroy Fascinating! Could this be used to prompt a task beyond the context window size?

@lowrank_adrian @roydanroy

If the induction-head theory (as suggested here: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) is true, then I imagine it basically underlies any prompt-based few-shot learning in LLMs.

@ShahabBakht @lowrank_adrian @roydanroy What's the justification to call this 'learning'? Isn't this all about biasing the output of the model by modifying the input? As far as I understand, the actual learning is occurring on the side of the agent fine-tuning the input, which at the moment is a human but in principle could be another model in actual learning mode.

@barbosa @lowrank_adrian @roydanroy

It’s beyond prompt engineering. It’s learning in the sense that you give a few pairs of input-output samples that define a task to the model and test it on a new input that wasn’t included in the training pairs. See the two examples in the image.

@ShahabBakht @lowrank_adrian @roydanroy Maybe it's my neuroscientist bias, but for me to consider this learning, the input would have to be, at testing time, just the green text. Obviously that wouldn't work, because this doesn't lead to any change whatsoever in the model, which IMO is a requirement to call anything learning. If not, then how is learning defined?

@barbosa @lowrank_adrian @roydanroy

Yes. If you define learning as changes in synaptic weights, then this can’t be called learning. But if you define learning as a positive change in behaviour, then this could be called learning.

@ShahabBakht @lowrank_adrian @roydanroy But there is no change in behavior beyond the input, which sustains the system in the 'right' state to output the expected answer, right? The contextual input puts the network in a state it previously learned. How is this different from a very high-dimensional context-dependent task (e.g. @SussilloDavid's RNNs)? The ultimate test for actual learning would be to remove the contextual input, the same as we would do for animals.

@barbosa @lowrank_adrian @roydanroy @SussilloDavid

Here the difference is that the model hasn't seen the exact same rule-based task during training (i.e. the prediction pretraining). It'd be similar to giving some instructions and showing some examples to someone for a new task, and they manage to do it right away. Would you call that learning?

@ShahabBakht @lowrank_adrian @roydanroy @SussilloDavid

In humans I would be tempted to call it learning because of all we know about them. Because of all we know about transformers, I am not confident it is reasonable to call it learning without butchering neuroscience/psychology semantics. In both the human and model cases, the input and behavior changed. To clarify what is driving the apparent behavioral change, I would remove the input, and I would conclude that the human learned but the model didn't.

@ShahabBakht @lowrank_adrian @roydanroy @SussilloDavid

Re: exact rule-based training, indeed LLM training is different from RNN training, but after training both can be seen as dynamical systems with specific embedded attractor landscapes. Quite literally like Hopfield networks. In both cases the input places the network in one of these states. Finally, not to be pedantic, but in the RNN case you can also train with noisy inputs, so technically the training and test sets are not the same.
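The Hopfield analogy can be made concrete with a tiny numerical sketch (the pattern and corruption here are arbitrary, chosen only for illustration): a pattern stored via the Hebbian rule is recovered from a corrupted cue by the retrieval dynamics alone, with no weight change at retrieval time.

```python
import numpy as np

# Store one pattern with the Hebbian outer-product rule.
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)  # no self-connections

# Corrupt three bits of the pattern to form the cue.
cue = pattern.copy()
cue[:3] = -cue[:3]

# Retrieval: iterate the update rule until the state settles.
# The weights W never change -- this is pattern completion, not learning.
state = cue.astype(float)
for _ in range(10):
    state = np.sign(W @ state)

print(state.astype(int))  # recovers the stored pattern
```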

@barbosa @ShahabBakht @roydanroy @SussilloDavid I think the notions here come from the few-shot learning / meta-learning terminology (see the famous paper "Language Models are Few-Shot Learners"). There is a qualitative difference in persistence, but still consider this situation: you teach your friend how to play a new game with examples, yet two weeks later they have forgotten. What would you call that?

@lowrank_adrian @ShahabBakht @roydanroy @SussilloDavid Yeah, those notions are also misleading. Using terms from neuroscience/psych loosely is a common sin in ML ;)

Your example tells us that my friend forgot something that they learned (assuming you properly checked back then). Forgetting (and learning!) occur at different timescales; that's not an issue here.

@barbosa What about an RNN that, within a single sequence, is presented a few examples of a task and then performs the same task? Would that be termed learning? What is "learning" in the first place?

@lowrank_adrian This example is precisely the LLM example we've been discussing, so no.

The definition of learning being used loosely here was defined for animals (e.g. "learning by examples"), but this is misleading in this case because the examples/input are still on. Nobody called pattern completion by Hopfield networks learning, yet that is likely what is happening here. IMO a definition of learning that doesn't include some structural change leads to confusion. Happy to be convinced otherwise!

@lowrank_adrian Actually, even if the examples were not on but stored in the dynamics, it would still not be learning: the attractor used for storage would need to have been learned previously.
@barbosa @lowrank_adrian @ShahabBakht Fascinating conversation. I tend to agree with @barbosa. IMHO an agent learns a task/concept/phenomenon at a certain level of generality. If it is general enough, then you can prime the agent using contextualized examples to perform the task / interpret the concept / report the phenomenon with a specific flavor. If the agent is not trained generally enough, no amount of priming can elicit the desired outputs. 🤔
@adel @barbosa @ShahabBakht I see your point to both of you! Maybe it could be termed "few-shot understanding" then? ;)
@lowrank_adrian @adel @ShahabBakht Understanding?! How about few-shot intelligence? Or few-shot sentience?
@barbosa @adel @ShahabBakht I don't think I see the relationship with sentience, but "understanding" could cover the fact of generating novel behavior that solves a never-before-seen task without learning new connectivity.
@lowrank_adrian @adel @ShahabBakht Sorry, I thought you were trolling and trolled back. IMO few-shot understanding is even more controversial than few-shot learning. These models can definitely learn a lot of stuff; it's unclear to me how much they understand.
@barbosa @adel @ShahabBakht Oh yeah I see, I'm misusing a lot of terms 😄 What I meant is the claim in ML literature is that LLMs are able to generate novel behavior from prompts, and this is something humans can also do. What is the right concept to capture this?
@lowrank_adrian @adel @ShahabBakht I am not sure. If you really want to make the parallel with humans, which I discourage, maybe "priming"?

@barbosa @lowrank_adrian @adel @ShahabBakht

I would call it something like "few-shot contextual inference". In the brain, there would be a higher-order region which infers context from the prompt. It then modulates the LLM "top-down" to do the right thing. No "learning" is required here; the two networks are just doing their jobs and helping each other. The higher-order region uses working memory, so no learning occurs (it's reversible). If the higher-order region appeals to long-term memory, like episodic memory, then learning occurs.

@barbosa @lowrank_adrian @adel @ShahabBakht @roydanroy Episodic memory would only be very sparsely implicated, gated by novelty and salience signals.

I think adding an episodic memory to LLMs is a very interesting research direction. Does anyone know of any recent papers in this direction?

@NeuralEnsemble @[email protected] Not strictly episodic memory, but retrieval-augmented LLMs are a pretty promising direction, I think, and quite similar to this idea (see e.g. https://arxiv.org/pdf/2112.04426.pdf). Other ideas were developed for RNNs, but I haven't seen them adapted to transformers yet, e.g. https://arxiv.org/abs/1608.00318. I like the "contextual inference" lingo; it seems appropriate, and I suppose Shahab could agree with that terminology as well :)
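The retrieval-augmentation idea can be sketched in a few lines (this is a toy illustration in the spirit of the RETRO paper linked above, not its actual architecture; the memory snippets and the character-count embedding are stand-ins for a real datastore and a real encoder): embed the query, fetch the nearest stored snippets, and prepend them to the prompt.

```python
import numpy as np

# Toy external memory standing in for a large retrieval datastore.
memory = [
    "The hippocampus supports episodic memory.",
    "Transformers use self-attention over tokens.",
    "Hopfield networks perform pattern completion.",
]

rng = np.random.default_rng(0)
proj = rng.normal(size=(256, 16))  # fixed random projection (toy "encoder")

def embed(text):
    """Toy embedding: bag-of-characters pushed through a random projection."""
    counts = np.zeros(256)
    for ch in text.lower():
        counts[ord(ch) % 256] += 1
    v = counts @ proj
    return v / np.linalg.norm(v)

def retrieve_and_prompt(query, k=1):
    """Return the query prefixed with its k most similar memory snippets."""
    q = embed(query)
    sims = [float(q @ embed(m)) for m in memory]
    top = sorted(range(len(memory)), key=lambda i: -sims[i])[:k]
    context = "\n".join(memory[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {query}"

print(retrieve_and_prompt("How do transformers attend to tokens?"))
```

A real system would replace the toy embedding with a learned encoder and the list with an approximate-nearest-neighbor index over billions of snippets, but the retrieve-then-condition loop is the same.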