In-context learning in #transformers is one of those mysterious #ML phenomena that needs more attention (no pun intended) from #neuroscientists.

In-context learning is a phenomenon in large language models where the model "learns" a task just by observing some input-output examples, without updating any parameters.
"Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question-answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning became a popular topic of study and discussion." (https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
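To make the idea concrete, here is an illustrative (hypothetical) few-shot prompt: the model is never told the task is English-to-French translation, yet it infers it purely from the examples in context, with no parameter updates.

```python
# A hypothetical few-shot prompt. The task (English -> French translation)
# is never stated; the model is expected to infer it from the
# input-output examples and complete the last line.
prompt = (
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => "
)
print(prompt)
```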

Interestingly, two recent works (H/T @roydanroy) showed that in-context learning (at least under certain conditions) matches the solutions found by gradient descent:
1) Transformers learn in-context by gradient descent: https://arxiv.org/abs/2212.07677
2) What learning algorithm is in-context learning? Investigations with linear models: https://arxiv.org/abs/2211.15661
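The core correspondence, at least in the linear-regression setting these papers study, can be sketched in a few lines: with keys and values set to the in-context inputs and targets, an unnormalized linear self-attention readout on the query equals the prediction after one gradient-descent step from zero weights. A toy numpy sketch (the learning rate and dimensions are illustrative, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))      # in-context inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context targets
x_q = rng.normal(size=d)         # query input

eta = 0.1
# One gradient-descent step from w = 0 on the loss 0.5 * sum((X w - y)^2):
# the gradient at w = 0 is -X^T y, so w becomes eta * X^T y.
w_gd = eta * X.T @ y
pred_gd = w_gd @ x_q

# Linear self-attention (no softmax): the query attends to the context
# tokens with keys = X and values = y.
pred_attn = eta * np.sum((X @ x_q) * y)

# The two predictions coincide exactly.
assert np.allclose(pred_gd, pred_attn)
```

The equality is just associativity, eta * (X^T y) . x_q = eta * sum_i (x_i . x_q) y_i, but it illustrates how an attention layer can implement a learning-algorithm step without any weight update.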

In #neuroscience, synaptic plasticity is generally thought to be the mechanism underlying many of the behavioral improvements that are loosely referred to as learning.

Could in-context #learning be an alternative mechanism underlying at least some of these behavioral improvements? Given the suggested similarities between #hippocampus representation learning and transformers (https://arxiv.org/abs/2112.04035), it would be interesting to explore the implications of in-context learning for our understanding of #memory formation in the hippocampus. #NeuroAI

@ShahabBakht @roydanroy

Shahab, naive question, how does it work (very roughly) if no parameters are updated?

@PessoaBrain @ShahabBakht @roydanroy

My guess is that transformer models change their attention based on the context, so the "attention weights" take over the role of the "synaptic weights", and the depth of those architectures allows this sort of complex learning on the fly. I'd be curious how it works for RNNs. I suspect the state of the RNN plays a similar role in storing constraints, but that looks harder to reverse engineer.
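That intuition can be sketched directly: with the projection matrices held fixed (the "synaptic weights"), the attention distribution still changes whenever the context changes. A minimal numpy sketch with made-up dimensions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
d = 4
Wq = rng.normal(size=(d, d))  # fixed ("synaptic") query projection
Wk = rng.normal(size=(d, d))  # fixed ("synaptic") key projection

query = rng.normal(size=d)
context_a = rng.normal(size=(3, d))  # one set of context tokens
context_b = rng.normal(size=(3, d))  # a different context

# Attention weights over the context, computed on the fly
attn_a = softmax((context_a @ Wk) @ (Wq @ query))
attn_b = softmax((context_b @ Wk) @ (Wq @ query))

# Same fixed parameters, different contexts -> different attention weights.
assert not np.allclose(attn_a, attn_b)
```

Nothing in Wq or Wk was updated; all the adaptation lives in the context-dependent attention pattern.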

@introspection @PessoaBrain @roydanroy

I’m also curious to see if the same behavior emerges in RNNs. Scale might be an important factor though.

@ShahabBakht @introspection @PessoaBrain @roydanroy I'm noodling on some ideas related to this, and it's way less mysterious to me now. I don't think any real learning is taking place; it's more like retrieval of an existing template. Hopefully I'll have a write-up on this soon.