Here's something to kick things off over here: in a new paper, we found that GPT-3 matches or exceeds human performance on zero-shot analogical reasoning, including on a text-based version of Raven's Progressive Matrices.

https://arxiv.org/abs/2212.09196v1

Emergent Analogical Reasoning in Large Language Models

The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here, we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of GPT-3) on a range of analogical tasks, including a novel text-based matrix reasoning task closely modeled on Raven's Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.

Analogical reasoning is often viewed as the quintessential example of the human capacity for abstraction and generalization, allowing us to approach novel problems *zero-shot*, by comparing them to more familiar situations.
Given the recent debates surrounding the reasoning abilities of LLMs, we wondered whether they might be capable of this kind of zero-shot analogical reasoning, and how their performance would stack up against human participants.
We focused primarily on Raven's Progressive Matrices (RPM), a popular visual analogy problem set often viewed as one of the best measures of zero-shot reasoning ability (i.e., fluid intelligence).
We created a text-based benchmark -- Digit Matrices -- closely modeled on RPM, and evaluated both GPT-3 and human participants.
GPT-3 outperformed human participants both when generating answers from scratch, and when selecting from a set of answer choices. Note that this is without *any* training on this task.
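For a rough feel of the format, here's a toy sketch of the kind of problem a text-based matrix task can pose. To be clear, this is an illustrative reconstruction, not our actual problem generator -- the function names, prompt formatting, and rule here are made up for exposition:

```python
# Toy illustration of a text-based matrix problem in the spirit of
# Digit Matrices: each row applies the same progression rule, and the
# final cell is left blank for the model to complete zero-shot.

def make_progression_matrix(starts, step):
    """Build a 3x3 matrix where each row increases left-to-right by `step`."""
    return [[s + step * col for col in range(3)] for s in starts]

def format_prompt(matrix):
    """Render the matrix as text, with the last cell left blank."""
    rows = [" ".join(f"[{x}]" for x in row) for row in matrix]
    last = matrix[2]
    rows[2] = f"[{last[0]}] [{last[1]}] ["  # blank final cell
    return "\n".join(rows)

matrix = make_progression_matrix(starts=[1, 4, 7], step=1)
print(format_prompt(matrix))
# [1] [2] [3]
# [4] [5] [6]
# [7] [8] [
answer = matrix[2][2]  # ground-truth completion
print("answer:", answer)  # answer: 9
```

In the generative setting the model produces the completion itself; in the multiple-choice setting it picks among candidate answers.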
We also found that the pattern of human performance on this new task was very close to the pattern seen for standard (visual) RPM problems, suggesting that this task is tapping into similar processes.
GPT-3 also displayed several qualitative effects that were consistent with known features of human analogical reasoning. For instance, it had an easier time solving logic problems when the corresponding elements were spatially aligned.

Finally, we also tested GPT-3 on letter string analogies. @melaniemitchell previously found that GPT-3 performed very poorly on these problems:

https://medium.com/@melaniemitchell.me/can-gpt-3-make-analogies-16436605c446

but it seems that the newest iteration of GPT-3 performs much better.
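For context, these problems follow Hofstadter's Copycat format. A minimal sketch of one such problem (the prompt wording below is illustrative, not the exact prompt used in either study):

```python
# Illustrative letter-string analogy (Copycat-style): the model must
# infer a transformation from a source pair and apply it to a target.

def successor_transform(s):
    """Apply the 'increment the last letter' rule, e.g. 'abc' -> 'abd'."""
    return s[:-1] + chr(ord(s[-1]) + 1)

def make_prompt(source, target):
    return (f"If {source} changes to {successor_transform(source)}, "
            f"what does {target} change to?")

print(make_prompt("abc", "ijk"))
# If abc changes to abd, what does ijk change to?
print(successor_transform("ijk"))  # ijl  (the analogical answer)
```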

Can GPT-3 Make Analogies? - Melanie Mitchell - Medium

Overall, we were shocked that GPT-3 performs so well on these tasks. The question now of course is whether it's solving them in anything like the way that humans do. Does GPT-3 implement, in an emergent way, any of the features posited by cognitive theories, e.g. relational representations, variable-binding, analogical mapping, etc., or has it discovered a completely novel way of performing analogical reasoning? (or are these cognitive theories wrong?) Lots to investigate.
A few caveats are also necessary -- GPT-3's reasoning is of course not human-like in every respect: no episodic memory, poor physical reasoning skills, and, most notably, it has received *far* more training than humans do (though not on these tasks).
Nevertheless, the overall conclusion is that GPT-3 does appear to possess the core features that we associate with analogical reasoning -- the ability to identify complex relational patterns, zero-shot, in novel problems.
@taylorwwebb @achterbrain Thought this might pique your interest.
@adel @taylorwwebb This is exactly what I was looking for! Thanks so much for tagging me here!

@taylorwwebb Did you ever test how closely performance with the text-based presentation matches performance with the image-based presentation in humans?
In humans, the data show that the presentation format can matter quite a lot (e.g., see https://www.pnas.org/doi/10.1073/pnas.1621147114), and I wonder whether the number representation already (partially) solves the compositionality problem for the model?

I like your discussion about the potential for a holistic solution in LLMs - will need to think more about this!

@achterbrain @adel yes we found that human error rates on our digit matrix problems were extremely similar to error rates on the standard visual RPM problems (figure 4 of the paper). I like this study from Duncan et al., but I think it’s not entirely conclusive about the key source of difficulty in RPM. In particular, the separated problems remove two sources of difficulty - object segmentation and correspondence finding. I believe the latter is more important.
@achterbrain @adel by ‘correspondence finding’, I mean the process of determining which elements go together to form a sub-problem. Our task doesn’t require object segmentation, but it arguably does still require correspondence finding.
@taylorwwebb @adel Right, I see! That is a very interesting additional perspective, thank you for elaborating. I will go through your writing in more detail next week, as I find this very intriguing.
If there is anything related that came out after you put your work online and has hence not been referenced, I would be very interested to hear about it - though I reckon that's unlikely since it is all very recent!
@achterbrain @adel nothing more for now, will probably post an update to the paper soon as we’ve also run a bunch of additional tests, will post here when that’s ready.
@achterbrain @taylorwwebb Taylor, Jon and others on their team are doing excellent inquiries and examinations along these lines. Glad it was helpful. :)