My co-lead Carina Kauf and I present: an in-depth investigation of event knowledge in language models.

Using a controlled minimal pairs paradigm, we find that large language models (LLMs) know that “The teacher bought the laptop” is more likely than “The laptop bought the teacher” (an impossible event), but perform below humans on sentences like “The nanny tutored the boy” vs. “The boy tutored the nanny” (a possible but unlikely event).

A 🧵 1/

https://arxiv.org/abs/2212.01488

Event knowledge in large language models: the gap between the impossible and the unlikely

Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.

Knowledge of event schemas is a vital component of world knowledge. How much of it can be acquired from text corpora via the word-in-context prediction objective?

We test this question using simple event descriptions. Our main plausibility manipulation is swapping the agent and the patient of an event (The teacher bought the laptop / The laptop bought the teacher); see the minimal scoring sketch below.

2/
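
To make the paradigm concrete, here is a minimal scoring sketch, assuming a GPT-2-style causal LM from HuggingFace transformers (the model choice and helper name are illustrative assumptions, not our exact pipeline):

```python
# Minimal-pair scoring sketch: compare summed log-probabilities of the two
# sentence variants under a causal LM. Illustrative, not the paper's code.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Summed log-probability of a sentence under the model."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the loss is the mean cross-entropy over the
        # predicted tokens; multiply back to recover a summed log-prob.
        loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

print(sentence_logprob("The teacher bought the laptop."))  # expected: higher
print(sentence_logprob("The laptop bought the teacher."))  # expected: lower
```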

LLMs are near-perfect at assigning higher likelihood to possible vs. impossible events, but are much less consistent for likely vs. unlikely events.

(our baseline language models also show this effect)

3/
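
For reference, an accuracy figure of this kind can be computed as the fraction of pairs scored in the right order, reusing the sentence_logprob helper from the sketch above (an illustrative recipe, not our exact evaluation code):

```python
def pairwise_accuracy(pairs):
    """Fraction of (plausible, implausible) pairs where the model assigns
    the plausible sentence a higher summed log-probability."""
    correct = sum(sentence_logprob(p) > sentence_logprob(i) for p, i in pairs)
    return correct / len(pairs)

possible_vs_impossible = [("The teacher bought the laptop.",
                           "The laptop bought the teacher.")]
likely_vs_unlikely = [("The nanny tutored the boy.",
                       "The boy tutored the nanny.")]
print(pairwise_accuracy(possible_vs_impossible))  # near-ceiling in our data
print(pairwise_accuracy(likely_vs_unlikely))      # noticeably lower
```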

In follow-up tests, we show that
- LLM scores depend on both plausibility and surface-level factors like word frequency, which means the score distributions for plausible and implausible sentences overlap heavily (illustrative sketch below)

4/
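
One illustrative way to check the surface-level contribution, assuming the wordfreq package and again reusing sentence_logprob from tweet 2's sketch (this is an assumed analysis sketch, not the paper's exact regression):

```python
# Correlate LLM sentence scores with a surface feature (mean Zipf word
# frequency). A real analysis runs over the full stimulus set, not this toy list.
from statistics import mean
from scipy.stats import pearsonr
from wordfreq import zipf_frequency

def mean_zipf(sentence: str) -> float:
    """Average Zipf (log-scale) frequency of a sentence's words."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return mean(zipf_frequency(w, "en") for w in words)

sentences = [
    "The teacher bought the laptop.", "The laptop bought the teacher.",
    "The nanny tutored the boy.", "The boy tutored the nanny.",
]
scores = [sentence_logprob(s) for s in sentences]
r, p = pearsonr(scores, [mean_zipf(s) for s in sentences])
print(f"r={r:.2f}, p={p:.3f}")
```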

- LLMs generalize very well between active and passive versions of the same sentence BUT fall short of humans on synonymous sentences (The teacher bought the laptop / The instructor purchased the computer); sketch below.

5/
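
A sketch of this generalization check, again reusing sentence_logprob from above (the passive rewrites here are mechanical illustrations, not our curated stimuli):

```python
# Does the preference for the plausible variant transfer across syntactic
# (passive) and lexical (synonym) versions of the same event?
variants = {
    "active":  ("The teacher bought the laptop.",
                "The laptop bought the teacher."),
    "passive": ("The laptop was bought by the teacher.",
                "The teacher was bought by the laptop."),
    "synonym": ("The instructor purchased the computer.",
                "The computer purchased the instructor."),
}
for name, (plausible, implausible) in variants.items():
    ok = sentence_logprob(plausible) > sentence_logprob(implausible)
    print(f"{name}: prefers plausible = {ok}")
```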

- explicit plausibility information emerges in the middle LLM layers, and probe accuracy stays high in later layers
- implausibility signatures generalize poorly between animate-inanimate (impossible) events and animate-animate (unlikely) events
- a probe trained on both active and passive voice sentences is as successful as a within-voice probe (but a probe trained on only one voice type fails to generalize); see the probing sketch below

6/
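
Here is a minimal sketch of one standard probing recipe consistent with these analyses (logistic regression on mean-pooled hidden states; the exact probe setup is an assumption, and the real analysis uses the full stimulus set): train on active-voice items, test on passive ones.

```python
# Layer-wise plausibility probe sketch: logistic regression on mean-pooled
# GPT-2 hidden states, trained on active voice and tested on passive voice.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2Model, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_reps(sentences, layer):
    """Mean-pooled hidden state of a given layer for each sentence."""
    reps = []
    for s in sentences:
        ids = tok(s, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = enc(input_ids=ids).hidden_states[layer]  # (1, seq_len, dim)
        reps.append(hs.mean(dim=1).squeeze(0).numpy())
    return np.stack(reps)

# Toy stand-ins for the curated stimuli; labels: 1 = plausible, 0 = implausible.
train_sents = ["The teacher bought the laptop.", "The laptop bought the teacher."]
test_sents = ["The laptop was bought by the teacher.",
              "The teacher was bought by the laptop."]
y_train, y_test = [1, 0], [1, 0]

layer = 6  # probe one middle layer here; sweep all layers in practice
probe = LogisticRegression(max_iter=1000).fit(layer_reps(train_sents, layer), y_train)
print(probe.score(layer_reps(test_sents, layer), y_test))
```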

Check out the paper for an interpretation of these results, including a discussion of selectional restrictions, reporting bias, and more!

#LLMs #languagemodels #NLP #eventknowledge #commonsense #interpretability #languageandthought

(here's the paper link again) https://arxiv.org/abs/2212.01488

7/

This is an international collaboration brought together by @ev_fedorenko and Alessandro Lenci, with vital contributions from @grambelli and Emmanuele Chersoni (and our undergrads Selena She & Zawad Chowdhury).

It's been a crazy run, with Zoom calls during the lockdowns of 2020 and meetings coordinated across Boston, Italy, Hong Kong, and sometimes Germany and Russia. Glad the project has finally come to fruition!

8/

Bonus: a preliminary exploration of #ChatGPT responses shows that it might also have an impossible-unlikely gap (although a more detailed investigation is of course needed).

9/end

@neuranna In a different setting, it seems that when presented with the impossible agent/theme variant, #ChatGPT proceeds as if it had been given the inverse (possible) variant.