Happy to share our new paper “Language model acceptability judgements are not always robust to context” https://arxiv.org/abs/2212.08979! We prepend several kinds of context to minimal linguistic #acceptability test pairs and find #LMs (#OPT, #GPT2) can still achieve strong performance on #BLiMP & #SyntaxGym, except in some interesting cases. 🧵 [1/7]

Joint work with @jon , @kanishka, @amuuueller, @keren fuentes, @roger_p_levy, @Adinawilliams

Language model acceptability judgements are not always robust to context

Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.

We investigate the stability of LLMs’ performance on targeted syntactic evaluations as we vary properties of the input context: a) the length of the context, b) the types of syntactic phenomena it contains, and c) whether or not there are violations of grammaticality, to better match the evaluation setting to models’ real-world sentence-processing objectives. [2/7]
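The evaluation protocol behind the thread can be sketched in a few lines: score both members of a minimal pair with the same context prepended, and count a pair as correct when the acceptable sentence gets the higher score. This is a minimal sketch, not the paper's code — a real run would compute log P(sentence | context) with GPT-2/OPT via a library like Hugging Face transformers; here `make_unigram_scorer` is a toy stand-in language model so the control flow is runnable without model weights.

```python
import math
from collections import Counter

def make_unigram_scorer(corpus_tokens):
    """Toy add-one-smoothed unigram LM, standing in for GPT-2/OPT scoring."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen tokens
    def score(tokens):
        # log-probability of the token sequence under the unigram model
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return score

def judge(score, context, acceptable, unacceptable):
    # A pair counts as correct when the acceptable sentence outscores the
    # unacceptable one, with the same context prepended to both.
    return score(context + acceptable) > score(context + unacceptable)

def accuracy(score, context, pairs):
    return sum(judge(score, context, a, u) for a, u in pairs) / len(pairs)

# Tiny illustrative setup (hypothetical data, not from BLiMP/SyntaxGym):
score = make_unigram_scorer("the cat sleeps the dog sleeps".split())
```

Varying `context` — its length, whether it shares syntactic structure with the test pair, and whether it is grammatical — is exactly the manipulation the thread describes.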
We notice that LLMs don’t get thrown off when we prepend unrelated, irrelevant contexts drawn from Wikipedia. And as we increase the length of the prepended Wikipedia context, acceptability judgements on BLiMP and SyntaxGym stay relatively stable. [3/7]
However, when we prepend the acceptable/grammatical sentences from minimal pairs, we observe an effect reminiscent of structural priming (https://twitter.com/JumeletJ/status/1567526972074360832?s=20&t=w8EhO1jaTUOillMMWdr1fg): acceptability performance improves significantly with longer contexts! This effect holds across all tested models and is even more pronounced when the prepended sentences are drawn from the same domain (i.e., the same test suite). [4/7]
Conversely, acceptability judgements drastically worsen when the model is exposed to unacceptable/ungrammatical contexts. We term this “anti-priming,” and find it to be even stronger than the priming effect. This degradation, too, is much sharper for contexts drawn from the same domain. [5/7]
We analyzed various similarity features and conclude that these improvements and degradations in model performance cannot be explained by lexical or syntactic overlap features alone. This points to an intrinsic, instruction-free in-context learning ability of LLMs that likely shapes their acceptability judgements. [6/7]
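To make the control analysis concrete, here is a minimal sketch of one "simple feature" of the kind ruled out above: lexical overlap between the prepended context and the test sentence, computed as Jaccard similarity over word types. The function name and tokenization are illustrative assumptions, not the paper's actual feature implementation.

```python
def lexical_overlap(context: str, sentence: str) -> float:
    """Jaccard overlap between the word types of the context and the test
    sentence — an example of a shallow similarity feature that, per the
    thread, does not explain the priming/anti-priming effects."""
    a = set(context.lower().split())
    b = set(sentence.lower().split())
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)
```

Features like this vary smoothly with surface similarity, whereas the observed effects track whether the context contains the *same syntactic structure* as the test pair — which is why shallow overlap cannot account for them.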
To summarize: more acceptable sentences in the context boost accuracy, more unacceptable sentences make it drop, and unrelated context leaves performance largely unchanged. Our results reveal new aspects of in-context learning: models’ acceptability judgements are affected by granular properties of the context in which they are made (test-suite domain, acceptability, length). [7/7]