Najoung Kim 🍪

215 Followers
105 Following
46 Posts

Faculty Fellow at NYU CDS and Asst. Prof at BU Linguistics soon. Trying to get machines and my cat to learn language 🤖🔠🐈

https://najoungkim.github.io

Excited to be involved in organizing Blackbox next year with Sophie Hao, @jaapjumelet, @hmohebbi, @arya and @boknilev!

🔮 #BlackboxNLP will be back in 2023 at #EMNLP2023! ❄ We will keep updates posted on our website: https://blackboxnlp.github.io

While you wait, also check out our YouTube channel: https://youtube.com/@blackboxnlp

BlackboxNLP 2023

Workshop on analyzing and interpreting neural networks for NLP

🦷 Another preprint 🦷
Information-seeking Qs often contain questionable assumptions that models should be robust to. "When did Marie Curie discover Uranium?" is an example. We propose (QA)^2, a test set evaluating the capacity to handle such Qs. (1/n)

Anecdotally, even strong models like text-davinci-003 seem to fall into the trap of questionable assumptions (or false presuppositions), but few systematic evaluations have been conducted. (2/n)

(QA)^2 (Question Answering with Questionable Assumptions) is a dataset of naturally occurring questions posed to a search engine. It contains both questions that contain a questionable assumption and questions that can be answered as valid information-seeking questions. (3/n)

So, a model must be able to handle both kinds of questions to perform well on our evaluation. (4/n)

We used three evaluation tasks: end-to-end QA (abstractive), Questionable Assumption Detection (binary classification), and Assumption Verification (binary classification). End-to-end QA quality was measured by crowdsourced human acceptability judgments. (5/n)

https://arxiv.org/abs/2212.10003

(QA)$^2$: Question Answering with Questionable Assumptions

Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers for information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical "when" question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)$^2$ (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)$^2$, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. Through human rater acceptability on end-to-end QA with (QA)$^2$, we find that current models do struggle with handling questionable assumptions, leaving substantial headroom for progress.
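To make the evaluation setup concrete, here is a minimal sketch of the assumption-verification style of task described above. The field names, the toy verifier, and the example data are illustrative assumptions, not the actual (QA)^2 schema or any real baseline system:

```python
# Hypothetical sketch of a binary assumption-verification evaluation.
# A real system would check each assumption against retrieved evidence;
# here a hard-coded set of known-false assumptions stands in, purely
# for illustration of the task format.

def verify_assumption(assumption: str) -> bool:
    """Return True if the assumption is judged valid, False if questionable."""
    known_false = {"Marie Curie discovered Uranium"}
    return assumption not in known_false

# Toy examples in an assumed format: each pairs a question with the
# assumption it carries and a gold validity label.
examples = [
    {"question": "When did Marie Curie discover Uranium?",
     "assumption": "Marie Curie discovered Uranium",
     "label": False},   # questionable assumption
    {"question": "When did Marie Curie discover radium?",
     "assumption": "Marie Curie discovered radium",
     "label": True},    # valid information-seeking question
]

# Score the verifier: fraction of examples where its judgment matches the label.
correct = sum(verify_assumption(ex["assumption"]) == ex["label"]
              for ex in examples)
accuracy = correct / len(examples)
print(f"Assumption-verification accuracy: {accuracy:.2f}")
```

The end-to-end QA setting is harder than this binary sketch: the system must also produce a free-form answer that addresses the questionable assumption, which is why the paper scores it with human acceptability judgments rather than exact-match metrics.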

BUT NOTE: although we showed overestimation of generalization capacity in pretrained models, the argument we make is one of principle. That is, even if we had obtained exactly the same numbers, *their interpretation would differ if adequate control measures were not implemented*. (12/n)
Here is the link to the paper with other interesting results/discussion! An actual arXiv version will appear tomorrow :) Excited to hear your thoughts; I'm more likely to see and respond to them if emailed. (13/n) https://najoungkim.github.io/assets/files/Kim_Linzen_Smolensky_uncontrolled_lexical_exposure.pdf
Work with Tal Linzen and Paul Smolensky 🐈 (14/14!)