🦷 Another preprint 🦷
Information-seeking Qs often contain questionable assumptions that models should be robust to. "When did Marie Curie discover Uranium?" is an example. We propose (QA)^2, a test set evaluating the capacity to handle such Qs. (1/n)

https://arxiv.org/abs/2212.10003

(QA)^2: Question Answering with Questionable Assumptions

Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers for information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical "when" question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)$^2$ (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)$^2$, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. Through human rater acceptability on end-to-end QA with (QA)$^2$, we find that current models do struggle with handling questionable assumptions, leaving substantial headroom for progress.

Anecdotally, even strong models like text-davinci-003 seem to fall into the trap of questionable assumptions (or false presuppositions), but few systematic evaluations have been conducted. (2/n)
(QA)^2 (Question Answering with Questionable Assumptions) is a dataset of naturally occurring questions posed to a search engine. It includes both questions that contain a questionable assumption and questions that can be answered as ordinary information-seeking questions. (3/n)
So, a model must be able to handle both kinds of questions to perform well on our evaluation. (4/n)
We used three evaluation tasks: end-to-end QA (abstractive), Questionable Assumption Detection (binary classification), and Assumption Verification (binary classification). End-to-end QA quality was measured by crowdsourced human acceptability. (5/n)
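To make the three task formats concrete, here is a minimal sketch of what instances for each task might look like. The field names, schema, and the specific acceptable answer are illustrative assumptions, not the dataset's actual format:

```python
# Hypothetical instances for the three (QA)^2 evaluation tasks.
# Field names and answer wording are illustrative, not the real schema.

# 1) End-to-end QA (abstractive): an acceptable answer must address
#    the false assumption rather than answer the "when" question directly.
end_to_end = {
    "question": "When did Marie Curie discover Uranium?",
    "acceptable_answer": (
        "Marie Curie did not discover uranium; it was discovered by "
        "Martin Heinrich Klaproth in 1789."
    ),
}

# 2) Questionable Assumption Detection: binary classification over questions.
detection = {
    "question": "When did Marie Curie discover Uranium?",
    "label": True,  # True = the question contains a questionable assumption
}

# 3) Assumption Verification: binary classification over an already-extracted
#    assumption (the detection step is handled by an oracle).
verification = {
    "assumption": "Marie Curie discovered Uranium",
    "label": False,  # False = the assumption does not hold
}

for instance in (end_to_end, detection, verification):
    print(instance)
```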
We evaluated a range of QA-specific models (Macaw, REALM) and general-purpose language models (T0, Flan-T5, davinci, text-davinci). We also experimented with various setups including zero-shot, in-context, task decomposition prompting, and few-shot tuning where applicable. (6/n)
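As a rough illustration of the in-context setup, a prompt can be assembled from a handful of demonstrations (one ordinary question, one with a false presupposition) followed by the target question. The demonstrations and template below are assumptions for the sketch, not the paper's actual prompt:

```python
# Sketch of building an in-context prompt for QA with questionable
# assumptions. Demonstrations and template are hypothetical.

demos = [
    ("Who wrote the novel 1984?",
     "George Orwell."),
    ("When did Einstein invent the lightbulb?",
     "Einstein did not invent the lightbulb; Thomas Edison is commonly "
     "credited with its practical development."),
]

def build_prompt(question: str) -> str:
    """Concatenate demonstrations and the target question into one prompt."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("When did Marie Curie discover Uranium?"))
```

The second demonstration is what signals to the model that questions with false assumptions should be corrected rather than answered literally.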
We found that the best-performing model (text-davinci-003 with in-context demonstrations) reached only 59% human-judged acceptability. Questionable Assumption Detection was also difficult, with the best model at 57% accuracy (text-davinci-003 with task decomposition prompting). (7/n)
Verification was slightly less difficult but still challenging, with zero-shot text-davinci-003 at 68% classification accuracy. We expect verification to be an easier task because the first step, assumption detection, is performed by an oracle. (8/n)
Overall, our results support the conclusion that information-seeking questions with questionable assumptions still pose substantial challenges to current QA systems, leaving headroom for progress. (9/n)
Finally, we intend (QA)^2 to be an evaluation-only dataset, so we will not provide a large training set. The evaluation set will come paired with a small (n=32) adaptation set for in-context demonstrations or few-shot tuning. (10/n)
We're conducting a final cleanup of the dataset, but watch this space for a release shortly! (11/n)
It was really nice working with Phu Mon Htut, Sam Bowman, and @jowenpetty! 🌚

Also check out concurrent work on a similar topic by Velocity Yu et al.:

https://arxiv.org/abs/2211.17257

CREPE: Open-Domain Question Answering with False Presuppositions

Information seeking users often pose questions with false presuppositions, especially when asking about unfamiliar topics. Most existing question answering (QA) datasets, in contrast, assume all questions have well defined answers. We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections. Through extensive baseline experiments, we show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct. This is in large part due to difficulty in retrieving relevant evidence passages from a large text corpus. CREPE provides a benchmark to study question answering in the wild, and our analyses provide avenues for future work in better modeling and further studying the task.
