Najoung Kim 🍪

Faculty Fellow at NYU CDS and Asst. Prof at BU Linguistics soon. Trying to get machines and my cat to learn language 🤖🔠🐈

https://najoungkim.github.io

Excited to be involved in organizing BlackboxNLP next year with Sophie Hao, @jaapjumelet, @hmohebbi, @arya and @boknilev!

🔮 #BlackboxNLP will be back in 2023 at #EMNLP2023! ❄ We will keep updates posted on our website: https://blackboxnlp.github.io

While you wait, also check out our YouTube channel: https://youtube.com/@blackboxnlp

BlackboxNLP 2023

Workshop on analyzing and interpreting neural networks for NLP

@jowenpetty that's what a PhD is for!

Also check out concurrent work on a similar topic by Velocity Yu et al.:

https://arxiv.org/abs/2211.17257

CREPE: Open-Domain Question Answering with False Presuppositions

Information-seeking users often pose questions with false presuppositions, especially when asking about unfamiliar topics. Most existing question answering (QA) datasets, in contrast, assume all questions have well-defined answers. We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections. Through extensive baseline experiments, we show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct. This is in large part due to the difficulty of retrieving relevant evidence passages from a large text corpus. CREPE provides a benchmark for studying question answering in the wild, and our analyses provide avenues for future work in better modeling and further studying the task.

It was really nice working with Phu Mon Htut, Sam Bowman, and @jowenpetty! 🌚
We're conducting a final cleanup of the dataset, but watch this space for a release shortly! (11/n)
Finally, we intend (QA)^2 to be an evaluation-only dataset, so we do not plan to provide a large training set. The evaluation set will come paired with a small (n=32) adaptation set for in-context demonstrations or few-shot tuning. (10/n)
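A minimal sketch (in Python) of how such a small adaptation set could be used to assemble few-shot prompts. The file name and the "question" / "assumption" / "verdict" fields are hypothetical placeholders, not the actual (QA)^2 schema:

import json
import random

# Hypothetical file name and schema -- the actual (QA)^2 release may differ.
with open("qaqa_adaptation_set.jsonl") as f:
    adaptation_set = [json.loads(line) for line in f]

def build_prompt(test_question, k=4):
    """Assemble a k-shot prompt from the small (n=32) adaptation set."""
    demos = random.sample(adaptation_set, k)
    blocks = [
        f"Question: {d['question']}\n"
        f"Questionable assumption: {d['assumption']}\n"
        f"Verdict: {d['verdict']}"
        for d in demos
    ]
    # Append the unlabeled test question for the model to complete.
    blocks.append(f"Question: {test_question}\nQuestionable assumption:")
    return "\n\n".join(blocks)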
Overall, our results support the conclusion that information-seeking questions with questionable assumptions still pose substantial challenges to current QA systems, leaving headroom for progress. (9/n)
Verification was slightly less difficult but still challenging, with zero-shot text-davinci-003 at 68% classification accuracy. We expect verification to be an easier task because the first step, assumption detection, is performed by an oracle. (8/n)
We found that the best-performing model (text-davinci-003 with in-context demonstrations) reached 59% human-judged acceptability. Questionable assumption detection was also difficult, with the best model (text-davinci-003 with task decomposition prompting) at 57% accuracy. (7/n)
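For illustration, a rough sketch of what a task decomposition approach to questionable assumption detection could look like: first prompt the model to list the assumptions behind a question, then verify each one separately. The prompts and the llm stub below are hypothetical, not the prompts used in the paper:

def llm(prompt):
    """Stand-in for any text-completion call (e.g., a text-davinci-003 request)."""
    raise NotImplementedError("plug in a completion API here")

def find_questionable_assumption(question):
    # Step 1: decompose -- enumerate the assumptions the question makes.
    assumptions = llm(
        "List the factual assumptions behind this question, one per line.\n"
        f"Question: {question}\nAssumptions:"
    ).splitlines()

    # Step 2: verify each assumption independently.
    for a in assumptions:
        verdict = llm(
            "Is the following statement true or false?\n"
            f"Statement: {a.strip()}\nAnswer (true/false):"
        )
        if "false" in verdict.lower():
            return a.strip()  # first assumption judged false
    return None  # no questionable assumption found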