Elias Stengel-Eskin

102 Followers
110 Following
37 Posts

PhD candidate at #JHU #CLSP

#NLP, computational linguistics, grounded and embodied language. Former/current intern at #Microsoft Research; undergrad in cogsci at #McGill.

Pronouns: he/him

I am on the job market (faculty, postdoc, and industry)! I work on helping people communicate with machines, using language to improve AI agents, and examining how people communicate and reason. I'm looking for roles at the intersection of #NLP, #ComputationalLinguistics, and #AI.

I will be at #EMNLP2022: please reach out to chat IRL or digitally!

We found something surprising: across several models and 2 datasets, models were generally pretty well-calibrated.
We use that to create new challenge splits and we release a library for easily computing calibration metrics

So what else can we do with calibrated models?

4/12
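
Since the thread mentions a library for easily computing calibration metrics, here is a minimal sketch of one common such metric, expected calibration error; the binning scheme and names are my own illustrative assumptions, not necessarily what the released library does.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: bucket predictions by confidence, then average the
    gap between each bucket's mean confidence and its empirical accuracy,
    weighted by bucket size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A well-calibrated parser: predictions made with ~0.9 confidence are right ~90% of the time.
print(expected_calibration_error([0.92, 0.85, 0.95, 0.35], [1, 1, 1, 0]))
```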

Calibration is a superpower in semantic parsing 🦸
It means that you can predict how likely you are to be right BEFORE you execute. This matters especially in physical domains 🤖 where you might not be able to undo an action: you can't unbreak an egg or un-slice an apple

5/12
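
As a rough illustration of acting only when the model is confident enough (the threshold and function names below are placeholders I made up, not the paper's setup):

```python
def maybe_execute(program, confidence, execute_fn, threshold=0.7):
    """Gate execution on sequence-level confidence: run the predicted program
    only when the parser is confident, otherwise defer, since an action like
    slicing an apple can't be undone. `execute_fn` and 0.7 are illustrative."""
    if confidence >= threshold:
        return execute_fn(program)
    return None  # defer to the user instead of acting
```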

The first step in this paper was to look at how well-calibrated semantic parsing models really are. Does a model's confidence accurately represent how likely it is to be right about a prediction? Parsing is a good domain for this, since we know what it means to be right

3/12

For high-confidence inputs, DidYouMean executes as usual, but for low-confidence ones, it rephrases the input and asks the user to confirm. This is inspired by the way good lecturers often rephrase questions before answering them.
But I'm getting ahead of myself...

2/12
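
A sketch of that control flow, with hypothetical `parse`, `paraphrase`, `confirm`, and `execute` helpers standing in for the real components; this is not the actual DidYouMean implementation.

```python
def did_you_mean(utterance, parse, paraphrase, confirm, execute, threshold=0.7):
    """High-confidence parses are executed directly; low-confidence ones are
    rephrased and shown to the user for confirmation first. All helpers and
    the threshold here are illustrative stand-ins."""
    program, confidence = parse(utterance)
    if confidence >= threshold:
        return execute(program)
    rephrased = paraphrase(program)
    if confirm(f"Did you mean: {rephrased}?"):  # user sees the rephrased request
        return execute(program)
    return None  # user said no, so don't act on the low-confidence parse
```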

🎁 Early Holiday Preprint 🎄

Task-oriented semantic parsing is used in interactive systems, where calibration really matters. We find parsing models are surprisingly well-calibrated and use that to build DidYouMean, a system for confirming user intent

https://arxiv.org/abs/2211.07443

🧡 1/12

Calibrated Interpretation: Confidence Estimation in Semantic Parsing

Task-oriented semantic parsing is increasingly being used in user-facing applications, making measuring the calibration of parsing models especially important. We examine the calibration characteristics of six models across three model families on two common English semantic parsing datasets, finding that many models are reasonably well-calibrated and that there is a trade-off between calibration and performance. Based on confidence scores across three models, we propose and release new challenge splits of the two datasets we examine. We then illustrate the ways a calibrated model can be useful in balancing common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that using model confidence allows us to improve the accuracy on validation programs by 9.6% (absolute) with annotator interactions on only 2.2% of tokens. Using sequence-level confidence scores, we then examine how we can optimize the trade-off between a parser's usability and safety. We show that confidence-based thresholding can reduce the number of incorrect low-confidence programs executed by 76%; however, this comes at a cost to usability. We propose the DidYouMean system, which balances usability and safety. We conclude by calling for calibration to be included in the evaluation of semantic parsing systems, and release a library for computing calibration metrics.

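One way to picture the annotator-in-the-loop experiment described in the abstract: route only the least-confident predicted tokens to an annotator. This is a hedged sketch with invented names and a toy selection rule, not the paper's experimental code.

```python
def review_low_confidence_tokens(tokens, confidences, ask_annotator, budget=0.022):
    """Sketch: send roughly the least-confident `budget` fraction of predicted
    tokens to an annotator for confirmation or correction, keep the rest as-is.
    The 2.2% budget mirrors the abstract's figure; everything else is illustrative."""
    n_to_check = max(1, round(budget * len(tokens)))
    flagged = sorted(range(len(tokens)), key=lambda i: confidences[i])[:n_to_check]
    corrected = list(tokens)
    for i in flagged:
        corrected[i] = ask_annotator(i, tokens[i])  # annotator confirms or fixes the token
    return corrected
```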

🚨 My first Mastodon pre-print 🚨
Something I enjoyed about Twitter was people sharing papers. It helped me stay on top of research related to my own, but more importantly it let me get out of my bubble and read more broadly...
so I'll start posting threads here as well:

I'm excited to share some new work focusing on linguistic ambiguity 🤷‍♀️ in visual question answering:
https://arxiv.org/pdf/2211.07516.pdf

🧡
1/9

Why should we care about ambiguity in VQA? It's been pointed out (e.g. https://arxiv.org/abs/1908.04342) that VQA annotators often disagree, which can be a problem. Ambiguity is a ✨special✨ type of disagreement that's harder to resolve, and points to more linguistic insights 💡

2/9

Why Does a Visual Question Have Different Answers?

Visual question answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. To our knowledge, this is the first work that aims to understand why. We propose a taxonomy of nine plausible reasons, and create two labelled datasets consisting of ~45,000 visual questions indicating which reasons led to answer differences. We then propose a novel problem of predicting directly from a visual question which reasons will cause answer differences as well as a novel algorithm for this purpose. Experiments demonstrate the advantage of our approach over several related baselines on two diverse datasets. We publicly share the datasets and code at https://vizwiz.org.

Computer Science here at the University of Rochester has three tenure-track assistant professor positions available this year. One area we're specifically looking to hire in is AI/HCI, with a focus on NLP, core machine learning, neurosymbolic AI, or AR/VR. If this sounds like you, please consider applying! https://www.cs.rochester.edu/about/recruit.html

The University of Rochester seeks applicants for three tenure-track assistant professor positions in the Department of Computer Science.

In addition to the CS positions, we also have a tenure-track assistant professor position in our data science institute. We're particularly interested in recruiting someone with expertise in optimization theory and methods. https://www.sas.rochester.edu/dsc/about/jobs.html