Elias Stengel-Eskin

102 Followers
110 Following
37 Posts

PhD candidate at #JHU #CLSP

#NLP, computational linguistics, grounded and embodied language. Former/current intern at #Microsoft Research; undergrad in cogsci at #McGill

Pronouns: he/him

I am on the job market (faculty, postdoc, and industry)! I work on helping people communicate with machines, using language to improve AI agents, and examining how people communicate and reason. I'm looking for roles at the intersection of #NLP #ComputationalLinguistics and #AI

I will be at #EMNLP2022: please reach out to chat IRL or digitally!

With DidYouMean, we end up with much better usability than with the threshold, but ALSO a better safety profile! So we can have the best of both worlds, and all it costs is a fairly simple user interaction. All of this, of course, hinges on having a well-calibrated model.

9/12

So to sum up: the better models work, the more they get deployed to real people, and the more calibration matters!

Work done with the inimitable @ben_vandurme at @jhuclsp @JHUCompSci

📝: https://arxiv.org/abs/2211.07443
👩‍💻: https://github.com/esteng/calibration_miso
Metric: https://github.com/esteng/calibration_metric

12/12

Calibrated Interpretation: Confidence Estimation in Semantic Parsing

Task-oriented semantic parsing is increasingly being used in user-facing applications, making measuring the calibration of parsing models especially important. We examine the calibration characteristics of six models across three model families on two common English semantic parsing datasets, finding that many models are reasonably well-calibrated and that there is a trade-off between calibration and performance. Based on confidence scores across three models, we propose and release new challenge splits of the two datasets we examine. We then illustrate the ways a calibrated model can be useful in balancing common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that using model confidence allows us to improve the accuracy on validation programs by 9.6% (absolute) with annotator interactions on only 2.2% of tokens. Using sequence-level confidence scores, we then examine how we can optimize the trade-off between a parser's usability and safety. We show that confidence-based thresholding can reduce the number of incorrect low-confidence programs executed by 76%; however, this comes at a cost to usability. We propose the DidYouMean system, which balances usability and safety. We conclude by calling for calibration to be included in the evaluation of semantic parsing systems, and release a library for computing calibration metrics.


Semantic parsing is making a comeback in robot manipulation, e.g. SayCan (https://arxiv.org/abs/2204.01691) and ProgPrompt (https://arxiv.org/abs/2209.11302).
Confidence matters here too: you don't want your agent doing things that are likely to fail when the consequences can't be undone.

11/12

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.


This work has broader implications:
For one, executable semantic parsing is a type of NL-to-code task, where user-facing systems are becoming increasingly popular (e.g. Copilot). Confidence estimation matters a lot in these interfaces! cf. https://dl.acm.org/doi/abs/10.1145/302979.303030

10/12

Principles of mixed-initiative user interfaces | Proceedings of the SIGCHI conference on Human Factors in Computing Systems


We can use a confidence threshold to improve safety (throw away low-confidence predictions), but that comes at a cost to usability. A threshold can only balance safety and usability so well; DidYouMean lets us beat that trade-off by adding a human into the loop.
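The threshold trade-off can be sketched concretely. Below is a minimal toy illustration (hypothetical data, not the paper's code): sweeping a confidence threshold, a higher bar executes fewer wrong programs (safety) but also fewer programs overall (usability).

```python
def coverage_and_risk(confidences, correct, threshold):
    """Execute only predictions with confidence >= threshold.
    Coverage (usability): fraction of programs executed.
    Risk (safety): fraction of executed programs that are wrong."""
    executed = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(executed) / len(confidences)
    risk = (sum(1 for _, ok in executed if not ok) / len(executed)) if executed else 0.0
    return coverage, risk

# Toy predictions from a roughly calibrated model:
# high-confidence ones tend to be right.
confs = [0.95, 0.9, 0.85, 0.6, 0.55, 0.3]
right = [1,    1,   1,    1,   0,    0]

for tau in (0.0, 0.5, 0.8):
    cov, risk = coverage_and_risk(confs, right, tau)
    print(f"tau={tau}: coverage={cov:.2f}, risk={risk:.2f}")
```

Raising `tau` from 0.0 to 0.8 here drops risk to zero but also halves coverage, which is exactly the trade-off a pure threshold cannot escape.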

8/12

We use a calibrated model in 2 ways. First, in a simulated human-in-the-loop experiment, token-level calibration helps annotators create new data. We use confidence to trigger human intervention, boosting absolute accuracy by ~10% with very few interactions.
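The intervention logic can be sketched as follows (a toy simulation under my own assumptions, not the paper's implementation): whenever the model's per-token confidence falls below a threshold, a simulated annotator supplies the gold token.

```python
def simulate_annotator_loop(pred_tokens, token_confs, gold_tokens, threshold):
    """Token-level human-in-the-loop: when the model's confidence in a
    token falls below the threshold, the (simulated) annotator supplies
    the gold token; otherwise the model's prediction is kept."""
    out, interactions = [], 0
    for pred, conf, gold in zip(pred_tokens, token_confs, gold_tokens):
        if conf < threshold:
            out.append(gold)   # annotator intervenes
            interactions += 1
        else:
            out.append(pred)   # trust the model
    accuracy = sum(o == g for o, g in zip(out, gold_tokens)) / len(gold_tokens)
    return accuracy, interactions / len(pred_tokens)

# Hypothetical parse: one wrong token, which the model is unsure about.
pred = ["(", "call", "foo", ")", "(", "call", "baz", ")"]
conf = [0.99, 0.97, 0.40, 0.99, 0.98, 0.96, 0.95, 0.99]
gold = ["(", "call", "bar", ")", "(", "call", "baz", ")"]
acc, rate = simulate_annotator_loop(pred, conf, gold, threshold=0.5)
print(acc, rate)  # → 1.0 0.125
```

With a well-calibrated model, the low-confidence tokens are exactly the wrong ones, so a single interaction (12.5% of tokens here) fixes the parse.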

6/12

We also use confidences in DidYouMean. Here, we want to balance usability (coverage) with safety (risk). A model that executes everything is usable but not so safe (it might execute low-confidence programs), while a model that does nothing is super safe but useless.

7/12

We found something surprising: across several models and 2 datasets, models were generally pretty well-calibrated.
We use that finding to create new challenge splits, and we release a library for easily computing calibration metrics.
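For readers unfamiliar with calibration metrics: the standard one is expected calibration error (ECE), a binned gap between confidence and accuracy. Here is a minimal sketch of ECE (my own simplified version, not the released library's code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and
    empirical accuracy within each confidence bin. 0 = perfectly
    calibrated; 1 = maximally miscalibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()    # how often the model was right
        conf = confidences[mask].mean()  # how confident it claimed to be
        ece += mask.mean() * abs(acc - conf)
    return ece

# Perfectly calibrated toy case: 80% confident, right 80% of the time.
confs = [0.8] * 10
right = [1] * 8 + [0] * 2
print(round(expected_calibration_error(confs, right), 4))  # → 0.0
```

"Well-calibrated" in the thread's sense means this gap is small: a model that says 80% should be right about 80% of the time.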

So what else can we do with calibrated models?

4/12

Calibration is a superpower in semantic parsing 🦸
It means that you can predict how likely you are to be right BEFORE you execute. This matters especially in physical domains 🤖 where you might not be able to undo an action: you can't unbreak an egg or un-slice an apple

5/12