Daniel Johnson

PhD student at Vector Institute / University of Toronto. Building tools to study neural nets and find out what they know. He/him.
Website: https://www.danieldjohnson.com/
Twitter: @_ddjohnson
[9/11]
Empirically, we find that R-U-SURE is better than baselines at identifying the regions that differ between model suggestions and ground truth intents from our test set. The utility of our suggestions against the ground-truth intent is also high, and improves with more samples.
[8/11]
We can even invert the meaning of the annotations, and use our system to identify the most useful parts of a long generated sample! This could be used to preemptively show documentation or usage examples instead of directly suggesting code.
[6/11]
This is still a difficult optimization problem, so we adapt two tricks from combinatorial optimization: dual decomposition, which breaks our problem into a set of message-passing subproblems, and decision diagrams, which let us solve subproblems efficiently.
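To give a flavor of the first trick, here is a toy dual decomposition on a separable linear objective (not the paper's actual objective, and the paper's subproblems are solved with decision diagrams rather than by hand): each term gets its own copy of the variables, and dual "messages" nudge the copies until they agree.

```python
import numpy as np

# Toy dual decomposition: maximize f1(x) + f2(x) over binary x by
# giving each term its own copy of x and passing "messages" (dual
# variables) until the copies agree. Here f1, f2 are just random
# linear scores; the real subproblems are much richer.
rng = np.random.default_rng(0)
a, b = rng.normal(size=6), rng.normal(size=6)

lam = np.zeros(6)                       # dual variables ("messages")
for _ in range(200):
    x1 = (a + lam > 0).astype(float)    # argmax of f1(x1) + lam @ x1
    x2 = (b - lam > 0).astype(float)    # argmax of f2(x2) - lam @ x2
    if np.array_equal(x1, x2):          # copies agree: consistent solution
        break
    lam -= 0.1 * (x1 - x2)              # subgradient step toward agreement

print(x1, (a + b) @ x1)
```

The appeal is that each argmax only ever sees its own term plus a linear penalty, so the coupled problem is reduced to independent subproblems plus cheap coordination.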
[5/11]
Our key observation is that samples from a well-trained generative model can be interpreted as plausible goal states for the user's code! We can thus use these samples to approximate the expected utility of a suggestion, similar to sample-based minimum Bayes risk decoding.
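A minimal sketch of the sample-based idea, with toy strings standing in for model samples and `difflib`'s similarity ratio standing in for the paper's edit-distance-based utility:

```python
import difflib

def utility(suggestion, intent):
    # Stand-in utility: similarity ratio as a rough proxy for
    # "few edits needed" (higher = closer to the intent).
    return difflib.SequenceMatcher(None, suggestion, intent).ratio()

# Samples from the generative model act as plausible user intents.
samples = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",
    "def add(a, b): return a * b",
]

# Minimum-Bayes-risk-style choice: pick the candidate with the best
# average utility against all sampled intents (candidates = samples here).
best = max(samples, key=lambda c: sum(utility(c, s) for s in samples))
print(best)
```

The winning candidate is the one that stays closest, on average, to every plausible completion, rather than the single most likely one.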
[4/11]
Formally, our goal is to find an annotated suggestion that maximizes our edit-distance based utility metric for the (unknown) code that the user wants to write. Since we don't know the user's intent exactly, we maximize the expected value of this metric over possible intents.
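In generic notation (not necessarily the paper's exact symbols), with suggestion $s$, unknown intent $y$, and utility $U$:

```latex
s^\star \;=\; \arg\max_{s}\; \mathbb{E}_{\,y \sim p(\cdot \mid \text{context})}\bigl[\, U(s, y) \,\bigr]
```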
[3/11]
In contrast, our system produces annotations by explicitly approximating the utility of a suggestion for a user with a particular intent. We focus on edit distance, and assume that identifying regions as uncertain makes them easier to edit, but less useful if they are correct.
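A minimal sketch of such a region-aware utility, with hypothetical weights chosen for illustration (the paper's actual edit-distance utility is more involved): uncertain tokens are cheaper to fix when wrong, but earn less credit when right.

```python
# Toy region-aware utility (hypothetical weights, not the paper's
# constants): tokens marked uncertain cost less to edit if wrong,
# but earn less credit if they match the user's intent.
def utility(suggestion, marks, intent,
            match=1.0, match_uncertain=0.5,
            edit=-1.0, edit_uncertain=-0.25):
    total = 0.0
    for tok, unc, want in zip(suggestion, marks, intent):
        if tok == want:
            total += match_uncertain if unc else match
        else:
            total += edit_uncertain if unc else edit
    return total

sugg   = ["x", "=", "load", "(", "path", ")"]
intent = ["x", "=", "load", "(", "file", ")"]
marks  = [0, 0, 0, 0, 1, 0]            # "path" flagged as uncertain

print(utility(sugg, marks, intent))    # 4.75: cheap edit on "path"
print(utility(sugg, [0] * 6, intent))  # 4.0: full-price edit
```

Marking the one wrong token raises the utility here, while marking everything would forfeit credit on the five correct tokens; that trade-off is exactly what the optimization balances.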
[2/11]
One way to understand the uncertainty in a language model's output is to look at its per-token probabilities. However, this can be hard to interpret and sometimes misleading, since token probabilities always depend on all previous tokens and on the model vocabulary.
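As an illustration, the simplest version of this baseline just thresholds per-token probabilities (toy numbers below, not from a real model):

```python
# Token-probability baseline: flag tokens whose conditional
# probability falls below a threshold. Probabilities are made up
# for illustration.
tokens = ["x", "=", "load", "(", "path", ")"]
probs  = [0.99, 0.98, 0.60, 0.97, 0.12, 0.99]

flagged = [t for t, p in zip(tokens, probs) if p < 0.5]
print(flagged)  # ['path']
```

The catch is that a low probability need not mean the token is wrong; it may just mean many paraphrases were possible at that position, which is why such scores can mislead.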

LLM-based assistants can speed up software development, but what should they do when they aren't sure what code to write? We're excited to share R-U-SURE, a drop-in system for adding uncertainty annotations to code suggestions!

Read our paper here: https://arxiv.org/abs/2303.00732

#PaperThread [1/11]

R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents

Large language models show impressive results at predicting structured text such as code, but also commonly introduce errors and hallucinations in their output. When used to assist software developers, these models may make mistakes that users must go back and fix, or worse, introduce subtle bugs that users may miss entirely. We propose Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE), an approach for building uncertainty-aware suggestions based on a decision-theoretic model of goal-conditioned utility, using random samples from a generative model as a proxy for the unobserved possible intents of the end user. Our technique combines minimum-Bayes-risk decoding, dual decomposition, and decision diagrams in order to efficiently produce structured uncertainty summaries, given only sample access to an arbitrary generative model of code and an optional AST parser. We demonstrate R-U-SURE on three developer-assistance tasks, and show that it can be applied to different user interaction patterns without retraining the model and leads to more accurate uncertainty estimates than token-probability baselines.
