Daniel Johnson

PhD student at Vector Institute / University of Toronto. Building tools to study neural nets and find out what they know. He/him.
Website: https://www.danieldjohnson.com/
Twitter: @_ddjohnson
[9/11]
Empirically, we find that R-U-SURE is better than baselines at identifying the regions that differ between model suggestions and ground truth intents from our test set. The utility of our suggestions against the ground-truth intent is also high, and improves with more samples.
[8/11]
We can even invert the meaning of the annotations, and use our system to identify the most useful parts of a long generated sample! This could be used to preemptively show documentation or usage examples instead of directly suggesting code.
[6/11]
This is still a difficult optimization problem, so we adapt two tricks from combinatorial optimization: dual decomposition, which breaks our problem into a set of message-passing subproblems, and decision diagrams, which let us solve subproblems efficiently.
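To give a flavor of the first trick, here is a toy dual decomposition on a separable linear objective (not the paper's actual objective, and the paper's subproblems are solved with decision diagrams rather than by hand): each term gets its own copy of the variables, and dual "messages" nudge the copies until they agree.

```python
import numpy as np

# Toy dual decomposition: maximize f1(x) + f2(x) over binary x by
# giving each term its own copy of x and passing "messages" (dual
# variables) until the copies agree. Here f1, f2 are just random
# linear scores; the real subproblems are much richer.
rng = np.random.default_rng(0)
a, b = rng.normal(size=6), rng.normal(size=6)

lam = np.zeros(6)                       # dual variables ("messages")
for _ in range(200):
    x1 = (a + lam > 0).astype(float)    # argmax of f1(x1) + lam @ x1
    x2 = (b - lam > 0).astype(float)    # argmax of f2(x2) - lam @ x2
    if np.array_equal(x1, x2):          # copies agree: consistent solution
        break
    lam -= 0.1 * (x1 - x2)              # subgradient step toward agreement

print(x1, (a + b) @ x1)
```

The appeal is that each argmax only ever sees its own term plus a linear penalty, so the coupled problem is reduced to independent subproblems plus cheap coordination.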
[5/11]
Our key observation is that samples from a well-trained generative model can be interpreted as plausible goal states for the user's code! We can thus use these samples to approximate the expected utility of a suggestion, similar to sample-based minimum Bayes risk decoding.
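A minimal sketch of the sample-based idea, with toy strings standing in for model samples and `difflib`'s similarity ratio standing in for the paper's edit-distance-based utility:

```python
import difflib

def utility(suggestion, intent):
    # Stand-in utility: similarity ratio as a rough proxy for
    # "few edits needed" (higher = closer to the intent).
    return difflib.SequenceMatcher(None, suggestion, intent).ratio()

# Samples from the generative model act as plausible user intents.
samples = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",
    "def add(a, b): return a * b",
]

# Minimum-Bayes-risk-style choice: pick the candidate with the best
# average utility against all sampled intents (candidates = samples here).
best = max(samples, key=lambda c: sum(utility(c, s) for s in samples))
print(best)
```

The winning candidate is the one that stays closest, on average, to every plausible completion, rather than the single most likely one.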
[4/11]
Formally, our goal is to find an annotated suggestion that maximizes our edit-distance based utility metric for the (unknown) code that the user wants to write. Since we don't know the user's intent exactly, we maximize the expected value of this metric over possible intents.
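In generic notation (not necessarily the paper's exact symbols), with suggestion $s$, unknown intent $y$, and utility $U$:

```latex
s^\star \;=\; \arg\max_{s}\; \mathbb{E}_{\,y \sim p(\cdot \mid \text{context})}\bigl[\, U(s, y) \,\bigr]
```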
[3/11]
In contrast, our system produces annotations by explicitly approximating the utility of a suggestion for a user with a particular intent. We focus on edit distance, and assume that identifying regions as uncertain makes them easier to edit, but less useful if they are correct.
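A minimal sketch of such a region-aware utility, with hypothetical weights chosen for illustration (the paper's actual edit-distance utility is more involved): uncertain tokens are cheaper to fix when wrong, but earn less credit when right.

```python
# Toy region-aware utility (hypothetical weights, not the paper's
# constants): tokens marked uncertain cost less to edit if wrong,
# but earn less credit if they match the user's intent.
def utility(suggestion, marks, intent,
            match=1.0, match_uncertain=0.5,
            edit=-1.0, edit_uncertain=-0.25):
    total = 0.0
    for tok, unc, want in zip(suggestion, marks, intent):
        if tok == want:
            total += match_uncertain if unc else match
        else:
            total += edit_uncertain if unc else edit
    return total

sugg   = ["x", "=", "load", "(", "path", ")"]
intent = ["x", "=", "load", "(", "file", ")"]
marks  = [0, 0, 0, 0, 1, 0]            # "path" flagged as uncertain

print(utility(sugg, marks, intent))    # 4.75: cheap edit on "path"
print(utility(sugg, [0] * 6, intent))  # 4.0: full-price edit
```

Marking the one wrong token raises the utility here, while marking everything would forfeit credit on the five correct tokens; that trade-off is exactly what the optimization balances.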
[2/11]
One way to understand the uncertainty in a language model's output is to look at its per-token probabilities. However, this can be hard to interpret and sometimes misleading, since token probabilities always depend on all previous tokens and on the model vocabulary.
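As an illustration, the simplest version of this baseline just thresholds per-token probabilities (toy numbers below, not from a real model):

```python
# Token-probability baseline: flag tokens whose conditional
# probability falls below a threshold. Probabilities are made up
# for illustration.
tokens = ["x", "=", "load", "(", "path", ")"]
probs  = [0.99, 0.98, 0.60, 0.97, 0.12, 0.99]

flagged = [t for t, p in zip(tokens, probs) if p < 0.5]
print(flagged)  # ['path']
```

The catch is that a low probability need not mean the token is wrong; it may just mean many paraphrases were possible at that position, which is why such scores can mislead.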

LLM-based assistants can speed up software development, but what should they do when they aren't sure what code to write? We're excited to share R-U-SURE, a drop-in system for adding uncertainty annotations to code suggestions!

Read our paper here: https://arxiv.org/abs/2303.00732

#PaperThread [1/11]

R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents

Large language models show impressive results at predicting structured text such as code, but also commonly introduce errors and hallucinations in their output. When used to assist software developers, these models may make mistakes that users must go back and fix, or worse, introduce subtle bugs that users may miss entirely. We propose Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE), an approach for building uncertainty-aware suggestions based on a decision-theoretic model of goal-conditioned utility, using random samples from a generative model as a proxy for the unobserved possible intents of the end user. Our technique combines minimum-Bayes-risk decoding, dual decomposition, and decision diagrams in order to efficiently produce structured uncertainty summaries, given only sample access to an arbitrary generative model of code and an optional AST parser. We demonstrate R-U-SURE on three developer-assistance tasks, and show that it can be applied to different user interaction patterns without retraining the model and leads to more accurate uncertainty estimates than token-probability baselines.
