CS PhD Student at Stanford University
Website: https://nband.github.io/
Twitter: https://twitter.com/neilbband
Announcing the public release of the #̶N̶e̶u̶r̶I̶P̶S̶2̶0̶2̶2̶ #NeurIPS2021 (😅) RETINA Benchmark:

A suite of tasks evaluating the reliability of uncertainty quantification methods like Deep Ensembles, MC Dropout, Parameter- and Function-Space VI, and more.

Paper: https://arxiv.org/abs/2211.12717
Code+Checkpoints: https://rebrand.ly/retina-benchmark

#NewPaper #arxiv #PaperThread
🧵 below👇🏾 [0/N]
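For readers unfamiliar with the baselines above: MC Dropout, for example, keeps dropout active at test time and averages several stochastic forward passes to form a predictive distribution. A minimal, generic PyTorch sketch, not the benchmark's implementation (`model`, `x`, and `n_samples` are placeholders):

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Average class probabilities over n_samples stochastic
    forward passes with dropout kept active at test time."""
    model.eval()
    # Switch only the dropout layers back to train mode so they keep
    # sampling masks (batch norm etc. stay in eval mode).
    for m in model.modules():
        if m.__class__.__name__.startswith("Dropout"):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0)  # predictive distribution, shape [batch, n_classes]
```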

[1/N] Bayesian Deep Learning has promised to improve neural network reliability in safety-critical applications, such as those in healthcare and autonomous driving.

Yet to holistically assess Bayesian Deep Learning methods, we need benchmarks on real-world tasks that reflect realistic distribution shifts, and strong uncertainty quantification baselines that capture both aleatoric and epistemic uncertainty.
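For context, the standard way to separate the two kinds of uncertainty for ensemble-style methods is an entropy decomposition over the members' predictions: total predictive entropy splits into expected member entropy (aleatoric, irreducible data noise) plus the remainder (epistemic, model disagreement). A rough NumPy sketch, assuming you already have per-member class probabilities:

```python
import numpy as np

def decompose_uncertainty(member_probs, eps=1e-12):
    """Entropy-based uncertainty decomposition for an ensemble.

    member_probs: array [n_members, n_examples, n_classes] of
    per-member predictive probabilities.
    Returns (total, aleatoric, epistemic) per example, where
      total     = entropy of the mean prediction,
      aleatoric = mean of the per-member entropies,
      epistemic = total - aleatoric (the mutual information).
    """
    mean_probs = member_probs.mean(axis=0)
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    member_entropy = -(member_probs * np.log(member_probs + eps)).sum(axis=-1)
    aleatoric = member_entropy.mean(axis=0)
    return total, aleatoric, total - aleatoric
```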

[2/N] To this end, we designed RETINA, a suite of real-world tasks assessing the reliability of several established and SoTA Bayesian and non-Bayesian uncertainty quantification methods.
[3/N] We curated two public datasets of high-res human retina images exhibiting varying degrees of diabetic retinopathy, and evaluated methods on an automated diagnosis task (pictured) that requires reliable predictive uncertainty quantification.

[4/N] Our main takeaway: uncertainty-ambivalent evaluation can be misleading. E.g., on the “Country Shift” task, models are trained on the US-sourced EyePACS dataset and evaluated out-of-domain on the Indian APTOS dataset.

Counterintuitively, when considering ROC curves, methods consistently perform better on the distributionally shifted APTOS data than in-domain (the black dot marks the NHS-recommended threshold for automated diagnosis).
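This kind of in-domain vs. shifted comparison takes only a few lines with scikit-learn. A sketch under stated assumptions: the label/score arrays are hypothetical inputs, and the marked operating point (85% sensitivity at 80% specificity) is illustrative rather than quoted from the paper:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc(ax, y_true, y_score, label):
    """Add one ROC curve, with its AUC in the legend, to an axis."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ax.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y_true, y_score):.3f})")

def compare_domains(y_in, p_in, y_shift, p_shift):
    """Overlay in-domain and shifted ROC curves and mark an
    NHS-style operating point (illustrative coordinates)."""
    fig, ax = plt.subplots()
    plot_roc(ax, y_in, p_in, "EyePACS (in-domain)")
    plot_roc(ax, y_shift, p_shift, "APTOS (Country Shift)")
    ax.plot(1 - 0.80, 0.85, "ko")  # (FPR, TPR) of the assumed threshold
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend()
    return fig
```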

[5/N] We use “selective prediction” to simulate automated diagnosis pipelines, computed as pictured: the model’s most uncertain cases are referred to a medical expert first. If a model has good uncertainty estimates, its performance p should increase with the proportion of patients referred, 𝛕.

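A minimal sketch of the referral curve this describes: sort cases by predictive uncertainty, hand the most-uncertain fraction 𝛕 to the expert, and score the model on what remains. Accuracy is used here for simplicity; any task metric (e.g., AUC) can be computed on the retained cases at each 𝛕:

```python
import numpy as np

def referral_curve(y_true, y_pred, uncertainty, taus=np.linspace(0.0, 0.9, 10)):
    """Selective prediction: for each referral rate tau, refer the tau
    most-uncertain cases to an expert and score the model on the rest.
    Returns the accuracy on retained (non-referred) cases per tau."""
    order = np.argsort(uncertainty)            # most confident first
    y_true, y_pred = y_true[order], y_pred[order]
    accs = []
    for tau in taus:
        keep = int(np.ceil((1 - tau) * len(y_true)))  # cases the model handles
        accs.append((y_true[:keep] == y_pred[:keep]).mean())
    return np.array(accs)
```

With well-calibrated uncertainty, this curve rises with 𝛕, since the most error-prone cases are handed off first.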
[6/N] Using selective prediction, we see that model performance is significantly worse under the Country Shift.

Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose the RETINA Benchmark, a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast benchmarking that follows reproducibility and software design principles, with implementations of all methods included in the benchmark, as well as results computed over 100 TPU days and 20 GPU days, spanning 400 hyperparameter configurations, each evaluated on at least 6 random seeds.
