[N/N] @timrudner @qixuan_feng @filangelos @zacharynado @dusenberrymw @Ghassen_ML @dustinvtran @Yarin
[13/N] Thank you to my co-first author Tim G. J. Rudner (@timrudner), co-authors from OATML and @google, and the many other collaborators who made this work possible!
[12/N] For example, in “Plex: Towards Reliability using Pretrained Large Model Extensions” (@dustinvtran et al.), we evaluate the performance of pretrained models on RETINA. (https://arxiv.org/abs/2207.07411)
[11/N] Our codebase allows you to reproduce our experiments (we provide 100+ tuned checkpoints over 6 random seeds) and benchmark your own BDL methods on predictive performance, robustness, and uncertainty quantification (evaluation and plotting code included).
[10/N] To enable future research on reliability in safety-critical settings, the RETINA Benchmark is open-sourced as part of Uncertainty Baselines:
https://github.com/google/uncertainty-baselines
[9/N] Our experiments swept over 400+ hyperparameter configurations using 100+ TPU days and 20+ GPU days (s/o to
@google @IntelLabs for their generous support!).
[8/N] Many more experiments in the paper, including:
- Severity Shift: can models adapt to cases more severe than any seen during training?
- Predictive entropy histograms at each retinopathy severity level, OOD detection, ECE, class imbalance and preprocessing ablations.
[7/N] Another finding is that *there is no single best method*. For example, MFVI (purple) has the strongest selective prediction performance under the Country Shift (right) but the worst when evaluated in-domain (left).
[6/N] Using selective prediction, we see that model performance is significantly worse under the Country Shift.
[5/N] We use “selective prediction” to simulate automated diagnosis pipelines, computed as pictured. If a model has well-calibrated uncertainty, its performance p should increase with the referral fraction 𝛕, i.e., the proportion of patients referred to a medical expert.
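The referral mechanism behind selective prediction can be sketched in a few lines: sort cases by model uncertainty, refer the 𝛕 most uncertain fraction to an expert, and score the model only on what it keeps. This is a minimal illustration under my own assumptions (function name and toy data are hypothetical, not the paper's implementation):

```python
import numpy as np

def selective_prediction_curve(uncertainty, correct, taus):
    """Accuracy on the retained subset after referring the tau most
    uncertain fraction of cases to an expert (minimal sketch)."""
    order = np.argsort(uncertainty)          # most confident cases first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    accs = []
    for tau in taus:
        # Retain the most confident (1 - tau) fraction; refer the rest.
        keep = max(1, int(round((1 - tau) * n)))
        accs.append(correct[:keep].mean())
    return accs

# Toy data: higher uncertainty -> more likely wrong, so the curve
# should rise with tau for a model with useful uncertainty estimates.
rng = np.random.default_rng(0)
unc = rng.uniform(size=1000)
correct = rng.uniform(size=1000) > unc
curve = selective_prediction_curve(unc, correct, taus=[0.0, 0.25, 0.5])
```

A flat or decreasing curve under shift is exactly the failure mode the thread describes: the model's uncertainty does not identify the cases it gets wrong.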