[N/N] @timrudner @qixuan_feng @filangelos @zacharynado @dusenberrymw @Ghassen_ML @dustinvtran @Yarin
[13/N] Thank you to my co-first author Tim G. J. Rudner (@timrudner), co-authors from OATML and @google, and the many other collaborators who made this work possible!
[12/N] For example, in “Plex: Towards Reliability using Pretrained Large Model Extensions” (@dustinvtran et al.), we evaluate the performance of pretrained models on RETINA. (https://arxiv.org/abs/2207.07411)
[11/N] Our codebase allows you to reproduce our experiments (we provide 100+ tuned checkpoints over 6 random seeds) and benchmark your own BDL methods on predictive performance, robustness, and uncertainty quantification (evaluation and plotting code included).
[10/N] To enable future research on reliability in safety-critical settings, the RETINA Benchmark is open-sourced as part of Uncertainty Baselines:
https://github.com/google/uncertainty-baselines
[9/N] Our experiments swept over 400+ hyperparameter configurations using 100+ TPU days and 20+ GPU days (s/o to
@google @IntelLabs for their generous support!).
[8/N] Many more experiments in the paper, including:
- Severity Shift: can models adapt to cases more severe than any seen during training?
- Predictive entropy histograms at each retinopathy severity level, OOD detection, ECE, class imbalance and preprocessing ablations.
[7/N] Another finding is that *there is no single best method*. For example, MFVI (purple) has the strongest selective prediction performance under the Country Shift (right) but the worst when evaluated in-domain (left).
[6/N] Using selective prediction, we see that model performance is significantly worse under the Country Shift.
[5/N] We use “selective prediction” to simulate automated diagnosis pipelines, computed as pictured. If a model has well-calibrated uncertainty, its performance p should increase with the referral fraction 𝛕, i.e., the proportion of patients referred to a medical expert.
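The referral mechanism behind selective prediction can be sketched in a few lines: sort cases by model uncertainty, refer the 𝛕 most uncertain fraction to an expert, and score the model only on what it keeps. This is a minimal illustration under my own assumptions (function name and toy data are hypothetical, not the paper's implementation):

```python
import numpy as np

def selective_prediction_curve(uncertainty, correct, taus):
    """Accuracy on the retained subset after referring the tau most
    uncertain fraction of cases to an expert (minimal sketch)."""
    order = np.argsort(uncertainty)          # most confident cases first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    accs = []
    for tau in taus:
        # Retain the most confident (1 - tau) fraction; refer the rest.
        keep = max(1, int(round((1 - tau) * n)))
        accs.append(correct[:keep].mean())
    return accs

# Toy data: higher uncertainty -> more likely wrong, so the curve
# should rise with tau for a model with useful uncertainty estimates.
rng = np.random.default_rng(0)
unc = rng.uniform(size=1000)
correct = rng.uniform(size=1000) > unc
curve = selective_prediction_curve(unc, correct, taus=[0.0, 0.25, 0.5])
```

A flat or decreasing curve under shift is exactly the failure mode the thread describes: the model's uncertainty does not identify the cases it gets wrong.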