Better Uncertainty Calibration via Proper Scores for Classification and Beyond
With model trustworthiness being crucial for sensitive real-world
applications, practitioners are putting more and more focus on improving the
uncertainty calibration of deep neural networks. Calibration errors are
designed to quantify the reliability of probabilistic predictions but their
estimators are usually biased and inconsistent. In this work, we introduce the
framework of proper calibration errors, which relates every calibration error
to a proper score and provides a respective upper bound with optimal estimation
properties. This relationship can be used to reliably quantify the model
calibration improvement. We theoretically and empirically demonstrate the
shortcomings of commonly used estimators compared to our approach. Due to the
wide applicability of proper scores, this gives a natural extension of
recalibration beyond classification.
Comment: Accepted at NeurIPS 202
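A minimal Python sketch of the contrast the abstract draws: the Brier score is a proper score whose sample mean is an unbiased, consistent estimator, while the standard binned ECE estimator is biased due to binning. The function names and binning scheme below are illustrative, not taken from the paper.

```python
import numpy as np

def brier_score(probs, labels):
    """Brier score: a proper score; its sample mean is an unbiased estimator."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def binned_ece(probs, labels, n_bins=15):
    """Top-label expected calibration error with equal-width bins.
    The binning makes this estimator biased and inconsistent in general."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```

Because the proper score admits reliable estimation, a drop in it after recalibration gives a trustworthy signal of calibration improvement, whereas a drop in the binned estimate may be an artifact of the bins.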
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
Reliably estimating the uncertainty of a prediction throughout the model
lifecycle is crucial in many safety-critical applications. The most common way
to measure this uncertainty is via the predicted confidence. While this tends
to work well for in-domain samples, these estimates are unreliable under domain
drift and restricted to classification. Alternatively, proper scores can be
used for most predictive tasks but a bias-variance decomposition for model
uncertainty does not exist in the current literature. In this work we introduce
a general bias-variance decomposition for proper scores, giving rise to the
Bregman Information as the variance term. We discover how exponential families
and the classification log-likelihood are special cases and provide novel
formulations. Surprisingly, we can express the classification case purely in
the logit space. We showcase the practical relevance of this decomposition on
several downstream tasks, including model ensembles and confidence regions.
Further, we demonstrate how different approximations of the instance-level
Bregman Information allow reliable out-of-distribution detection for all
degrees of domain drift.
Comment: Accepted at AISTATS 202
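As an illustration of the variance term, here is a minimal sketch (our own naming, under the assumption that the log-sum-exp generator is the one matching the classification log-likelihood) of the Bregman Information computed purely in logit space from an ensemble's logits:

```python
import numpy as np
from scipy.special import logsumexp

def bregman_information_lse(logits):
    """Bregman Information with the log-sum-exp generator, computed
    from an ensemble's logits for a single input.

    For a convex generator phi, BI(Z) = E[phi(Z)] - phi(E[Z]); with
    phi = logsumexp this Jensen gap is a variance-like uncertainty
    term expressed purely in logit space.

    logits: array of shape (n_members, n_classes).
    """
    return logsumexp(logits, axis=1).mean() - logsumexp(logits.mean(axis=0))
```

Higher values indicate stronger disagreement between ensemble members on that input, which can be thresholded as a score for out-of-distribution detection.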
A Bias-Variance-Covariance Decomposition of Kernel Scores for Generative Models
Generative models, like large language models, are becoming increasingly
relevant in our daily lives, yet a theoretical framework to assess their
generalization behavior and uncertainty does not exist. In particular,
uncertainty estimation is commonly solved in an ad-hoc, task-dependent manner;
for example, natural language approaches cannot be transferred to image
generation. In this paper, we introduce the first
bias-variance-covariance decomposition for kernel scores and their associated
entropy. We propose unbiased and consistent estimators for each quantity which
only require generated samples but not the underlying model itself. As an
application, we offer a generalization evaluation of diffusion models and
discover how mode collapse of minority groups is a phenomenon contrary to
overfitting. Further, we demonstrate that variance and predictive kernel
entropy are viable measures of uncertainty for image, audio, and language
generation. Specifically, our approach for uncertainty estimation is more
predictive of performance on CoQA and TriviaQA question answering datasets than
existing baselines and can also be applied to closed-source models.
Comment: Preprint
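A minimal sketch of a sample-only estimator in the spirit described above, assuming an RBF kernel; the function names, kernel choice, and sign convention are ours, not the paper's:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel on flattened feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_entropy_estimate(samples, kernel=rbf_kernel):
    """Pairwise kernel average over generated samples; skipping the
    diagonal (i == j) makes this an unbiased U-statistic. Requires
    only samples, never the underlying model."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kernel(samples[i], samples[j])
    return -total / (n * (n - 1))
```

Because the estimator consumes nothing beyond generated outputs, it applies equally to closed-source models, matching the claim in the abstract.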
Parameterized Temperature Scaling for Boosting the Expressive Power in Post-Hoc Uncertainty Calibration
We address the problem of uncertainty calibration and introduce a novel
calibration method, Parameterized Temperature Scaling (PTS). Standard deep
neural networks typically yield uncalibrated predictions, which can be
transformed into calibrated confidence scores using post-hoc calibration
methods. In this contribution, we demonstrate that the performance of
accuracy-preserving state-of-the-art post-hoc calibrators is limited by their
intrinsic expressive power. We generalize temperature scaling by computing
prediction-specific temperatures, parameterized by a neural network. We show
with extensive experiments that our novel accuracy-preserving approach
consistently outperforms existing algorithms across a large number of model
architectures, datasets, and metrics.
Comment: Technical report
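To make the idea concrete, a minimal PyTorch sketch of prediction-specific temperature scaling; the network size and layer choices are illustrative assumptions, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class ParameterizedTemperature(nn.Module):
    """Sketch of prediction-specific temperature scaling: a small MLP
    maps each logit vector to a positive temperature. Dividing logits
    by a positive scalar never changes the argmax, so accuracy is
    preserved by construction."""

    def __init__(self, n_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # temperature > 0
        )

    def forward(self, logits):
        t = self.net(logits) + 1e-6  # guard against division by zero
        return logits / t

# Fit post hoc on a held-out validation set by minimizing NLL, e.g.:
# model = ParameterizedTemperature(n_classes=10)
# loss = nn.functional.cross_entropy(model(val_logits), val_labels)
```

Plain temperature scaling is the special case where the network is replaced by a single learned constant, which is what limits its expressive power.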