Pseudo Label Selection is a Decision Problem
Pseudo-Labeling is a simple and effective approach to semi-supervised
learning. It requires criteria that guide the selection of pseudo-labeled data.
The latter have been shown to crucially affect pseudo-labeling's generalization
performance. Several such criteria exist and were proven to work reasonably
well in practice. However, their performance often depends on the initial model
fit on labeled data. Early overfitting can be propagated to the final model by
choosing instances with overconfident but wrong predictions, often called
confirmation bias. In two recent works, we demonstrate that pseudo-label
selection (PLS) can be naturally embedded into decision theory. This paves the
way for BPLS, a Bayesian framework for PLS that mitigates the issue of
confirmation bias. At its heart is a novel selection criterion: an analytical
approximation of the posterior predictive of pseudo-samples and labeled data.
We derive this selection criterion by proving Bayes-optimality of this "pseudo
posterior predictive". We empirically assess BPLS for generalized linear,
non-parametric generalized additive models and Bayesian neural networks on
simulated and real-world data. When faced with data prone to overfitting and
thus a high chance of confirmation bias, BPLS outperforms traditional PLS
methods. The decision-theoretic embedding further allows us to render PLS more
robust towards the involved modeling assumptions. To achieve this goal, we
introduce a multi-objective utility function. We demonstrate that the latter
can be constructed to account for different sources of uncertainty and explore
three examples: model selection, accumulation of errors and covariate shift.
Comment: Accepted for presentation at the 46th German Conference on Artificial Intelligence.
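To make the selection mechanism concrete, the following is a minimal sketch of Laplace-approximated pseudo-label selection for a logistic model; the Gaussian prior, its scale tau and all names are illustrative assumptions rather than the paper's reference implementation.

```python
# Minimal sketch of Bayesian pseudo-label selection (BPLS) for logistic
# regression via a Laplace approximation of the "pseudo posterior predictive".
import numpy as np
from scipy.optimize import minimize

def neg_log_joint(beta, X, y, tau=1.0):
    """Negative log joint: logistic log-likelihood plus N(0, tau^2) prior."""
    z = X @ beta
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))
    log_prior = -0.5 * np.sum(beta ** 2) / tau ** 2
    return -(log_lik + log_prior)

def laplace_log_evidence(X, y, tau=1.0):
    """Laplace approximation of log p(y | X): log joint at the mode
    plus (d/2) log(2*pi) minus half the log-determinant of the Hessian."""
    d = X.shape[1]
    res = minimize(neg_log_joint, np.zeros(d), args=(X, y, tau), method="BFGS")
    p = 1.0 / (1.0 + np.exp(-(X @ res.x)))
    hessian = X.T @ np.diag(p * (1 - p)) @ X + np.eye(d) / tau ** 2
    _, logdet = np.linalg.slogdet(hessian)
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

def select_pseudo_label(X_lab, y_lab, X_unlab, y_pred):
    """Pick the unlabeled instance whose inclusion (with its predicted label)
    maximizes the approximate pseudo posterior predictive."""
    scores = [laplace_log_evidence(np.vstack([X_lab, x]), np.append(y_lab, y_hat))
              for x, y_hat in zip(X_unlab, y_pred)]
    return int(np.argmax(scores))
```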
An Empirical Study of Prior-Data Conflicts in Bayesian Neural Networks
Imprecise Probabilities (IP) allow for the representation of incomplete information. In the context of Bayesian statistics,
this is achieved by generalized Bayesian inference, where a set of priors is used instead of a single prior [1, Chapter 7.4].
The latter has been shown to be particularly useful in the case of prior-data conflict, where evidence from data (likelihood)
contradicts prior information. In these practically highly relevant scenarios, classical (precise) probability models typically
fail to adequately represent the uncertainty arising from this conflict. Generalized Bayesian inference by IP, however, was
proven to handle these prior-data conflicts well when inference in canonical exponential families is considered [3].
Our study [2] aims at assessing the extent to which these problems of precise probability models are also present in
Bayesian neural networks (BNNs). Unlike traditional neural networks, BNNs utilize stochastic weights that can be learned
by updating the prior belief with the likelihood for each individual weight using Bayes’ rule. In light of this, we investigate
the impact of prior selection on the posterior of BNNs in the context of prior-data conflict. While the literature often
advocates for the use of normal priors centered around 0, the consequences of this choice remain unknown when the data
suggests high values for the individual weights. For this purpose, we designed synthetic datasets which were generated
using neural networks (NN) with fixed high-weight values. This approach enables us to measure the effect of prior-data
conflict, as well as reduce the model uncertainty by knowing the exact weights and functional relationship. We utilized
BNNs that use the Mean-Field Variational Inference (MFVI) approach, which has not only seen increasing interest
due to its scalability but also yields an analytically tractable approximation of the posterior, as opposed to simulation-based
methods like Markov Chain Monte Carlo (MCMC). In MFVI, the posterior distribution is approximated by a tractable
distribution with a factorized form.
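For reference, the mean-field variational family and its objective can be written as follows (standard notation, not taken from the study):

```latex
% Mean-field variational family and objective (ELBO) for the BNN weights:
\[
  q(\mathbf{w}) \;=\; \prod_i \mathcal{N}\!\bigl(w_i \mid \mu_i, \sigma_i^2\bigr),
  \qquad
  (\boldsymbol{\mu}, \boldsymbol{\sigma})
  \;=\; \arg\max_{\mu,\sigma}\;
  \mathbb{E}_{q}\bigl[\log p(\mathcal{D} \mid \mathbf{w})\bigr]
  \;-\; \mathrm{KL}\bigl(q(\mathbf{w}) \,\|\, p(\mathbf{w})\bigr).
\]
```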
In our work [2, Chapter 4.2], we provide evidence that priors centered around the exact weights, which are known
from the generating neural network (NN), outperform their inexact counterparts centered around zero in terms of predictive
accuracy, data efficiency and reasonable uncertainty estimates. These results imply that selecting a prior centered around 0
may be unintentionally informative, as previously noted by [4], leading to significant losses in prediction accuracy, higher
data requirements and impractical uncertainty estimates. BNNs learned under prior-data conflict produced posterior means
that were a weighted average of the prior mean and the values of highest likelihood and therefore differed substantially
from the correct weights, while also exhibiting an unreasonably low posterior variance, indicating a high degree of certainty
in their estimates. Varying the prior variance yielded similar observations, with models whose priors conflicted with the
data exhibiting overconfidence in their posterior estimates compared to those using exact priors.
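The weighted-average behaviour under prior-data conflict can be illustrated with a toy conjugate Normal model; the numbers below are purely illustrative, not data from the study.

```python
# Toy conjugate Normal illustration of prior-data conflict: the posterior mean
# is a precision-weighted average of prior mean and sample mean, and the
# posterior variance is small regardless of how strong the conflict is.
import numpy as np

def normal_posterior(prior_mean, prior_var, data, noise_var=1.0):
    post_var = 1.0 / (1.0 / prior_var + len(data) / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(0)
data = rng.normal(10.0, 1.0, size=20)      # "high" true value, as with the fixed high NN weights

print(normal_posterior(0.0, 0.1, data))    # conflicting prior: mean pulled far below 10, tiny variance
print(normal_posterior(10.0, 0.1, data))   # exact prior: mean near 10, same tiny variance
```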
To investigate the potential of IP methods, we are currently investigating the effect of expectation-valued interval
parameters, with the aim of generating reasonable uncertainty predictions. Overall, our preliminary results show that classical
BNNs produce overly confident but erroneous predictions in the presence of prior-data conflict. These findings motivate the
use of IP methods in deep learning.
Interpreting Generalized Bayesian Inference by Generalized Bayesian Inference
The concept of safe Bayesian inference [4] with learning rates [5] has recently sparked a lot of research, e.g. in the context of generalized linear models [2]. It is occasionally also referred to as generalized Bayesian inference, e.g. in [2, page 1] – a fact that should make IP advocates sit up and take notice, as this term is commonly used to describe Bayesian updating of credal sets. On this poster, we demonstrate that this reminiscence extends beyond terminology.
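For orientation, the two readings of "generalized Bayesian inference" can be written side by side in standard notation (not taken from the poster): tempered ("safe") Bayesian updating with learning rate η, and element-wise updating of a credal set of priors.

```latex
% (i) tempered posterior with learning rate eta; (ii) generalized Bayes in the
% IP sense: every prior in the credal set is updated by Bayes' rule.
\[
  \pi_\eta(\theta \mid x) \;\propto\; \pi(\theta)\, p(x \mid \theta)^{\eta}, \qquad \eta > 0,
\]
\[
  \mathcal{P}(\cdot \mid x) \;=\; \bigl\{\, \pi(\cdot \mid x) \,:\, \pi \in \mathcal{P} \,\bigr\}.
\]
```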
Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement
Spaces with locally varying scale of measurement, like multidimensional
structures with differently scaled dimensions, are common in statistics
and machine learning. Nevertheless, how to properly exploit the entire
information encoded in them remains an open question. We address this
problem by considering an order based on (sets of) expectations of random
variables mapping into such non-standard spaces. This order contains stochastic
dominance and expectation order as extreme cases when no, or respectively
perfect, cardinal structure is given. We derive a (regularized) statistical
test for our proposed generalized stochastic dominance (GSD) order,
operationalize it by linear optimization, and robustify it by imprecise
probability models. Our findings are illustrated with data from
multidimensional poverty measurement, finance, and medicine.
Comment: Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
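As a rough illustration of the idea only (not the authors' regularized linear-programming test), one can score two samples by the worst-case expectation difference over a finite family of candidate representations and calibrate the statistic with a permutation test; the representations and data below are made-up examples.

```python
# Toy generalized-stochastic-dominance check over a finite set of candidate
# representation functions, calibrated by permutation.
import numpy as np

def gsd_statistic(x, y, utilities):
    return min(np.mean(u(x)) - np.mean(u(y)) for u in utilities)

def permutation_pvalue(x, y, utilities, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    pooled, observed = np.concatenate([x, y]), gsd_statistic(x, y, utilities)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += gsd_statistic(perm[:len(x)], perm[len(x):], utilities) >= observed
    return hits / n_perm

# Candidate representations mimic a scale that is only partially cardinal.
utilities = [lambda t: t, np.sqrt, np.log1p]
x = np.random.default_rng(1).normal(2.0, 1.0, 100).clip(min=0)
y = np.random.default_rng(2).normal(1.5, 1.0, 100).clip(min=0)
print(gsd_statistic(x, y, utilities), permutation_pvalue(x, y, utilities))
```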
Not All Data Are Created Equal: Lessons From Sampling Theory For Adaptive Machine Learning
In survey methodology, inverse probability weighted (Horvitz-Thompson) estimation has become an indispensable part of statistical inference. This is triggered by the need to deal with complex samples, that is, non-identically distributed data. The general idea is that weighting observations inversely to their probability of being included in the sample produces unbiased estimators with reduced variance.
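A minimal numerical sketch of the Horvitz-Thompson idea, with a hypothetical population and inclusion probabilities chosen purely for illustration:

```python
# Inverse-probability weighting corrects the bias induced by unequal
# inclusion probabilities (Horvitz-Thompson estimation).
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(50.0, 10.0, size=10_000)

# Larger values are more likely to enter the sample (a "complex" sample).
incl_prob = 0.01 + 0.09 * (population - population.min()) / np.ptp(population)
sampled = rng.random(population.size) < incl_prob

naive_mean = population[sampled].mean()                        # biased upwards
ht_mean = np.sum(population[sampled] / incl_prob[sampled]) / population.size
print(naive_mean, ht_mean, population.mean())
```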
In this work, we argue that complex samples are subtly ubiquitous in two promising subfields of data science: Self-Training in Semi-Supervised Learning (SSL) and Bayesian Optimization (BO). Both methods rely on refitting learners to artificially enhanced training data. These enhancements are based on pre-defined criteria to select data points rendering some data more likely to be added than others. We experimentally analyze the distance from the so-produced complex samples to i.i.d. samples by Kullback-Leibler divergence and maximum mean discrepancy. What is more, we propose to handle such samples by inverse probability weighting. This requires estimation of inclusion probabilities. Unlike for some observational survey data, however, this is not a major issue since we excitingly have tons of explicit information on the inclusion mechanism. After all, we generate the data ourselves by means of the selection criteria.
To make things more tangible, consider the case of BO first. It optimizes an unknown function by iteratively approximating it through a surrogate model, whose mean and standard error estimates are scalarized into a selection criterion. The arguments of this criterion's optima are evaluated and added to the training data. We propose to weight them by means of the surrogate model's standard errors at the time of selection. For the case of deploying random forests as surrogate models, we refit them by weighted drawing in the bootstrap sampling step. Refitting may be done iteratively, aiming at speeding up the optimization, or after convergence, aiming at providing practitioners with a (globally) interpretable surrogate model.
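One plausible reading of this weighted refit, sketched with sklearn's sample_weight as a stand-in for weighted drawing in the bootstrap step; the concrete weight definition is an assumption, not the authors' formula.

```python
# Refit a random forest surrogate with weights based on the surrogate's
# standard error recorded when each point was selected: points acquired while
# the surrogate was very certain get small weights.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def refit_weighted_surrogate(X, y, std_at_selection, eps=1e-6):
    weights = np.asarray(std_at_selection, dtype=float) + eps
    weights /= weights.mean()                      # normalize for readability
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y, sample_weight=weights)
    return model
```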
Similarly, self-training in SSL selects instances from a set of unlabeled data, predicts their labels and adds these pseudo-labeled data to the training data. Instances are selected according to a confidence measure, e.g. the predictive variance. Regions in the feature space where the model is very confident are thus over-represented in the selected sample. We again explicitly exploit the selection criteria to define weights, which we use for resampling-based refitting of the model. Somewhat counter-intuitively, the more confident the model is in the self-assigned labels, the lower their weights should be to counteract the selection bias. Preliminary results suggest this can increase generalization performance.
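A minimal sketch of one such self-training step with inverse-confidence weights; the classifier, the selection rule and the exact weight definition are illustrative assumptions.

```python
# One self-training step: the more confident the model was when selecting a
# pseudo-label, the smaller its weight in the refit.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_step(model, X_lab, y_lab, X_unlab, n_select=10, eps=1e-6):
    proba = model.predict_proba(X_unlab)
    confidence = proba.max(axis=1)                   # selection criterion
    picked = np.argsort(confidence)[-n_select:]      # most confident instances
    y_pseudo = model.classes_[proba[picked].argmax(axis=1)]

    X_new = np.vstack([X_lab, X_unlab[picked]])
    y_new = np.concatenate([y_lab, y_pseudo])
    # Labeled data keep weight 1; pseudo-labels are down-weighted by confidence.
    w_new = np.concatenate([np.ones(len(y_lab)), 1.0 / (confidence[picked] + eps)])

    refit = LogisticRegression(max_iter=1000).fit(X_new, y_new, sample_weight=w_new)
    return refit, np.delete(X_unlab, picked, axis=0)
```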
Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision
We provide a theoretical and computational investigation of the Gamma-Maximin method with soft revision, which was recently proposed as a robust criterion for pseudo-label selection (PLS) in semi-supervised learning. As opposed to traditional methods for PLS, we use credal sets of priors ("generalized Bayes") to represent the epistemic modeling uncertainty. The latter are then updated by the Gamma-Maximin method with soft revision. We eventually select pseudo-labeled data that are most likely in light of the least favorable distribution from the so-updated credal set. We formalize the task of finding optimal pseudo-labeled data w.r.t. the Gamma-Maximin method with soft revision as an optimization problem. A concrete implementation for the class of logistic models then allows us to compare the predictive power of the method with competing approaches. It is observed that the Gamma-Maximin method with soft revision can achieve very promising results, especially when the proportion of labeled data is low.
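The selection logic can be sketched in a heavily simplified discrete toy model; the Bernoulli setting, the finite grid of priors and the pruning threshold beta are illustrative assumptions and do not reproduce the paper's logistic-model implementation.

```python
# Toy Gamma-Maximin pseudo-label selection with a soft-revision-style pruning:
# priors whose marginal likelihood on the labeled data falls below a fraction
# beta of the best are discarded, and candidates are scored under the least
# favorable remaining prior.
import numpy as np

theta_grid = np.linspace(0.01, 0.99, 99)                # success probabilities
priors = [np.ones_like(theta_grid) / theta_grid.size,   # uniform
          theta_grid / theta_grid.sum(),                # favours large theta
          theta_grid[::-1] / theta_grid.sum()]          # favours small theta

def likelihood(y):
    return theta_grid ** y.sum() * (1 - theta_grid) ** (len(y) - y.sum())

def predictive(prior, y, y_new):
    post = prior * likelihood(y)
    post /= post.sum()
    return np.sum(post * (theta_grid if y_new == 1 else 1 - theta_grid))

def gamma_maximin_select(y_lab, candidates, beta=0.5):
    ml = np.array([np.sum(p * likelihood(y_lab)) for p in priors])
    kept = [p for p, m in zip(priors, ml) if m >= beta * ml.max()]   # soft revision
    scores = [min(predictive(p, y_lab, c) for p in kept) for c in candidates]
    return int(np.argmax(scores))

y_lab = np.array([1, 1, 0, 1])
print(gamma_maximin_select(y_lab, candidates=[1, 0]))
```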
In All Likelihoods: Robust Selection of Pseudo-Labeled Data
Self-training is a simple yet effective method within semi-supervised learning. Self-training’s rationale is to iteratively enhance training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (PLS). In this paper, we render PLS more robust towards the involved modeling assumptions. To this end, we treat PLS as a decision problem, which allows us to introduce a generalized utility function. The idea is to select pseudo-labeled data that maximize a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian α-cut updating rule for credal sets. We spotlight the application of three of our robust extensions on simulated data and on three real-world data sets. In a benchmarking study, we compare these extensions to traditional PLS methods. Results suggest that robustness with regard to model choice can lead to substantial accuracy gains.
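One possible shape of such a multi-objective utility, sketched with illustrative objectives (per-model confidence as a proxy for model choice and a crude closeness score as a proxy for covariate shift) and a maximin scalarization; these choices are assumptions, not the paper's construction.

```python
# Multi-objective utility for pseudo-label selection: each candidate gets one
# score per fitted model plus a closeness-to-labeled-data score, and the
# candidate maximizing the worst-case objective is selected (maximin).
import numpy as np

def multi_objective_select(models, X_lab, X_unlab):
    # One confidence objective per fitted model (robustness to model choice).
    conf = np.column_stack([m.predict_proba(X_unlab).max(axis=1) for m in models])
    # Closeness of each candidate to the labeled data (higher = less shift).
    dists = np.linalg.norm(X_unlab[:, None, :] - X_lab[None, :, :], axis=-1).min(axis=1)
    closeness = 1.0 / (1.0 + dists)
    utility = np.column_stack([conf, closeness])
    return int(np.argmax(utility.min(axis=1)))       # maximin over objectives
```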