14 research outputs found

    Pseudo Label Selection is a Decision Problem

    Pseudo-Labeling is a simple and effective approach to semi-supervised learning. It requires criteria that guide the selection of pseudo-labeled data. The latter have been shown to crucially affect pseudo-labeling's generalization performance. Several such criteria exist and were proven to work reasonably well in practice. However, their performance often depends on the initial model fit on labeled data. Early overfitting can be propagated to the final model by choosing instances with overconfident but wrong predictions, a phenomenon often called confirmation bias. In two recent works, we demonstrate that pseudo-label selection (PLS) can be naturally embedded into decision theory. This paves the way for BPLS, a Bayesian framework for PLS that mitigates the issue of confirmation bias. At its heart is a novel selection criterion: an analytical approximation of the posterior predictive of pseudo-samples and labeled data. We derive this selection criterion by proving the Bayes-optimality of this "pseudo posterior predictive". We empirically assess BPLS for generalized linear models, non-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data. When faced with data prone to overfitting and thus a high chance of confirmation bias, BPLS outperforms traditional PLS methods. The decision-theoretic embedding further allows us to render PLS more robust towards the involved modeling assumptions. To achieve this goal, we introduce a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors, and covariate shift. Comment: Accepted for presentation at the 46th German Conference on Artificial Intelligence.
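
    As a hedged reading of the selection rule described in this abstract (illustrative notation, not taken from the paper), the criterion ranks each candidate pseudo-sample z = (x, \hat{y}) by its posterior predictive given the labeled data D:

        z^\ast \in \arg\max_{z} \; p(z \mid \mathcal{D})
               = \arg\max_{z} \int_{\Theta} p(z \mid \theta)\, \pi(\theta \mid \mathcal{D})\, d\theta .

    The paper's contribution is an analytical approximation of this "pseudo posterior predictive" together with a proof of its Bayes-optimality; the approximation itself is not restated here.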

    Pseudo-Label Selection: Insights From Decision Theory


    An Empirical Study of Prior-Data Conflicts in Bayesian Neural Networks

    Imprecise Probabilities (IP) allow for the representation of incomplete information. In the context of Bayesian statistics, this is achieved by generalized Bayesian inference, where a set of priors is used instead of a single prior [1, Chapter 7.4]. The latter has been shown to be particularly useful in the case of prior-data conflict, where evidence from data (likelihood) contradicts prior information. In these practically highly relevant scenarios, classical (precise) probability models typically fail to adequately represent the uncertainty arising from this conflict. Generalized Bayesian inference by IP, however, was proven to handle these prior-data conflicts well when inference in canonical exponential families is considered [3]. Our study [2] aims at assessing the extent to which these problems of precise probability models are also present in Bayesian neural networks (BNNs). Unlike traditional neural networks, BNNs utilize stochastic weights that can be learned by updating the prior belief with the likelihood for each individual weight using Bayes' rule. In light of this, we investigate the impact of prior selection on the posterior of BNNs in the context of prior-data conflict. While the literature often advocates for the use of normal priors centered around 0, the consequences of this choice remain unknown when the data suggest high values for the individual weights.

    For this purpose, we designed synthetic datasets which were generated using neural networks (NN) with fixed high weight values. This approach enables us to measure the effect of prior-data conflict, as well as reduce the model uncertainty by knowing the exact weights and functional relationship. We utilized BNNs that use the Mean-Field Variational Inference (MFVI) approach, which has not only seen increasing interest due to its scalability but also allows for analytical computation of the approximate posterior distributions, as opposed to simulation-based methods like Markov Chain Monte Carlo (MCMC). In MFVI, the posterior distribution is approximated by a tractable distribution with a factorized form.

    In our work [2, Chapter 4.2], we provide evidence that exact priors centered around the true weights, which are known from the data-generating NN, outperform their inexact counterparts centered around zero in terms of predictive accuracy, data efficiency, and reasonable uncertainty estimates. These results directly imply that selecting a prior centered around 0 may be unintentionally informative, as previously noted by [4], resulting in significant losses in predictive accuracy, increased data requirements, and impractical uncertainty estimates. BNNs learned under prior-data conflict produced posterior means that were a weighted average of the prior mean and the values favored by the likelihood, and therefore differed significantly from the correct weights, while also exhibiting an unreasonably low posterior variance, indicating a high degree of certainty in their estimates. Varying the prior variance yielded similar observations, with models using priors in conflict with the data exhibiting overconfidence in their posterior estimates compared to those using exact priors. To investigate the potential of IP methods, we are currently studying the effect of expectation-valued interval parameters on generating reasonable uncertainty predictions. Overall, our preliminary results show that classical BNNs produce overly confident but erroneous predictions in the presence of prior-data conflict. These findings motivate the use of IP methods in deep learning.
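
    To make the overconfidence effect concrete, the following is a minimal, self-contained sketch of the conjugate normal update that underlies the Gaussian factors in MFVI; the numbers are invented for illustration and do not reproduce the study's actual setup.

        # Conjugate normal update for one weight: prior N(0, 1), data generated
        # under a "true" weight of 5 (prior-data conflict), known noise variance 1.
        mu0, var0 = 0.0, 1.0              # zero-centred prior, as commonly recommended
        n, ybar, noise_var = 5, 5.0, 1.0  # five observations whose mean suggests a weight near 5

        post_var = 1.0 / (1.0 / var0 + n / noise_var)               # ~0.17
        post_mean = post_var * (mu0 / var0 + n * ybar / noise_var)  # ~4.17

        # The posterior mean is a weighted average of prior mean and data estimate:
        # it misses the true weight 5 by roughly two posterior standard deviations,
        # while the small posterior variance signals (unwarranted) confidence.
        print(round(post_mean, 2), round(post_var, 2))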

    Interpreting Generalized Bayesian Inference by Generalized Bayesian Inference

    The concept of safe Bayesian inference [4] with learning rates [5] has recently sparked a lot of research, e.g. in the context of generalized linear models [2]. It is occasionally also referred to as generalized Bayesian inference, e.g. in [2, page 1] – a fact that should let IP advocates sit up straight and take notice, as this term is commonly used to describe Bayesian updating of credal sets. On this poster, we demonstrate that this reminiscence extends beyond terminology.
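
    For readers who know only the IP meaning of the term, the "learning rate" variant of generalized Bayes referred to above is usually written as a tempered posterior (standard notation, not taken from the poster):

        \pi_\eta(\theta \mid \mathcal{D}) \;\propto\; \pi(\theta)\, L(\theta; \mathcal{D})^{\eta}, \qquad \eta \in (0, 1],

    where \eta = 1 recovers the ordinary Bayesian posterior and smaller \eta downweights the likelihood, whereas generalized Bayesian inference in the IP sense applies ordinary Bayesian updating to every prior in a credal set.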

    Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement

    Spaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are pretty common in statistics and machine learning. Nevertheless, how to properly exploit the entire information encoded in them is still an open question. We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces. This order contains stochastic dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given. We derive a (regularized) statistical test for our proposed generalized stochastic dominance (GSD) order, operationalize it by linear optimization, and robustify it by imprecise probability models. Our findings are illustrated with data from multidimensional poverty measurement, finance, and medicine. Comment: Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
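
    A rough formalization of the order sketched above (illustrative notation; the paper's construction is more general) compares random variables X and Y through sets of expectations,

        X \succsim Y \;:\Longleftrightarrow\; \mathbb{E}_P[u(X)] \,\ge\, \mathbb{E}_P[u(Y)] \quad \text{for all } u \in \mathcal{U},

    where \mathcal{U} collects the real-valued representations compatible with the locally varying scale of measurement: allowing all monotone representations recovers (first-order) stochastic dominance, while a single cardinal representation recovers the expectation order.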

    Not All Data Are Created Equal: Lessons From Sampling Theory For Adaptive Machine Learning

    In survey methodology, inverse probability weighted (Horvitz-Thompson) estimation has become an indispensable part of statistical inference. This is triggered by the need to deal with complex samples, that is, non-identically distributed data. The general idea is that weighting observations inversely to their probability of being included in the sample produces unbiased estimators with reduced variance. In this work, we argue that complex samples are subtly ubiquitous in two promising subfields of data science: Self-Training in Semi-Supervised Learning (SSL) and Bayesian Optimization (BO). Both methods rely on refitting learners to artificially enhanced training data. These enhancements are based on pre-defined criteria to select data points, rendering some data more likely to be added than others. We experimentally analyze the distance of the so-produced complex samples from i.i.d. samples by Kullback-Leibler divergence and maximum mean discrepancy. What is more, we propose to handle such samples by inverse probability weighting. This requires estimation of inclusion probabilities. Unlike for some observational survey data, however, this is not a major issue, since we excitingly have plenty of explicit information on the inclusion mechanism. After all, we generate the data ourselves by means of the selection criteria.

    To make things more tangible, consider the case of BO first. It optimizes an unknown function by iteratively approximating it through a surrogate model, whose mean and standard error estimates are scalarized into a selection criterion. The arguments of this criterion's optima are evaluated and added to the training data. We propose to weight them by means of the surrogate model's standard errors at the time of selection. For the case of deploying random forests as surrogate models, we refit them by weighted drawing in the bootstrap sampling step. Refitting may be done iteratively, aiming at speeding up the optimization, or after convergence, aiming at providing users with a (globally) interpretable surrogate model.

    Similarly, self-training in SSL selects instances from a set of unlabeled data, predicts their labels, and adds these pseudo-labeled data to the training data. Instances are selected according to a confidence measure, e.g. the predictive variance. Regions of the feature space where the model is very confident are thus over-represented in the selected sample. We again explicitly exploit the selection criteria to define weights, which we use for resampling-based refitting of the model. Somewhat counter-intuitively, the more confident the model is in the self-assigned labels, the lower their weights should be to counteract the selection bias. Preliminary results suggest this can increase generalization performance.
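
    The weighting idea can be sketched in a few lines. This is an illustrative sketch only: the function names, the softmax inclusion model, and the numbers are invented and not taken from the abstract.

        import numpy as np

        # Turn the values of a selection criterion (acquisition score in BO,
        # confidence in self-training) into rough inclusion probabilities and
        # Horvitz-Thompson-style weights, then refit via weighted bootstrap resampling.

        def inclusion_probabilities(criterion_values, temperature=1.0):
            """Softmax over criterion values as a stand-in for the known selection mechanism."""
            z = np.asarray(criterion_values, dtype=float) / temperature
            z -= z.max()                     # numerical stability
            p = np.exp(z)
            return p / p.sum()

        def weighted_bootstrap_indices(weights, rng=None):
            """Bootstrap sample in which rows are drawn proportionally to the given weights."""
            rng = np.random.default_rng(rng)
            w = np.asarray(weights, dtype=float)
            n = len(w)
            return rng.choice(n, size=n, replace=True, p=w / w.sum())

        # The more confidently a point was selected, the lower its refit weight.
        scores = [2.3, 0.4, 1.1, 3.0]        # criterion values at the time of selection
        weights = 1.0 / inclusion_probabilities(scores)
        indices = weighted_bootstrap_indices(weights, rng=0)   # rows to use when refitting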

    Learning de-biased regression trees and forests from complex samples


    Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision

    We provide a theoretical and computational investigation of the Gamma-Maximin method with soft revision, which was recently proposed as a robust criterion for pseudo-label selection (PLS) in semi-supervised learning. As opposed to traditional methods for PLS, we use credal sets of priors ("generalized Bayes") to represent the epistemic modeling uncertainty. The latter are then updated by the Gamma-Maximin method with soft revision. We eventually select pseudo-labeled data that are most likely in light of the least favorable distribution from the so-updated credal set. We formalize the task of finding optimal pseudo-labeled data w.r.t. the Gamma-Maximin method with soft revision as an optimization problem. A concrete implementation for the class of logistic models then allows us to compare the predictive power of the method with competing approaches. It is observed that the Gamma-Maximin method with soft revision can achieve very promising results, especially when the proportion of labeled data is low.
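
    A stylized sketch of the selection step, in which a finite set of candidate priors stands in for the credal set; all names and numbers are invented, and the paper's exact soft-revision rule and logistic-model implementation may differ.

        import numpy as np

        def soft_revision(marginal_likelihoods, beta=0.5):
            """Keep only priors whose marginal likelihood is at least beta times the best one."""
            ml = np.asarray(marginal_likelihoods, dtype=float)
            return np.flatnonzero(ml >= beta * ml.max())

        def gamma_maximin_choice(scores, kept_priors):
            """scores[i][j]: selection criterion for pseudo-labeled candidate j under prior i.
            Pick the candidate with the best worst-case score over the retained priors."""
            worst_case = np.asarray(scores, dtype=float)[kept_priors].min(axis=0)
            return int(worst_case.argmax())

        ml = [0.9, 0.2, 0.7]                     # marginal likelihoods of three candidate priors
        scores = [[0.8, 0.3, 0.6, 0.5],          # criterion values for four candidates ...
                  [0.1, 0.9, 0.4, 0.5],          # ... under each prior
                  [0.7, 0.2, 0.6, 0.6]]
        kept = soft_revision(ml, beta=0.5)         # here: priors 0 and 2 survive the revision
        best = gamma_maximin_choice(scores, kept)  # candidate 0 has the best worst case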

    In All Likelihoods: Robust Selection of Pseudo-Labeled Data

    Self-training is a simple yet effective method within semi-supervised learning. Self-training’s rationale is to iteratively enhance the training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (pseudo-label selection, PLS). In this paper, we render PLS more robust towards the involved modeling assumptions. To this end, we treat PLS as a decision problem, which allows us to introduce a generalized utility function. The idea is to select pseudo-labeled data that maximize a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors, and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian α-cut updating rule for credal sets. We spotlight the application of three of our robust extensions on simulated data as well as three real-world data sets. In a benchmarking study, we compare these extensions to traditional PLS methods. Results suggest that robustness with regard to model choice can lead to substantial accuracy gains.
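
    As a stylized reading of the generalized utility idea (the aggregation rule and notation are illustrative, not necessarily the paper's), each candidate pseudo-labeled point z can be scored by several utility components u_1, ..., u_K (for instance, its plausibility under different candidate models, or under a covariate-shift-corrected reweighting) and selected conservatively:

        z^\ast \in \arg\max_{z} \; \min_{k \in \{1, \dots, K\}} u_k(z \mid \mathcal{D}).

    Other aggregations, such as weighted sums, fit the same template; the worst-case form above is just one robust choice.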