1,219 research outputs found

    On kernel and feature learning in neural networks

    Get PDF
    Inspired by the theory of wide neural networks (NNs), kernel learning and feature learning have recently emerged as two paradigms through which we can understand the complex behaviours of large-scale deep learning systems in practice. In the literature, they are often portrayed as two opposing ends of a dichotomy, both with their own strengths and weaknesses: one, kernel learning, draws connections to well-studied machine learning techniques like kernel methods and Gaussian Processes, whereas the other, feature learning, promises to capture more of the rich, but yet unexplained, properties that are unique to NNs. In this thesis, we present three works studying properties of NNs that combine insights from both perspectives, highlighting not only their differences but also shared similarities. We start by reviewing relevant literature on the theory of deep learning, with a focus on the study of wide NNs. This provides context for a discussion of kernel and feature learning, and against this backdrop, we proceed to describe our contributions. First, we examine the relationship between ensembles of wide NNs and Bayesian inference using connections from kernel learning to Gaussian Processes, and propose a modification that accounts for missing variance at initialisation in NN functions, resulting in a Bayesian interpretation to our trained deep ensembles. Next, we combine kernel and feature learning to demonstrate the suitability of the feature kernel, i.e. the kernel induced by inner products over final layer NN features, as a target for knowledge distillation, where one seeks to use a powerful teacher model to improve the performance of a weaker student model. Finally, we explore the gap between collapsed and whitened features in self-supervised learning, highlighting the decay rate of eigenvalues in the feature kernel as a key quantity that bridges between this gap and impacts downstream generalisation performance, especially in settings with scarce labelled data. We conclude with a discussion, including limitations and future outlook, of our contributions

    Repulsive Deep Ensembles are Bayesian

    Full text link
    Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this does not only affect the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks

    Single Model Uncertainty Estimation via Stochastic Data Centering

    Full text link
    We are interested in estimating the uncertainties of deep neural networks, which play an important role in many scientific and engineering problems. In this paper, we present a striking new finding that an ensemble of neural networks with the same weight initialization, trained on datasets that are shifted by a constant bias gives rise to slightly inconsistent trained models, where the differences in predictions are a strong indicator of epistemic uncertainties. Using the neural tangent kernel (NTK), we demonstrate that this phenomena occurs in part because the NTK is not shift-invariant. Since this is achieved via a trivial input transformation, we show that it can therefore be approximated using just a single neural network -- using a technique that we call Δ−\Delta-UQ -- that estimates uncertainty around prediction by marginalizing out the effect of the biases. We show that Δ−\Delta-UQ's uncertainty estimates are superior to many of the current methods on a variety of benchmarks -- outlier rejection, calibration under distribution shift, and sequential design optimization of black box functions

    On the reversed bias-variance tradeoff in deep ensembles

    Full text link
    Deep ensembles aggregate predictions of diverse neural networks to improve generalisation and quantify uncertainty. Here, we investigate their behavior when increasing the ensemble mem- bers’ parameter size - a practice typically asso- ciated with better performance for single mod- els. We show that under practical assumptions in the overparametrized regime far into the dou- ble descent curve, not only the ensemble test loss degrades, but common out-of-distribution detec- tion and calibration metrics suffer as well. Rem- iniscent to deep double descent, we observe this phenomenon not only when increasing the single member’s capacity but also as we increase the training budget, suggesting deep ensembles can benefit from early stopping. This sheds light on the success and failure modes of deep ensembles and suggests that averaging finite width models perform better than the neural tangent kernel limit for these metrics

    Empirical analysis of representation learning and exploration in neural kernel bandits

    Full text link
    Neural bandits have been shown to provide an efficient solution to practical sequential decision tasks that have nonlinear reward functions. The main contributor to that success is approximate Bayesian inference, which enables neural network (NN) training with uncertainty estimates. However, Bayesian NNs often suffer from a prohibitive computational overhead or operate on a subset of parameters. Alternatively, certain classes of infinite neural networks were shown to directly correspond to Gaussian processes (GP) with neural kernels (NK). NK-GPs provide accurate uncertainty estimates and can be trained faster than most Bayesian NNs. We propose to guide common bandit policies with NK distributions and show that NK bandits achieve state-of-the-art performance on nonlinear structured data. Moreover, we propose a framework for measuring independently the ability of a bandit algorithm to learn representations and explore, and use it to analyze the impact of NK distributions w.r.t.~those two aspects. We consider policies based on a GP and a Student's t-process (TP). Furthermore, we study practical considerations, such as training frequency and model partitioning. We believe our work will help better understand the impact of utilizing NKs in applied settings.Comment: Extended version. Added a major experiment comparing NK distribution w.r.t. exploration and exploitation. Submitted to ICLR 202

    Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks

    Full text link
    Uncertainty quantification (UQ) is important for reliability assessment and enhancement of machine learning models. In deep learning, uncertainties arise not only from data, but also from the training procedure that often injects substantial noises and biases. These hinder the attainment of statistical guarantees and, moreover, impose computational challenges on UQ due to the need for repeated network retraining. Building upon the recent neural tangent kernel theory, we create statistically guaranteed schemes to principally \emph{quantify}, and \emph{remove}, the procedural uncertainty of over-parameterized neural networks with very low computation effort. In particular, our approach, based on what we call a procedural-noise-correcting (PNC) predictor, removes the procedural uncertainty by using only \emph{one} auxiliary network that is trained on a suitably labeled data set, instead of many retrained networks employed in deep ensembles. Moreover, by combining our PNC predictor with suitable light-computation resampling methods, we build several approaches to construct asymptotically exact-coverage confidence intervals using as low as four trained networks without additional overheads

    A Primer on Bayesian Neural Networks: Review and Debates

    Full text link
    Neural networks have achieved remarkable performance across various problem domains, but their widespread applicability is hindered by inherent limitations such as overconfidence in predictions, lack of interpretability, and vulnerability to adversarial attacks. To address these challenges, Bayesian neural networks (BNNs) have emerged as a compelling extension of conventional neural networks, integrating uncertainty estimation into their predictive capabilities. This comprehensive primer presents a systematic introduction to the fundamental concepts of neural networks and Bayesian inference, elucidating their synergistic integration for the development of BNNs. The target audience comprises statisticians with a potential background in Bayesian methods but lacking deep learning expertise, as well as machine learners proficient in deep neural networks but with limited exposure to Bayesian statistics. We provide an overview of commonly employed priors, examining their impact on model behavior and performance. Additionally, we delve into the practical considerations associated with training and inference in BNNs. Furthermore, we explore advanced topics within the realm of BNN research, acknowledging the existence of ongoing debates and controversies. By offering insights into cutting-edge developments, this primer not only equips researchers and practitioners with a solid foundation in BNNs, but also illuminates the potential applications of this dynamic field. As a valuable resource, it fosters an understanding of BNNs and their promising prospects, facilitating further advancements in the pursuit of knowledge and innovation.Comment: 65 page
    • …
    corecore