1,219 research outputs found
On kernel and feature learning in neural networks
Inspired by the theory of wide neural networks (NNs), kernel learning and feature learning have recently emerged as two paradigms through which we can understand the complex behaviours of large-scale deep learning systems in practice. In the literature, they are often portrayed as two opposing ends of a dichotomy, both with their own strengths and weaknesses: one, kernel learning, draws connections to well-studied machine learning techniques like kernel methods and Gaussian Processes, whereas the other, feature learning, promises to capture more of the rich, but yet unexplained, properties that are unique to NNs.
In this thesis, we present three works studying properties of NNs that combine insights from both perspectives, highlighting not only their differences but also shared similarities. We start by reviewing relevant literature on the theory of deep learning, with a focus on the study of wide NNs. This provides context for a discussion of kernel and feature learning, and against this backdrop, we proceed to describe our contributions. First, we examine the relationship between ensembles of wide NNs and Bayesian inference using connections from kernel learning to Gaussian Processes, and propose a modification that accounts for missing variance at initialisation in NN functions, resulting in a Bayesian interpretation to our trained deep ensembles. Next, we combine kernel and feature learning to demonstrate the suitability of the feature kernel, i.e. the kernel induced by inner products over final layer NN features, as a target for knowledge distillation, where one seeks to use a powerful teacher model to improve the performance of a weaker student model. Finally, we explore the gap between collapsed and whitened features in self-supervised learning, highlighting the decay rate of eigenvalues in the feature kernel as a key quantity that bridges between this gap and impacts downstream generalisation performance, especially in settings with scarce labelled data. We conclude with a discussion, including limitations and future outlook, of our contributions
Repulsive Deep Ensembles are Bayesian
Deep ensembles have recently gained popularity in the deep learning community
for their conceptual simplicity and efficiency. However, maintaining functional
diversity between ensemble members that are independently trained with gradient
descent is challenging. This can lead to pathologies when adding more ensemble
members, such as a saturation of the ensemble performance, which converges to
the performance of a single model. Moreover, this does not only affect the
quality of its predictions, but even more so the uncertainty estimates of the
ensemble, and thus its performance on out-of-distribution data. We hypothesize
that this limitation can be overcome by discouraging different ensemble members
from collapsing to the same function. To this end, we introduce a kernelized
repulsive term in the update rule of the deep ensembles. We show that this
simple modification not only enforces and maintains diversity among the members
but, even more importantly, transforms the maximum a posteriori inference into
proper Bayesian inference. Namely, we show that the training dynamics of our
proposed repulsive ensembles follow a Wasserstein gradient flow of the KL
divergence with the true posterior. We study repulsive terms in weight and
function space and empirically compare their performance to standard ensembles
and Bayesian baselines on synthetic and real-world prediction tasks
Single Model Uncertainty Estimation via Stochastic Data Centering
We are interested in estimating the uncertainties of deep neural networks,
which play an important role in many scientific and engineering problems. In
this paper, we present a striking new finding that an ensemble of neural
networks with the same weight initialization, trained on datasets that are
shifted by a constant bias gives rise to slightly inconsistent trained models,
where the differences in predictions are a strong indicator of epistemic
uncertainties. Using the neural tangent kernel (NTK), we demonstrate that this
phenomena occurs in part because the NTK is not shift-invariant. Since this is
achieved via a trivial input transformation, we show that it can therefore be
approximated using just a single neural network -- using a technique that we
call UQ -- that estimates uncertainty around prediction by
marginalizing out the effect of the biases. We show that UQ's
uncertainty estimates are superior to many of the current methods on a variety
of benchmarks -- outlier rejection, calibration under distribution shift, and
sequential design optimization of black box functions
On the reversed bias-variance tradeoff in deep ensembles
Deep ensembles aggregate predictions of diverse neural networks to improve generalisation and quantify uncertainty. Here, we investigate their behavior when increasing the ensemble mem- bers’ parameter size - a practice typically asso- ciated with better performance for single mod- els. We show that under practical assumptions in the overparametrized regime far into the dou- ble descent curve, not only the ensemble test loss degrades, but common out-of-distribution detec- tion and calibration metrics suffer as well. Rem- iniscent to deep double descent, we observe this phenomenon not only when increasing the single member’s capacity but also as we increase the training budget, suggesting deep ensembles can benefit from early stopping. This sheds light on the success and failure modes of deep ensembles and suggests that averaging finite width models perform better than the neural tangent kernel limit for these metrics
Empirical analysis of representation learning and exploration in neural kernel bandits
Neural bandits have been shown to provide an efficient solution to practical
sequential decision tasks that have nonlinear reward functions. The main
contributor to that success is approximate Bayesian inference, which enables
neural network (NN) training with uncertainty estimates. However, Bayesian NNs
often suffer from a prohibitive computational overhead or operate on a subset
of parameters. Alternatively, certain classes of infinite neural networks were
shown to directly correspond to Gaussian processes (GP) with neural kernels
(NK). NK-GPs provide accurate uncertainty estimates and can be trained faster
than most Bayesian NNs. We propose to guide common bandit policies with NK
distributions and show that NK bandits achieve state-of-the-art performance on
nonlinear structured data. Moreover, we propose a framework for measuring
independently the ability of a bandit algorithm to learn representations and
explore, and use it to analyze the impact of NK distributions w.r.t.~those two
aspects. We consider policies based on a GP and a Student's t-process (TP).
Furthermore, we study practical considerations, such as training frequency and
model partitioning. We believe our work will help better understand the impact
of utilizing NKs in applied settings.Comment: Extended version. Added a major experiment comparing NK distribution
w.r.t. exploration and exploitation. Submitted to ICLR 202
Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks
Uncertainty quantification (UQ) is important for reliability assessment and
enhancement of machine learning models. In deep learning, uncertainties arise
not only from data, but also from the training procedure that often injects
substantial noises and biases. These hinder the attainment of statistical
guarantees and, moreover, impose computational challenges on UQ due to the need
for repeated network retraining. Building upon the recent neural tangent kernel
theory, we create statistically guaranteed schemes to principally
\emph{quantify}, and \emph{remove}, the procedural uncertainty of
over-parameterized neural networks with very low computation effort. In
particular, our approach, based on what we call a procedural-noise-correcting
(PNC) predictor, removes the procedural uncertainty by using only \emph{one}
auxiliary network that is trained on a suitably labeled data set, instead of
many retrained networks employed in deep ensembles. Moreover, by combining our
PNC predictor with suitable light-computation resampling methods, we build
several approaches to construct asymptotically exact-coverage confidence
intervals using as low as four trained networks without additional overheads
A Primer on Bayesian Neural Networks: Review and Debates
Neural networks have achieved remarkable performance across various problem
domains, but their widespread applicability is hindered by inherent limitations
such as overconfidence in predictions, lack of interpretability, and
vulnerability to adversarial attacks. To address these challenges, Bayesian
neural networks (BNNs) have emerged as a compelling extension of conventional
neural networks, integrating uncertainty estimation into their predictive
capabilities.
This comprehensive primer presents a systematic introduction to the
fundamental concepts of neural networks and Bayesian inference, elucidating
their synergistic integration for the development of BNNs. The target audience
comprises statisticians with a potential background in Bayesian methods but
lacking deep learning expertise, as well as machine learners proficient in deep
neural networks but with limited exposure to Bayesian statistics. We provide an
overview of commonly employed priors, examining their impact on model behavior
and performance. Additionally, we delve into the practical considerations
associated with training and inference in BNNs.
Furthermore, we explore advanced topics within the realm of BNN research,
acknowledging the existence of ongoing debates and controversies. By offering
insights into cutting-edge developments, this primer not only equips
researchers and practitioners with a solid foundation in BNNs, but also
illuminates the potential applications of this dynamic field. As a valuable
resource, it fosters an understanding of BNNs and their promising prospects,
facilitating further advancements in the pursuit of knowledge and innovation.Comment: 65 page
- …