Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice
Meta-Learning aims to speed up the learning process on new tasks by acquiring
useful inductive biases from datasets of related learning tasks. While, in
practice, the number of related tasks available is often small, most existing
approaches assume an abundance of tasks, making them unrealistic and
prone to overfitting. A central question in the meta-learning literature is how
to regularize to ensure generalization to unseen tasks. In this work, we
provide a theoretical analysis using PAC-Bayesian theory and present a
generalization bound for meta-learning, which was first derived by Rothfuss et
al. (2021). Crucially, the bound allows us to derive the closed form of the
optimal hyper-posterior, referred to as PACOH, which leads to the best
performance guarantees. Through a theoretical analysis and an empirical case
study, we examine under which conditions, and to what extent, these
meta-learning guarantees improve upon PAC-Bayesian per-task learning bounds. The
closed-form PACOH inspires a practical meta-learning approach that avoids the
reliance on bi-level optimization, giving rise to a stochastic optimization
problem that is amenable to standard variational methods that scale well. Our
experiments show that, when instantiating the PACOH with Gaussian process and
Bayesian neural network models, the resulting methods are more scalable and
yield state-of-the-art performance in terms of both predictive accuracy and
the quality of uncertainty estimates.
Comment: 61 pages. arXiv admin note: text overlap with arXiv:2002.0555
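The closed-form hyper-posterior described above can be illustrated with a toy sketch. Assuming a finite set of candidate priors, a Gibbs-style re-weighting multiplies the hyper-prior by exponentiated per-task evidence scores; the function name, the inverse-temperature `beta`, and all numbers below are hypothetical illustrations, not the paper's actual formulation over continuous prior families.

```python
import numpy as np

def gibbs_hyper_posterior(log_hyper_prior, task_log_evidences, beta=1.0):
    """Toy Gibbs re-weighting over a finite set of K candidate priors.

    log_hyper_prior:     (K,) log-weights of the hyper-prior
    task_log_evidences:  (n_tasks, K) per-task log marginal likelihoods
    beta:                inverse temperature (hypothetical knob)
    """
    scores = log_hyper_prior + beta * task_log_evidences.sum(axis=0)
    scores -= scores.max()          # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()  # normalized hyper-posterior weights

# hypothetical example: 3 candidate priors, 2 meta-training tasks
lp = np.log(np.array([1 / 3, 1 / 3, 1 / 3]))   # uniform hyper-prior
ev = np.array([[-2.0, -1.0, -3.0],
               [-2.5, -0.5, -3.5]])            # made-up evidence values
q = gibbs_hyper_posterior(lp, ev)
```

Priors that explain the observed tasks well (higher evidence) receive exponentially more hyper-posterior mass, which is the intuition behind the closed-form PACOH solution.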
Lifelong Bandit Optimization: No Prior and No Regret
In practical applications, machine learning algorithms are often repeatedly
applied to problems with similar structure. We focus on
solving a sequence of bandit optimization tasks and develop LiBO, an algorithm
which adapts to the environment by learning from past experience and becoming
more sample-efficient in the process. We assume a kernelized structure where
the kernel is unknown but shared across all tasks. LiBO sequentially
meta-learns a kernel that approximates the true kernel and simultaneously
solves the incoming tasks with the latest kernel estimate. Our algorithm can be
paired with any kernelized bandit algorithm and guarantees oracle optimal
performance, meaning that as more tasks are solved, the regret of LiBO on each
task converges to the regret of the bandit algorithm with oracle knowledge of
the true kernel. Naturally, if paired with a sublinear bandit algorithm, LiBO
yields a sublinear lifelong regret. We also show that direct access to the data
from each task is not necessary for attaining sublinear regret. The lifelong
problem can thus be solved in a federated manner, while keeping the data of
each task private.
Comment: 32 pages, 6 figures, preprint
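The kernel meta-learning step at the heart of the above can be sketched in a simplified form. Here a shared RBF lengthscale stands in for the unknown kernel, and a grid search over summed GP log evidence across past tasks stands in for LiBO's actual meta-learning procedure; the function names and the grid are assumptions for illustration, and the selected kernel would then be handed to any kernelized bandit algorithm (e.g. GP-UCB) for the next task.

```python
import numpy as np

def rbf(X1, X2, ls):
    """RBF kernel matrix between rows of X1 and X2 with lengthscale ls."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls ** 2)

def gp_log_marginal(X, y, ls, noise=0.1):
    """GP log marginal likelihood (up to an additive constant)."""
    K = rbf(X, X, ls) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum()

def meta_learn_lengthscale(past_tasks, grid):
    """Pick the lengthscale maximizing total GP evidence over past tasks."""
    scores = [sum(gp_log_marginal(X, y, ls) for X, y in past_tasks)
              for ls in grid]
    return grid[int(np.argmax(scores))]

# hypothetical past tasks sharing a smooth structure
X = np.linspace(0.0, 6.0, 20).reshape(-1, 1)
past_tasks = [(X, np.sin(X).ravel()), (X, np.cos(X).ravel())]
grid = [0.1, 1.0, 5.0]
ls_hat = meta_learn_lengthscale(past_tasks, grid)
```

As more solved tasks accumulate in `past_tasks`, the kernel estimate improves, which is what drives the convergence of per-task regret toward the oracle regret.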
Instance-Dependent Generalization Bounds via Optimal Transport
Existing generalization bounds fail to explain crucial factors that drive
generalization of modern neural networks. Since such bounds often hold
uniformly over all parameters, they suffer from over-parametrization, and fail
to account for the strong inductive bias of initialization and stochastic
gradient descent. As an alternative, we propose a novel optimal transport
interpretation of the generalization problem. This allows us to derive
instance-dependent generalization bounds that depend on the local Lipschitz
regularity of the learned prediction function in the data space. Therefore, our
bounds are agnostic to the parametrization of the model and work well when the
number of training samples is much smaller than the number of parameters. With
small modifications, our approach yields accelerated rates for data on
low-dimensional manifolds, and guarantees under distribution shifts. We
empirically analyze our generalization bounds for neural networks, showing that
the bound values are meaningful and capture the effect of popular
regularization methods during training.
Comment: 50 pages, 7 figures
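The local Lipschitz regularity that these bounds depend on can be probed numerically. The sketch below estimates a local Lipschitz constant of a learned function around each data point by finite differences along random directions; this is a generic estimator for illustration, not the paper's estimation procedure, and the test function and step size are hypothetical.

```python
import numpy as np

def local_lipschitz(f, X, eps=1e-2, n_dirs=8, seed=0):
    """Estimate a local Lipschitz constant of scalar-valued f around each
    row of X by probing n_dirs random unit directions at radius eps."""
    rng = np.random.default_rng(seed)
    ests = []
    for x in X:
        dirs = rng.normal(size=(n_dirs, x.shape[0]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        diffs = [abs(f(x + eps * u) - f(x)) / eps for u in dirs]
        ests.append(max(diffs))  # largest observed difference quotient
    return np.array(ests)

# hypothetical example: f(x) = sin(3 * x0) has Lipschitz constant 3
f = lambda x: np.sin(3.0 * x[0])
X = np.array([[0.0], [1.0], [2.0]])
L_hat = local_lipschitz(f, X)
```

Points where the learned function is locally flat contribute small estimates, which is why instance-dependent bounds of this kind can stay meaningful even for heavily over-parametrized models.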
Noise Regularization for Conditional Density Estimation
Modelling statistical relationships beyond the conditional mean is crucial in
many settings. Conditional density estimation (CDE) aims to learn the full
conditional probability density from data. Though highly expressive, neural
network based CDE models can suffer from severe over-fitting when trained with
the maximum likelihood objective. Due to the inherent structure of such models,
classical regularization approaches in the parameter space are rendered
ineffective. To address this issue, we develop a model-agnostic noise
regularization method for CDE that adds random perturbations to the data during
training. We demonstrate that the proposed approach corresponds to a smoothness
regularization and prove its asymptotic consistency. In our experiments, noise
regularization significantly and consistently outperforms other regularization
methods across seven data sets and three CDE models. The effectiveness of noise
regularization makes neural network based CDE the preferable method over
previous non- and semi-parametric approaches, even when training data is
scarce.
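Because the method is model-agnostic, the core of the noise regularization above reduces to perturbing each training batch before the likelihood step. The sketch below shows that step; the noise scales, the helper name, and the commented training loop are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def noise_regularized_batch(X, y, sigma_x=0.1, sigma_y=0.1, rng=None):
    """Model-agnostic noise regularization: perturb inputs and targets
    with small Gaussian noise before each maximum-likelihood step."""
    if rng is None:
        rng = np.random.default_rng()
    X_noisy = X + sigma_x * rng.normal(size=X.shape)
    y_noisy = y + sigma_y * rng.normal(size=y.shape)
    return X_noisy, y_noisy

# usage inside a (hypothetical) CDE training loop:
# for X, y in batches:
#     Xn, yn = noise_regularized_batch(X, y)
#     loss = -model.log_prob(yn, cond=Xn).mean()  # noisy MLE step

# minimal check of the perturbation itself
X = np.zeros((4, 2))
y = np.zeros((4, 1))
rng = np.random.default_rng(0)
Xn, yn = noise_regularized_batch(X, y, rng=rng)
```

Smoothing the data distribution in this way acts like the smoothness regularizer the abstract describes: the model can no longer fit sharp density spikes at individual training points.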