PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales
While PAC-Bayes is now an established learning framework for light-tailed
losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case
of heavy-tailed losses remains largely uncharted and has attracted growing
interest in recent years. We contribute PAC-Bayes generalisation bounds for
heavy-tailed losses under the sole assumption of bounded variance of the loss
function. Under that assumption, we extend previous results from
\citet{kuzborskij2019efron}. Our key technical contribution is exploiting an
extension of Markov's inequality for supermartingales. Our proof technique
unifies and extends different PAC-Bayesian frameworks by providing bounds for
unbounded martingales as well as bounds for batch and online learning with
heavy-tailed losses.
Comment: New Section 3 on Online PAC-Bayes
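The "extension of Markov's inequality for supermartingales" invoked here is commonly known as Ville's inequality. As a standard statement (the generic form, not the paper's specific variant), it reads:

```latex
% Ville's inequality: for a non-negative supermartingale (M_t) with M_0 = 1,
% the deviation is controlled uniformly over all times simultaneously:
\[
  \Pr\!\left[\,\exists\, t \ge 0 : M_t \ge \tfrac{1}{\delta}\,\right] \le \delta
  \qquad \text{for all } \delta \in (0, 1].
\]
% At a single fixed time t, Markov's inequality gives P[M_t >= 1/delta] <= delta;
% the supermartingale property upgrades this to hold for all t at once.
```

This uniform-in-time control is what allows a single high-probability event to cover an entire online sequence of bounds.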
The Limits of Post-Selection Generalization
While statistics and machine learning offer numerous methods for ensuring
generalization, these methods often fail in the presence of adaptivity---the
common practice in which the choice of analysis depends on previous
interactions with the same dataset. A recent line of work has introduced
powerful, general-purpose algorithms that ensure post hoc generalization (also
called robust or post-selection generalization), which says that, given the
output of the algorithm, it is hard to find any statistic for which the data
differs significantly from the population it came from.
In this work we show several limitations on the power of algorithms
satisfying post hoc generalization. First, we show a tight lower bound on the
error of any algorithm that satisfies post hoc generalization and answers
adaptively chosen statistical queries, showing a strong barrier to progress in
post selection data analysis. Second, we show that post hoc generalization is
not closed under composition, despite many examples of such algorithms
exhibiting strong composition properties.
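A minimal numerical illustration of the adaptivity problem described above (a standard textbook-style construction, not this paper's lower bound): an analyst first queries each attribute's empirical correlation with a label, then uses those answers to choose a final query whose empirical value is far from its true population value of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 1000  # n samples, d attributes

# Population truth: every attribute is an unbiased coin flip, independent
# of the label, so every correlation query has true population answer 0.
X = rng.choice([-1.0, 1.0], size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Round 1: query the empirical correlation of each attribute with y.
emp_corr = X.T @ y / n

# Round 2 (adaptive): pick signs based on the round-1 answers, then ask for
# the correlation of the sign-weighted average attribute with y.
signs = np.sign(emp_corr)
final_answer = np.mean(y * (X @ signs) / d)   # empirical answer ~ sqrt(2/(pi*n))
```

Although the true answer to the final query is 0, its empirical value concentrates around `sqrt(2/(pi*n)) ≈ 0.08`, because the query was selected using the same data it is evaluated on.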
On the sub-Gaussianity of the Beta and Dirichlet distributions
We obtain the optimal proxy variance for the sub-Gaussianity of Beta
distribution, thus proving upper bounds recently conjectured by Elder (2016).
We provide different proof techniques for the symmetrical (around its mean)
case and the non-symmetrical case. The technique in the latter case relies on
studying the ordinary differential equation satisfied by the Beta
moment-generating function known as the confluent hypergeometric function. As a
consequence, we derive the optimal proxy variance for the Dirichlet
distribution, which is apparently a novel result. We also provide a new proof
of the optimal proxy variance for the Bernoulli distribution, and discuss in
this context the proxy variance relation to log-Sobolev inequalities and
transport inequalities.
Comment: 13 pages, 2 figures
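The Bernoulli case mentioned above can be checked numerically: the optimal proxy variance is the supremum over $\lambda$ of $2\log\mathbb{E}[e^{\lambda(X-\mu)}]/\lambda^2$, and the closed form below is the classical Buldygin--Moskvichova/Kearns--Saul expression (a sketch for illustration, not the paper's proof):

```python
import numpy as np

def opt_proxy_variance(p):
    """Optimal sub-Gaussian proxy variance of a Bernoulli(p) variable."""
    if abs(p - 0.5) < 1e-9:
        return 0.25  # limiting value at p = 1/2
    return (1.0 - 2.0 * p) / (2.0 * np.log((1.0 - p) / p))

def numerical_proxy_variance(p, lams):
    """sup over lambda of 2*logMGF(lambda)/lambda^2 for the centered Bernoulli."""
    log_mgf = np.log((1 - p) * np.exp(-lams * p) + p * np.exp(lams * (1 - p)))
    return np.max(2.0 * log_mgf / lams**2)

p = 0.1
lams = np.linspace(0.05, 10.0, 4000)  # the sup is attained at lambda = 2*log((1-p)/p)
closed = opt_proxy_variance(p)        # ~0.182 for p = 0.1
numeric = numerical_proxy_variance(p, lams)
```

The grid search recovers the closed form because the Kearns--Saul bound is tight at $\lambda^\ast = 2\log((1-p)/p)$, which lies inside the scanned range.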
Bandit optimisation of functions in the Mat\'ern kernel RKHS
We consider the problem of optimising functions in the reproducing kernel
Hilbert space (RKHS) of a Mat\'ern kernel with smoothness parameter $\nu$ over
the domain $[0,1]^d$ under noisy bandit feedback. Our contribution, the
$\pi$-GP-UCB algorithm, is the first practical approach with guaranteed
sublinear regret for all $\nu > 1$ and $d \geq 1$. Empirical validation suggests
better performance and drastically improved computational scalability compared
with its predecessor, Improved GP-UCB.
Comment: AISTATS 2020, camera ready
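A minimal GP-UCB loop with a Mat\'ern-5/2 kernel on $[0,1]$, to make the setting concrete (a generic sketch of the GP-UCB family, not the paper's modified algorithm; the length-scale, noise level, and confidence weight `beta` are illustrative choices):

```python
import numpy as np

def matern52(a, b, ls=0.2):
    """Matern-5/2 kernel matrix between 1-D point sets a and b."""
    r = np.abs(a[:, None] - b[None, :]) / ls
    return (1.0 + np.sqrt(5) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5) * r)

def gp_ucb(f, T=30, noise=0.1, beta=2.0, seed=0):
    """Optimise f on [0,1] from noisy evaluations via upper confidence bounds."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)
    X = [rng.uniform()]
    Y = [f(X[0]) + noise * rng.normal()]
    for _ in range(T - 1):
        Xa, Ya = np.array(X), np.array(Y)
        K = matern52(Xa, Xa) + noise**2 * np.eye(len(Xa))
        Ks = matern52(grid, Xa)
        Kinv = np.linalg.inv(K)
        mu = Ks @ (Kinv @ Ya)                               # posterior mean on grid
        var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)  # posterior variance
        ucb = mu + np.sqrt(beta * np.clip(var, 1e-12, None))
        x_next = grid[int(np.argmax(ucb))]                  # optimistic query point
        X.append(x_next)
        Y.append(f(x_next) + noise * rng.normal())
    return np.array(X), np.array(Y)

f = lambda x: -(x - 0.7) ** 2  # unknown objective, maximised at x = 0.7
X, Y = gp_ucb(f)
```

The optimism principle drives the trade-off: unexplored regions keep a high posterior variance and hence a high upper confidence bound, so the loop alternates between space-filling exploration and exploitation near the current best.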
Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning
In this paper, we use tools from rate-distortion theory to establish new
upper bounds on the generalization error of statistical distributed learning
algorithms. Specifically, there are $K$ clients whose individually chosen
models are aggregated by a central server. The bounds depend on the
compressibility of each client's algorithm while keeping other clients'
algorithms un-compressed, and leverage the fact that small changes in each
local model change the aggregated model by a factor of only $1/K$. Adopting a
recently proposed approach by Sefidgaran et al., and extending it suitably to
the distributed setting, this enables smaller rate-distortion terms which are
shown to translate into tighter generalization bounds. The bounds are then
applied to the distributed support vector machines (SVM), suggesting that the
generalization error of the distributed setting decays faster than that of the
centralized one. This finding is also validated experimentally. A similar
conclusion is obtained for a
multiple-round federated learning setup where each client uses stochastic
gradient Langevin dynamics (SGLD).
Comment: Accepted at NeurIPS 202
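The stability fact exploited above, that perturbing one client's model moves the aggregate of $K$ models by only a $1/K$ fraction of the perturbation, can be seen directly when the server-side aggregation is plain averaging (a toy check; averaging is assumed here for illustration):

```python
import numpy as np

K, dim = 10, 5
rng = np.random.default_rng(0)
models = rng.normal(size=(K, dim))  # one parameter vector per client

aggregate = models.mean(axis=0)     # server-side averaging

# Perturb a single client's model by delta.
delta = rng.normal(size=dim)
perturbed = models.copy()
perturbed[3] += delta
aggregate2 = perturbed.mean(axis=0)

# The aggregated model moves by exactly delta / K.
shift = aggregate2 - aggregate
```

This dampening is what makes the per-client rate-distortion terms small: each client's compression error enters the aggregated model scaled down by the number of clients.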
Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms
We consider stochastic approximations of sampling algorithms, such as
Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM)
for Interacting Particle Dynamics (IPD). We observe that the noise introduced by
the stochastic approximation is nearly Gaussian due to the Central Limit
Theorem (CLT) while the driving Brownian motion is exactly Gaussian. We harness
this structure to absorb the stochastic approximation error inside the
diffusion process, and obtain improved convergence guarantees for these
algorithms. For SGLD, we prove the first stable convergence rate in KL
divergence without requiring uniform warm start, assuming the target density
satisfies a Log-Sobolev Inequality. Our result implies superior first-order
oracle complexity compared to prior works, under significantly milder
assumptions. We also prove the first guarantees for SGLD under even weaker
conditions such as H\"{o}lder smoothness and the Poincar\'e inequality, thus bridging
the gap between the state-of-the-art guarantees for LMC and SGLD. Our analysis
motivates a new algorithm called covariance correction, which corrects for the
additional noise introduced by the stochastic approximation by rescaling the
strength of the diffusion. Finally, we apply our techniques to analyze RBM, and
significantly improve upon the guarantees in prior works (such as removing
exponential dependence on horizon), under minimal assumptions.
Comment: Version 2 considers more results, including those for stochastic
gradient Langevin dynamics and the random batch method for interacting
particle dynamics, along with the results in the previous version. This also
contains 2 additional authors
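A minimal SGLD sketch on a toy Gaussian-mean target makes the stochastic approximation concrete (illustrative step size, batch size, and target; this is the plain algorithm and does not implement the covariance correction proposed above):

```python
import numpy as np

def sgld(grad_logp, data, x0, step, n_iter, batch, rng):
    """Stochastic Gradient Langevin Dynamics with minibatch gradient estimates."""
    x, n, chain = float(x0), len(data), []
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch, replace=False)
        g = (n / batch) * grad_logp(x, data[idx])  # unbiased full-gradient estimate
        x = x + step * g + np.sqrt(2.0 * step) * rng.normal()
        chain.append(x)
    return np.array(chain)

# Toy target: posterior of a Gaussian mean under a flat prior, for which
# grad log p(x | data) = sum_i (d_i - x); the posterior mean is the sample mean.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=200)
chain = sgld(lambda x, b: np.sum(b - x), data, x0=0.0,
             step=1e-3, n_iter=2000, batch=32, rng=rng)
```

The minibatch term `g` is exactly the "nearly Gaussian by the CLT" noise source the abstract refers to: its error relative to the full gradient is a sum of many weakly dependent terms, while the injected `sqrt(2*step)` term is exactly Gaussian.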