Nonparametric Linear Feature Learning in Regression Through Regularisation
Representation learning plays a crucial role in automated feature selection,
particularly in the context of high-dimensional data, where non-parametric
methods often struggle. In this study, we focus on supervised learning
scenarios where the pertinent information resides within a lower-dimensional
linear subspace of the data, namely the multi-index model. If this subspace
were known, it would greatly enhance prediction, computation, and
interpretation. To address this challenge, we propose a novel method for linear
feature learning with non-parametric prediction, which simultaneously estimates
the prediction function and the linear subspace. Our approach employs empirical
risk minimisation, augmented with a penalty on function derivatives, ensuring
versatility. Leveraging the orthogonality and rotation invariance properties of
Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising
alternating minimisation, we iteratively rotate the data to improve alignment
with leading directions and accurately estimate the relevant dimension in
practical settings. We establish that our method yields a consistent estimator
of the prediction function with explicit rates. Additionally, we provide
empirical results demonstrating the performance of RegFeaL in various
experiments.
Comment: 42 pages, 5 figures
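The multi-index model above posits that the response depends on the covariates only through a low-dimensional linear projection, y ≈ g(Pᵀx). The sketch below is illustrative only: it recovers such a subspace with the classical expected-gradient-outer-product estimator, using the analytic gradient of a known link function, whereas RegFeaL fits the function nonparametrically with a derivative penalty. All names and parameter choices are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 10, 5000, 2                 # ambient dim, samples, subspace dim

# Hidden relevant subspace: y depends on x only through P.T @ x
P, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal d x k basis
X = rng.standard_normal((n, d))
Z = X @ P
y = np.sin(Z[:, 0]) + Z[:, 1] ** 2                 # link function g

# Expected-gradient-outer-product estimate of the subspace (analytic
# gradients of g are used here purely for illustration)
G = np.stack([np.cos(Z[:, 0]), 2.0 * Z[:, 1]], axis=1) @ P.T   # n x d
M = G.T @ G / n
eigvecs = np.linalg.eigh(M)[1]
P_hat = eigvecs[:, -k:]               # span of the top-k eigenvectors

# Principal-angle check: singular values near 1 mean the spans coincide
overlap = np.linalg.svd(P.T @ P_hat, compute_uv=False)
print(overlap)
```

If the estimated subspace is correct, prediction reduces to a k-dimensional nonparametric regression, which is the computational and statistical gain the abstract alludes to.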
Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
A recent line of empirical studies has demonstrated that SGD might exhibit a
heavy-tailed behavior in practical settings, and the heaviness of the tails
might correlate with the overall performance. In this paper, we investigate the
emergence of such heavy tails. To our knowledge, previous works on this
problem considered only online (also called single-pass) SGD, in which the
emergence of heavy tails in theoretical findings is contingent upon access to
an infinite amount of data. Hence, the underlying mechanism generating the
reported heavy-tailed behavior in practical settings, where the amount of
training data is finite, is still not well understood. Our contribution aims to
fill this gap. In particular, we show that the stationary distribution of
offline (also called multi-pass) SGD exhibits 'approximate' power-law tails and
the approximation error is controlled by how fast the empirical distribution of
the training data converges to the true underlying data distribution in the
Wasserstein metric. Our main takeaway is that, as the number of data points
increases, offline SGD will behave increasingly 'power-law-like'. To achieve
this result, we first prove nonasymptotic Wasserstein convergence bounds for
offline SGD to online SGD as the number of data points increases, which can be
interesting on their own. Finally, we illustrate our theory on various
experiments conducted on synthetic data and neural networks.
Comment: In Neural Information Processing Systems (NeurIPS), Spotlight Presentation, 202
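The phenomenon is easy to reproduce in the simplest setting: multi-pass SGD on one-dimensional linear regression with a finite dataset and a large step size. The toy sketch below (our construction, not the paper's analysis; the step size and the Hill tail-index estimator are our choices) resamples the same finite dataset and inspects the tail of the iterate distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                  # small, finite training set
x = rng.standard_normal(n)
y = x + 0.5 * rng.standard_normal(n)     # 1-d linear regression data

eta, iters, burn = 0.8, 50_000, 5_000    # deliberately large step size
w = 0.0
trace = np.empty(iters)
for t in range(iters):
    i = rng.integers(0, n)               # multi-pass: resample the same data
    w -= eta * x[i] * (x[i] * w - y[i])  # single-sample SGD step
    trace[t] = w

# Hill estimator of the tail index on the largest order statistics;
# a smaller index means a heavier tail (index < 2: infinite variance).
a = np.sort(np.abs(trace[burn:]))
k = 500
alpha_hat = 1.0 / np.mean(np.log(a[-k:] / a[-k]))
print(f"Hill tail-index estimate: {alpha_hat:.2f}")
```

With n = 100 the empirical distribution is already close to the Gaussian data distribution, so per the paper's result the stationary law of these iterates should be approximately power-law, which the tail-index estimate reflects.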
SGD with Clipping is Secretly Estimating the Median Gradient
There are several applications of stochastic optimization where one can
benefit from a robust estimate of the gradient: for example, distributed
learning with corrupted nodes, large outliers in the training data, learning
under privacy constraints, or heavy-tailed noise arising from the dynamics of
the algorithm itself. Here we study SGD with
robust gradient estimators based on estimating the median. We first consider
computing the median gradient across samples, and show that the resulting
method can converge even under heavy-tailed, state-dependent noise. We then
derive iterative methods based on the stochastic proximal point method for
computing the geometric median and generalizations thereof. Finally we propose
an algorithm estimating the median gradient across iterations, and find that
several well-known methods - in particular, different forms of clipping - are
particular cases of this framework.
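The first idea - the median gradient across samples - can be illustrated in a few lines. The sketch below (our toy construction; it uses a coordinate-wise median rather than the geometric median the paper also studies) plants gross label outliers and compares plain minibatch SGD with its median-aggregated counterpart.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.standard_normal((n, d))
w_true = np.ones(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
y[: n // 10] += 500.0                    # 10% grossly corrupted labels

def sgd(agg, eta=0.05, batch=32, iters=2000):
    """SGD where `agg` aggregates the per-sample gradients of a minibatch."""
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch)
        per_sample = X[idx] * (X[idx] @ w - y[idx])[:, None]   # batch x d
        w = w - eta * agg(per_sample)
    return w

w_mean = sgd(lambda g: g.mean(axis=0))          # plain minibatch SGD
w_med = sgd(lambda g: np.median(g, axis=0))     # coordinate-wise median
print(np.linalg.norm(w_mean - w_true), np.linalg.norm(w_med - w_true))
```

Because each minibatch contains only a few corrupted samples, the coordinate-wise median ignores them, whereas the mean gradient is dragged far from the ground truth.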
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for
deriving generalization bounds for practical algorithms. The last decade has
witnessed an increasing number of stability bounds for different algorithms
applied on different classes of loss functions. While these bounds have
illuminated various properties of optimization algorithms, the analysis of each
case typically required a different proof technique with significantly
different mathematical tools. In this study, we make a novel connection between
learning theory and applied probability and introduce a unified guideline for
proving Wasserstein stability bounds for stochastic optimization algorithms. We
illustrate our approach on stochastic gradient descent (SGD) and we obtain
time-uniform stability bounds (i.e., the bound does not increase with the
number of iterations) for strongly convex losses and non-convex losses with
additive noise, where we recover similar results to the prior art or extend
them to more general cases by using a single proof technique. Our approach is
flexible and can be generalized to other popular optimizers, as it mainly
requires developing Lyapunov functions, which are often readily available in
the literature. It also illustrates that ergodicity is an important component
for obtaining time-uniform bounds -- which might not be achieved for convex or
non-convex losses unless additional noise is injected into the iterates. Finally,
we slightly stretch our analysis technique and prove time-uniform bounds for
SGD under convex and non-convex losses (without additional additive noise),
which, to our knowledge, is novel.
Comment: 49 pages, NeurIPS 202
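The Wasserstein-stability setup can be made concrete with a synchronous-coupling experiment: run noisy SGD with shared randomness on two datasets differing in a single point and track the distance between the two trajectories. This sketch is our illustration of the notion being bounded, not a proof device from the paper; the model, step size, and noise level are our choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)

# Neighbouring dataset: replace a single training point
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.standard_normal(d), 5.0

eta, sigma, iters = 0.05, 0.1, 3000
w1 = w2 = np.zeros(d)
dist = np.empty(iters)
for t in range(iters):
    i = rng.integers(0, n)                # shared sampling randomness
    xi = sigma * rng.standard_normal(d)   # shared additive (injected) noise
    w1 = w1 - eta * X[i] * (X[i] @ w1 - y[i]) + xi
    w2 = w2 - eta * X2[i] * (X2[i] @ w2 - y2[i]) + xi
    dist[t] = np.linalg.norm(w1 - w2)

# Time-uniform stability: the coupling distance stays bounded over time
# instead of growing with the number of iterations.
print(dist.max(), dist[-1])
```

The coupled distance upper-bounds the Wasserstein distance between the two algorithms' laws, which is exactly the quantity the stability bounds control.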
Efficient Bayesian Model Selection in PARAFAC via Stochastic Thermodynamic Integration
Parallel factor analysis (PARAFAC) is one of the most popular tensor factorization models. Even though it has proven successful in diverse application fields, the performance of PARAFAC usually hinges on the rank of the factorization, which is typically specified manually by the practitioner. In this study, we develop a novel parallel and distributed Bayesian model selection technique for rank estimation in large-scale PARAFAC models. The proposed approach integrates ideas from the emerging field of stochastic gradient Markov chain Monte Carlo, statistical physics, and distributed stochastic optimization. As opposed to the existing methods, which are based on heuristics, our method has a clear mathematical interpretation and significantly lower computational requirements, thanks to data subsampling and parallelization. We provide a formal theoretical analysis of the bias induced by the proposed approach. Our experiments on synthetic and large-scale real datasets show that our method is able to find the optimal model order while being significantly faster than the state of the art.
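For context, the PARAFAC/CP model writes a three-way tensor as a sum of R rank-one terms, T[i,j,k] = Σ_r A[i,r] B[j,r] C[k,r]. The sketch below fits the factors by classical alternating least squares at a manually fixed rank - precisely the manual choice the paper's Bayesian method automates - and is our minimal illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K, R = 8, 9, 10, 3

# Ground-truth rank-3 tensor: T[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
A, B, C = (rng.standard_normal((s, R)) for s in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

def khatri_rao(U, V):
    """Column-wise Kronecker product; row u*len(V)+v holds U[u]*V[v]."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

# Alternating least squares for the CP factors at a *given* rank R
Ah, Bh, Ch = (rng.standard_normal((s, R)) for s in (I, J, K))
for _ in range(200):
    Ah = np.reshape(T, (I, -1)) @ np.linalg.pinv(khatri_rao(Bh, Ch).T)
    Bh = np.reshape(np.moveaxis(T, 1, 0), (J, -1)) @ np.linalg.pinv(khatri_rao(Ah, Ch).T)
    Ch = np.reshape(np.moveaxis(T, 2, 0), (K, -1)) @ np.linalg.pinv(khatri_rao(Ah, Bh).T)

T_hat = np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Running ALS at a wrong rank either underfits or overfits; the paper's stochastic-thermodynamic-integration approach instead scores candidate ranks by their (approximate) marginal likelihood.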