Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains
We consider the minimization of an objective function given access to
unbiased estimates of its gradient through stochastic gradient descent (SGD)
with constant step-size. While the detailed analysis was only performed for
quadratic functions, we provide an explicit asymptotic expansion of the moments
of the averaged SGD iterates that outlines the dependence on initial
conditions, the effect of noise and the step-size, as well as the lack of
convergence in the general (non-quadratic) case. For this analysis, we bring
tools from Markov chain theory into the analysis of stochastic gradient. We
then show that Richardson-Romberg extrapolation may be used to get closer to
the global optimum, and we show empirical improvements of the new extrapolation
scheme.
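To make the extrapolation idea concrete, here is a minimal sketch (not the paper's code): run constant step-size SGD twice, with step sizes gamma and 2*gamma, average the iterates in each run, and combine the two averages as 2*avg_gamma - avg_2gamma so that the leading step-size bias term cancels. The toy gradient oracle and all constants below are illustrative assumptions.

```python
import numpy as np

def averaged_sgd(grad_fn, x0, step, n_iters, rng):
    """Constant step-size SGD; returns the Polyak-Ruppert average of the iterates."""
    x = x0.astype(float).copy()
    avg = np.zeros_like(x)
    for t in range(n_iters):
        x = x - step * grad_fn(x, rng)   # unbiased stochastic gradient step
        avg += (x - avg) / (t + 1)       # running average of the iterates
    return avg

def richardson_romberg_sgd(grad_fn, x0, step, n_iters, seed=0):
    """Combine averages obtained with step sizes gamma and 2*gamma;
    2*avg_gamma - avg_2gamma cancels the first-order bias in the step size."""
    avg_g = averaged_sgd(grad_fn, x0, step, n_iters, np.random.default_rng(seed))
    avg_2g = averaged_sgd(grad_fn, x0, 2 * step, n_iters, np.random.default_rng(seed + 1))
    return 2 * avg_g - avg_2g

# toy usage on a noisy quadratic with optimum at x_star (illustrative only)
x_star = np.array([1.0, -2.0, 0.5])
def grad_fn(x, rng):
    return (x - x_star) + 0.1 * rng.normal(size=x.shape)

x_rr = richardson_romberg_sgd(grad_fn, np.zeros(3), step=0.2, n_iters=50_000)
```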
Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees
We introduce a framework - Artemis - to tackle the problem of learning in a
distributed or federated setting with communication constraints and partial
device participation. Several workers (randomly sampled) perform the
optimization process using a central server to aggregate their computations. To
alleviate the communication cost, Artemis allows compressing the information
sent in both directions (from the workers to the server and conversely),
combined with a memory mechanism. It improves on existing algorithms that only
consider unidirectional compression (to the server), or use very strong
assumptions on the compression operator, and often do not take into account
partial device participation. We
provide fast rates of convergence (linear up to a threshold) under weak
assumptions on the stochastic gradients (noise variance bounded only at the
optimal point) in a non-i.i.d. setting, highlight the impact of memory for
unidirectional and bidirectional compression, and analyze Polyak-Ruppert
averaging. We use convergence in distribution to obtain a lower bound on the
asymptotic variance that highlights practical limits of compression. Finally,
we provide experimental results to demonstrate the validity of our analysis.
Comment: 56 pages, 4 theorems, 1 algorithm, source code on GitHub.
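As a rough illustration of the mechanism (not the actual Artemis implementation), the sketch below combines an unbiased rand-k compressor in both directions with a worker-side memory term; the compressor choice, the memory rate `alpha`, and the function names are assumptions made for the example.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def bidir_compressed_step(x, grad_fns, mem, step, k, alpha, rng):
    """One round of double compression with worker-side memory (Artemis-flavoured sketch).
    `grad_fns` are the stochastic-gradient callables of the sampled devices."""
    estimates = []
    for i, grad_fn in enumerate(grad_fns):
        # uplink: each worker compresses the difference to its memory
        delta = rand_k(grad_fn(x) - mem[i], k, rng)
        estimates.append(mem[i] + delta)   # server-side gradient estimate
        mem[i] += alpha * delta            # memory update reduces heterogeneity bias
    g_hat = np.mean(estimates, axis=0)
    # downlink: the server broadcasts a compressed model update
    update = rand_k(step * g_hat, k, rng)
    return x - update, mem

# toy usage: two workers with heterogeneous quadratic objectives (illustrative)
rng = np.random.default_rng(0)
grads = [lambda x: x - 1.0, lambda x: x + 1.0]
x, mem = np.zeros(10), [np.zeros(10) for _ in grads]
for _ in range(500):
    x, mem = bidir_compressed_step(x, grads, mem, step=0.1, k=3, alpha=0.5, rng=rng)
```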
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose
gradients are only accessible through a stochastic oracle that returns the
gradient at any given point plus a zero-mean finite variance random error. We
present the first algorithm that achieves jointly the optimal prediction error
rates for least-squares regression, both in terms of forgetting of initial
conditions in O(1/n^2), and in terms of dependence on the noise and dimension d
of the problem, as O(d/n). Our new algorithm is based on averaged accelerated
regularized gradient descent, and may also be analyzed through finer
assumptions on initial conditions and the Hessian matrix, leading to
dimension-free quantities that may still be small while the "optimal" terms
above are large. In order to characterize the tightness of these new bounds, we
consider an application to non-parametric regression and use the known lower
bounds on the statistical performance (without computational limits), which
happen to match our bounds obtained from a single pass on the data and thus
show the optimality of our algorithm in a wide variety of particular trade-offs
between bias and variance.
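As a loose illustration of the family of methods discussed (averaged accelerated stochastic gradient with optional regularization), here is a single-pass sketch on streaming least-squares data; the constant step size, momentum, and regularization values are placeholders and not the paper's tuned schedules.

```python
import numpy as np

def averaged_accelerated_lsr(data_stream, d, n, step, momentum, reg):
    """Sketch of averaged accelerated (regularized) stochastic gradient for least squares:
    Nesterov-style extrapolation plus Polyak-Ruppert averaging of the iterates."""
    x = np.zeros(d)
    y = np.zeros(d)
    avg = np.zeros(d)
    for t in range(n):
        a, b = next(data_stream)
        g = (a @ y - b) * a + reg * y    # stochastic gradient at the extrapolated point
        x_prev = x
        x = y - step * g
        y = x + momentum * (x - x_prev)  # extrapolation step
        avg += (x - avg) / (t + 1)       # averaging of the iterates
    return avg

# toy single-pass run with Gaussian features and noisy labels (illustrative)
rng = np.random.default_rng(0)
x_star = np.ones(5)
def stream():
    while True:
        a = rng.normal(size=5)
        yield a, a @ x_star + 0.1 * rng.normal()

x_hat = averaged_accelerated_lsr(stream(), d=5, n=20_000, step=0.01, momentum=0.9, reg=1e-6)
```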
Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning
In this paper, we investigate the impact of compression on stochastic
gradient algorithms for machine learning, a technique widely used in
distributed and federated learning. We underline differences in terms of
convergence rates between several unbiased compression operators, that all
satisfy the same condition on their variance, thus going beyond the classical
worst-case analysis. To do so, we focus on the case of least-squares regression
(LSR) and analyze a general stochastic approximation algorithm for minimizing
quadratic functions relying on a random field. We consider weak assumptions on
the random field, tailored to the analysis (specifically, expected Hölder
regularity), and on the noise covariance, enabling the analysis of various
randomizing mechanisms, including compression. We then extend our results to
the case of federated learning.
More formally, we highlight the impact on the convergence of the covariance C
of the additive noise induced by the algorithm.
We demonstrate that, despite the non-regularity of the stochastic field, the
limit variance term scales with Tr(C H^{-1})/K (where H is the Hessian of the
optimization problem and K the number of iterations), generalizing the rate for
the vanilla LSR case where it is σ^2 d/K (Bach and Moulines, 2013). Then, we
analyze the dependency of C on the compression strategy and ultimately its
impact on convergence, first in the centralized case, then in two heterogeneous
FL frameworks.
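To illustrate the point that compressors satisfying the same variance bound can induce very different noise covariances, here is a small numerical sketch comparing two standard unbiased compression operators (rand-k sparsification and a one-level QSGD-style quantizer); the operators and the test vector are generic examples, not the specific mechanisms analyzed in the paper.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def quantize(v, rng):
    """Unbiased one-level QSGD-style quantization: keep coordinate i with prob |v_i|/||v||."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    keep = rng.random(v.size) < np.abs(v) / norm
    return norm * np.sign(v) * keep

def empirical_noise_covariance(compressor, x, n_samples, rng):
    """Estimate Cov[C(x) - x], the additive-noise covariance the compressor induces."""
    samples = np.stack([compressor(x, rng) - x for _ in range(n_samples)])
    return np.cov(samples, rowvar=False)

rng = np.random.default_rng(0)
x = np.array([3.0, 1.0, 0.5, 0.1])
cov_randk = empirical_noise_covariance(lambda v, r: rand_k(v, 2, r), x, 20_000, rng)
cov_quant = empirical_noise_covariance(quantize, x, 20_000, rng)
# both operators are unbiased and can satisfy the same worst-case variance bound,
# yet their noise covariances (and hence the limit variance term) differ
```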
Provable non-accelerations of the heavy-ball method
In this work, we show that the heavy-ball (HB) method provably does not reach
an accelerated convergence rate on smooth strongly convex problems. More
specifically, we show that for any condition number and any choice of
algorithmic parameters, either the worst-case convergence rate of HB on the
class of L-smooth and μ-strongly convex quadratic functions is not accelerated
(that is, slower than the accelerated rate 1 - Θ(1/√κ), where κ = L/μ is the
condition number), or there exists an L-smooth, μ-strongly convex function and
an initialization such that the method does not converge.
To the best of our knowledge, this result closes a simple yet open question
on one of the most used and iconic first-order optimization techniques.
Our approach builds on finding functions for which HB fails to converge and
instead cycles over finitely many iterates. We analytically describe all
parametrizations of HB that exhibit this cycling behavior on a particular
cycle shape, whose choice is supported by a systematic and constructive
approach to the study of cycling behaviors of first-order methods. We show the
robustness of our results to perturbations of the cycle, and extend them to
classes of functions that also satisfy higher-order regularity conditions.
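For reference, the heavy-ball recursion discussed here is x_{k+1} = x_k - step * grad f(x_k) + momentum * (x_k - x_{k-1}); the sketch below implements it on a toy quadratic so that one can observe, for a given (step, momentum) pair, whether the iterates contract, oscillate, or stall. The parameter values are illustrative, not the cycling constructions from the paper.

```python
import numpy as np

def heavy_ball(grad, x0, step, momentum, n_iters):
    """Heavy-ball recursion: x_{k+1} = x_k - step*grad(x_k) + momentum*(x_k - x_{k-1})."""
    x_prev = x0.copy()
    x = x0.copy()
    traj = [x0.copy()]
    for _ in range(n_iters):
        x_next = x - step * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
        traj.append(x.copy())
    return np.array(traj)

# illustrative run on a strongly convex quadratic f(x) = 0.5 * x^T H x (condition number 10)
H = np.diag([1.0, 10.0])
traj = heavy_ball(lambda x: H @ x, np.array([1.0, 1.0]), step=0.1, momentum=0.8, n_iters=200)
# distance of the last iterates to the optimum 0 reveals the behaviour of this parameter choice
print(np.linalg.norm(traj[-5:], axis=1))
```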
On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent
Constant step-size Stochastic Gradient Descent exhibits two phases: a
transient phase during which iterates make fast progress towards the optimum,
followed by a stationary phase during which iterates oscillate around the
optimal point. In this paper, we show that efficiently detecting this
transition and appropriately decreasing the step size can lead to fast
convergence rates. We analyse the classical statistical test proposed by Pflug
(1983), based on the inner product between consecutive stochastic gradients.
Even in the simple case where the objective function is quadratic, we show that
this test cannot lead to an adequate convergence diagnostic. We then propose a
novel and simple statistical procedure that accurately detects stationarity and
we provide experimental results showing state-of-the-art performance on
synthetic and real-world datasets.
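For concreteness, here is a minimal sketch of the Pflug-type restart scheme analyzed in the paper (and shown there to be unreliable even for quadratics): accumulate the inner products of consecutive stochastic gradients and halve the step size once the running sum turns negative after a burn-in. Function and parameter names are illustrative.

```python
import numpy as np

def sgd_with_pflug_diagnostic(grad_fn, x0, step, n_iters, burn_in, rng):
    """Constant step-size SGD with a Pflug-type diagnostic: the running sum of
    consecutive-gradient inner products serves as a proxy for stationarity."""
    x = x0.astype(float).copy()
    g_prev = grad_fn(x, rng)
    running_sum, count = 0.0, 0
    for _ in range(n_iters):
        g = grad_fn(x, rng)
        x -= step * g
        running_sum += float(g @ g_prev)   # inner product of consecutive gradients
        g_prev = g
        count += 1
        if count > burn_in and running_sum < 0:
            step /= 2.0                    # decrease the step size and restart the test
            running_sum, count = 0.0, 0
    return x, step

# toy usage on a noisy quadratic (illustrative)
rng = np.random.default_rng(0)
grad_fn = lambda x, r: x + 0.5 * r.normal(size=x.shape)
x_final, final_step = sgd_with_pflug_diagnostic(grad_fn, np.ones(10), 0.5, 20_000, 100, rng)
```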
Non-parametric Stochastic Approximation with Large Step sizes
We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS H, even if the optimal predictor (i.e., the conditional expectation) is not in H. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes for the smoothness of the optimal prediction function and of the functions in H.
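A rough single-pass sketch of the averaged, unregularized kernel least-mean-squares recursion described above, with a Gaussian kernel; the bandwidth, step size, and data are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def averaged_kernel_lms(X, y, kernel, step):
    """Single-pass unregularized kernel LMS with Polyak-Ruppert averaging.
    The estimator after t steps is f_t = sum_i alpha[i] * kernel(X[i], .);
    avg_alpha holds the coefficients of the averaged predictor."""
    n = len(y)
    alpha = np.zeros(n)
    avg_alpha = np.zeros(n)
    for t in range(n):
        # prediction of the current (non-averaged) iterate at the new point X[t]
        k_t = np.array([kernel(X[i], X[t]) for i in range(t)])
        pred = k_t @ alpha[:t] if t > 0 else 0.0
        alpha[t] = -step * (pred - y[t])   # SGD step adds one new kernel coefficient
        avg_alpha[: t + 1] += (alpha[: t + 1] - avg_alpha[: t + 1]) / (t + 1)
    return avg_alpha

# toy usage with a Gaussian kernel (bandwidth chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sin(3 * X) + 0.1 * rng.normal(size=200)
gauss = lambda a, b: np.exp(-(a - b) ** 2 / 0.1)
coeffs = averaged_kernel_lms(X, y, gauss, step=0.5)
```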
Proving Linear Mode Connectivity of Neural Networks via Optimal Transport
The energy landscape of high-dimensional non-convex optimization problems is
crucial to understanding the effectiveness of modern deep neural network
architectures. Recent works have experimentally shown that two different
solutions found after two runs of stochastic training are often connected by
very simple continuous paths (e.g., linear) modulo a permutation of the
weights. In this paper, we provide a framework theoretically explaining this
empirical observation. Based on convergence rates in Wasserstein distance of
empirical measures, we show that, with high probability, two wide enough
two-layer neural networks trained with stochastic gradient descent are linearly
connected. Additionally, we give upper and lower bounds on the width that each
layer of two deep neural networks with independent neuron weights must have to
be linearly connected. Finally, we empirically demonstrate the validity of our
approach by showing how the dimension of the support of the weight distribution
of neurons, which dictates Wasserstein convergence rates, is correlated with
linear mode connectivity.
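As a simplified illustration of the phenomenon (not the paper's optimal-transport argument), the sketch below aligns the hidden neurons of one two-layer ReLU network to another by solving an assignment problem on their incoming weights, then evaluates the loss along the linear path between the aligned weight vectors; all shapes and the matching cost are assumptions for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permute_to_match(W1_a, W1_b, w2_b):
    """Align the hidden neurons of network B to those of network A via an
    assignment problem on distances between incoming weight vectors
    (a simple weight-matching heuristic)."""
    cost = np.linalg.norm(W1_a[:, None, :] - W1_b[None, :, :], axis=-1)
    _, cols = linear_sum_assignment(cost)
    return W1_b[cols], w2_b[cols]            # permuted first-layer rows and output weights

def loss_along_path(X, y, params_a, params_b, n_points=11):
    """Squared loss of a two-layer ReLU network along the linear path in weight space."""
    (W1_a, w2_a), (W1_b, w2_b) = params_a, params_b
    losses = []
    for lam in np.linspace(0.0, 1.0, n_points):
        W1 = (1 - lam) * W1_a + lam * W1_b
        w2 = (1 - lam) * w2_a + lam * w2_b
        pred = np.maximum(X @ W1.T, 0.0) @ w2
        losses.append(np.mean((pred - y) ** 2))
    return losses

# toy usage: two random two-layer nets with 6 hidden units on 4 inputs (illustrative)
rng = np.random.default_rng(0)
W1_a, w2_a = rng.normal(size=(6, 4)), rng.normal(size=6)
W1_b, w2_b = rng.normal(size=(6, 4)), rng.normal(size=6)
W1_bp, w2_bp = permute_to_match(W1_a, W1_b, w2_b)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
print(loss_along_path(X, y, (W1_a, w2_a), (W1_bp, w2_bp)))
```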
Context Mover's Distance & Barycenters: Optimal Transport of Contexts for Building Representations
We present a framework for building unsupervised representations of entities
and their compositions, where each entity is viewed as a probability
distribution rather than a vector embedding. In particular, this distribution
is supported over the contexts which co-occur with the entity and are embedded
in a suitable low-dimensional space. This enables us to consider representation
learning from the perspective of Optimal Transport and take advantage of its
tools such as Wasserstein distance and barycenters. We elaborate on how the method
can be applied for obtaining unsupervised representations of text and
illustrate the performance (quantitatively as well as qualitatively) on tasks
such as measuring sentence similarity, word entailment and similarity, where we
empirically observe significant gains (e.g., 4.1% relative improvement over
Sent2vec, GenSen).
The key benefits of the proposed approach include: (a) capturing uncertainty
and polysemy via modeling the entities as distributions, (b) utilizing the
underlying geometry of the particular task (with the ground cost), (c)
simultaneously providing interpretability with the notion of optimal transport
between contexts and (d) easy applicability on top of existing point embedding
methods. The code, as well as prebuilt histograms, is available at
https://github.com/context-mover/.
Comment: AISTATS 2020. Also accepted previously at the ICLR 2019 DeepGenStruct Workshop.
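As a toy illustration of the optimal-transport machinery used here, the following sketch computes an entropic-regularized Wasserstein (Sinkhorn) cost between two "entities" represented as histograms over a shared set of embedded contexts; the contexts, histograms, and regularization value are made up for the example, and the actual CMD pipeline builds its histograms from co-occurrence statistics.

```python
import numpy as np

def sinkhorn_cost(a, b, M, reg=0.1, n_iters=200):
    """Entropic-regularized optimal transport cost between histograms a and b
    with ground cost matrix M (standard Sinkhorn iterations)."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return np.sum(P * M)

# toy usage: two "entities" as distributions over 4 shared context embeddings
contexts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
M = np.linalg.norm(contexts[:, None, :] - contexts[None, :, :], axis=-1)  # ground cost
a = np.array([0.7, 0.1, 0.1, 0.1])
b = np.array([0.1, 0.1, 0.1, 0.7])
print(sinkhorn_cost(a, b, M))
```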