A unified variance-reduced accelerated gradient method for convex optimization
We propose a novel randomized incremental gradient algorithm, namely,
VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization.
Equipped with a unified step-size policy that adjusts itself to the value of
the condition number, Varag exhibits unified optimal rates of convergence
for solving smooth convex finite-sum problems directly, regardless of their
strong convexity. Moreover, Varag is the first accelerated randomized
incremental gradient method that benefits from the strong convexity of the
data-fidelity term to achieve optimal linear convergence. It also
establishes an optimal linear rate of convergence for a wide class of
problems that only satisfy a certain error bound condition rather than strong
convexity. Varag can also be extended to solve stochastic finite-sum problems.
Comment: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)
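Since the abstract gives no pseudocode, the following is a minimal sketch of the general variance-reduced accelerated gradient template that a method like Varag instantiates. The step size eta and extrapolation weight alpha are illustrative placeholders, not the paper's unified policy, and grad_i is an assumed user-supplied component-gradient oracle.

```python
import numpy as np

def vrag_sketch(grad_i, n, x0, L, epochs=20, m=None, seed=0):
    """Sketch of a variance-reduced accelerated gradient loop.

    grad_i(x, i) returns the gradient of the i-th component f_i at x.
    The constants below are fixed simplifications; the actual Varag
    schedule adapts to the (possibly unknown) condition number.
    """
    rng = np.random.default_rng(seed)
    m = m or n                           # inner-loop length per epoch
    eta, alpha = 1.0 / (3.0 * L), 0.5    # illustrative, not the paper's policy
    x_bar = np.asarray(x0, dtype=float)  # snapshot point
    x = x_bar.copy()
    for _ in range(epochs):
        # full gradient at the snapshot: the variance-reduction anchor
        g_bar = sum(grad_i(x_bar, i) for i in range(n)) / n
        for _ in range(m):
            z = alpha * x + (1 - alpha) * x_bar   # accelerated extrapolation
            i = rng.integers(n)
            # variance-reduced estimator evaluated at the extrapolated point
            g = grad_i(z, i) - grad_i(x_bar, i) + g_bar
            x = x - eta * g
        x_bar = x.copy()                 # refresh the snapshot each epoch
    return x
```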
Delayed Projection Techniques for Linearly Constrained Problems: Convergence Rates, Acceleration, and Applications
In this work, we study a novel class of projection-based algorithms for
linearly constrained problems (LCPs) which have a lot of applications in
statistics, optimization, and machine learning. Conventional primal
gradient-based methods for LCPs call a projection after each (stochastic)
gradient step, so the required number of projections equals the number of
gradient steps (i.e., the total iteration count). Motivated by recent
progress in distributed optimization, we propose the delayed projection
technique, which calls a projection only once in a while, lowering the projection
frequency and improving the projection efficiency. Accordingly, we devise a
series of stochastic methods for LCPs using this technique, including a
variance-reduced method and an accelerated one. We theoretically show that it is
feasible to improve projection efficiency in both strongly convex and generally
convex cases. Our analysis is simple and unified and can be easily extended to
other methods using delayed projections. When applying our new algorithms to
federated optimization, an emerging, privacy-preserving subfield of
distributed optimization, we obtain not only a variance-reduced federated
algorithm with better convergence rates than previous work, but also the first
accelerated method able to handle the data heterogeneity inherent in federated
optimization.
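To make the core idea concrete, here is a minimal sketch of delayed projection applied to projected SGD. Both grad and project are assumed user-supplied oracles, and the fixed projection period K is illustrative; the paper's variance-reduced and accelerated variants use more elaborate schedules.

```python
import numpy as np

def delayed_projected_sgd(grad, project, x0, step=0.01, T=1000, K=10, seed=0):
    """Projected SGD with delayed projections.

    Instead of projecting onto the constraint set after every stochastic
    gradient step (the conventional scheme), we project only every K steps,
    so roughly T/K projections are performed instead of T.
    """
    rng = np.random.default_rng(seed)
    x = project(np.asarray(x0, dtype=float))  # start from a feasible point
    for t in range(1, T + 1):
        x = x - step * grad(x, rng)           # unconstrained stochastic step
        if t % K == 0:                        # delayed projection: every K steps
            x = project(x)
    return project(x)                         # return a feasible point
```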
Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization
Recently, {\it stochastic momentum} methods have been widely adopted in
training deep neural networks. However, their convergence analysis remains
underexplored, particularly for non-convex optimization. This
paper fills the gap between practice and theory by developing a basic
convergence analysis of two stochastic momentum methods, namely stochastic
heavy-ball method and the stochastic variant of Nesterov's accelerated gradient
method. We hope that the basic convergence results developed in this paper can
serve as a reference for the convergence of stochastic momentum methods and as
baselines for comparison in the future development of such methods. The
novelty of the convergence analysis presented in this paper is a
unified framework, revealing more insights about the similarities and
differences between different stochastic momentum methods and stochastic
gradient method. The unified framework exhibits a continuous change from the
gradient method to Nesterov's accelerated gradient method and finally the
heavy-ball method, induced by a free parameter, which can help explain a
similar change observed in the testing error convergence behavior for deep
learning. Furthermore, our empirical results for optimizing deep neural
networks demonstrate that the stochastic variant of Nesterov's accelerated
gradient method achieves a good tradeoff (between speed of convergence in
training error and robustness of convergence in testing error) among the three
stochastic methods.
Comment: Added some references and more empirical results
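A minimal sketch of the kind of unified update the abstract describes, with a free parameter s interpolating between the methods; the exact parametrization in the paper may differ from this common form. Here s = 0 recovers the stochastic heavy-ball method, s = 1 the stochastic variant of Nesterov's accelerated gradient, and beta = 0 plain SGD.

```python
import numpy as np

def unified_momentum(sgrad, x0, alpha=0.01, beta=0.9, s=1.0, T=1000, seed=0):
    """Unified stochastic momentum sketch with free parameter s."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    y_hat = x.copy()                          # auxiliary sequence \hat{y}_t
    for _ in range(T):
        g = sgrad(x, rng)                     # stochastic gradient at x_t
        y_next = x - alpha * g                # y_{t+1} = x_t - alpha * g_t
        y_hat_next = x - s * alpha * g        # \hat{y}_{t+1} = x_t - s*alpha*g_t
        x = y_next + beta * (y_hat_next - y_hat)  # momentum correction
        y_hat = y_hat_next
    return x
```

With s = 0 the update collapses to x_{t+1} = x_t - alpha*g_t + beta*(x_t - x_{t-1}), i.e., heavy-ball; with s = 1 it becomes x_{t+1} = y_{t+1} + beta*(y_{t+1} - y_t), i.e., the Nesterov form.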
Estimate Sequences for Variance-Reduced Stochastic Composite Optimization
In this paper, we propose a unified view of gradient-based algorithms for
stochastic convex composite optimization by extending the concept of estimate
sequence introduced by Nesterov. This point of view covers the stochastic
gradient descent method and variants of the SAGA and SVRG approaches, and it
has several advantages: (i) we provide a generic proof of convergence for the
aforementioned methods; (ii) we show that our SVRG variant is adaptive to
strong convexity; (iii) we naturally obtain new algorithms with the same
guarantees; (iv) we derive generic strategies to make these algorithms robust
to stochastic noise, which is useful when data is corrupted by small random
perturbations. Finally, we show that this viewpoint is useful to obtain new
accelerated algorithms in the sense of Nesterov.
Comment: short version of preprint arXiv:1901.0878
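For context, Nesterov's classical notion, which the paper extends to the stochastic composite setting, can be stated as follows; this is the textbook definition, not the paper's generalized one.

```latex
A pair of sequences $(\varphi_k)_{k\ge 0}$ (functions) and
$(\lambda_k)_{k\ge 0}$ (scalars with $\lambda_k \to 0$) is an
\emph{estimate sequence} of $f$ if, for all $x$ and all $k$,
\[
  \varphi_k(x) \;\le\; (1-\lambda_k)\, f(x) + \lambda_k\, \varphi_0(x).
\]
If an algorithm additionally maintains iterates with
$f(x_k) \le \min_x \varphi_k(x)$, then
\[
  f(x_k) - f^\star \;\le\; \lambda_k \bigl(\varphi_0(x^\star) - f^\star\bigr)
  \;\longrightarrow\; 0 .
\]
```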
Larger is Better: The Effect of Learning Rates Enjoyed by Stochastic Optimization with Progressive Variance Reduction
In this paper, we propose a simple variant of the original stochastic
variance reduced gradient (SVRG) method, which we hereafter refer to as
variance-reduced stochastic gradient descent (VR-SGD). In contrast to the choices of the
snapshot point and starting point in SVRG and its proximal variant, Prox-SVRG,
the two vectors of each epoch in VR-SGD are set to the average and last iterate
of the previous epoch, respectively. This setting allows us to use much larger
learning rates or step sizes than SVRG, e.g., 3/(7L) for VR-SGD vs 1/(10L) for
SVRG, and also makes our convergence analysis more challenging. In fact, a
larger learning rate enjoyed by VR-SGD means that the variance of its
stochastic gradient estimator asymptotically approaches zero more rapidly.
Unlike common stochastic methods such as SVRG and proximal stochastic methods
such as Prox-SVRG, we design two different update rules for smooth and
non-smooth objective functions, respectively. In other words, VR-SGD can tackle
non-smooth and/or non-strongly convex problems directly without using any
reduction techniques such as quadratic regularizers. Moreover, we analyze the
convergence properties of VR-SGD for strongly convex problems, which show that
VR-SGD attains a linear convergence rate. We also provide the convergence
guarantees of VR-SGD for non-strongly convex problems. Experimental results
show that the performance of VR-SGD is significantly better than its
counterparts, SVRG and Prox-SVRG, and it is also much better than the best
known stochastic method, Katyusha.
Comment: 36 pages, 10 figures. The simple variant of SVRG is much better than the best-known stochastic method, Katyusha
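A minimal sketch of the epoch structure the abstract describes, for the smooth case: the snapshot is the average iterate of the previous epoch while the next epoch starts from the last iterate. grad_i is an assumed component-gradient oracle and the inner-loop length m is illustrative.

```python
import numpy as np

def vr_sgd_sketch(grad_i, n, x0, L, epochs=20, m=None, seed=0):
    """VR-SGD epoch structure (smooth, non-proximal sketch).

    Unlike SVRG, which restarts each epoch from the snapshot, here the
    snapshot is the *average* iterate of the previous epoch and the next
    epoch *starts* from the last iterate, which permits the larger step
    size 3/(7L) quoted above (vs. roughly 1/(10L) for SVRG).
    """
    rng = np.random.default_rng(seed)
    m = m or 2 * n
    eta = 3.0 / (7.0 * L)                    # the larger step size
    x = np.asarray(x0, dtype=float)
    snapshot = x.copy()
    for _ in range(epochs):
        g_full = sum(grad_i(snapshot, i) for i in range(n)) / n
        iterates = []
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(snapshot, i) + g_full  # VR estimator
            x = x - eta * g
            iterates.append(x.copy())
        snapshot = np.mean(iterates, axis=0)  # snapshot = epoch average
        # next epoch starts from the last iterate x, kept as-is
    return x
```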
Boosting First-order Methods by Shifting Objective: New Schemes with Faster Worst Case Rates
We propose a new methodology to design first-order methods for unconstrained
strongly convex problems, namely, designing methods for a shifted objective
function. Several technical lemmas are provided as building blocks for
designing new methods. Shifting the objective both tightens the analysis,
which leaves room for faster rates, and simplifies it. Following this
methodology, we derive several new accelerated schemes for problems equipped
with various first-order oracles, and all of the derived methods have faster worst-case
convergence rates than their existing counterparts. Experiments on machine
learning tasks are conducted to evaluate the new methods.
Comment: 27 pages, 7 figures
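As one concrete instance of objective shifting (not necessarily the paper's exact construction): for an $L$-smooth, $\mu$-strongly convex $f$, one may subtract the strongly convex quadratic part,

```latex
\[
  h(x) \;:=\; f(x) - \tfrac{\mu}{2}\,\|x\|^2 ,
  \qquad
  \nabla h(x) = \nabla f(x) - \mu x ,
\]
% h is convex and (L - \mu)-smooth, so a first-order method for f can be
% analyzed as a method applied to the shifted function h, which is one way
% such an analysis can be tightened.
```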
A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent
In this paper we introduce a unified analysis of a large family of variants
of proximal stochastic gradient descent ({\tt SGD}) which have so far required
different intuitions and convergence analyses, have different applications, and
have been developed separately in various communities. We show that our
framework includes methods with and without the following tricks, and their
combinations: variance reduction, importance sampling, mini-batch sampling,
quantization, and coordinate sub-sampling. As a by-product, we obtain the first
unified theory of {\tt SGD} and randomized coordinate descent ({\tt RCD})
methods, the first unified theory of variance reduced and non-variance-reduced
{\tt SGD} methods, and the first unified theory of quantized and non-quantized
methods. A key to our approach is a parametric assumption on the iterates and
stochastic gradients. In a single theorem we establish a linear convergence
result under this assumption and quasi-strong convexity of the loss function.
Whenever we recover an existing method as a special case, our theorem gives the
best known complexity result. Our approach can be used to motivate the
development of new useful methods, and offers pre-proved convergence
guarantees. To illustrate the strength of our approach, we develop five new
variants of {\tt SGD}, and through numerical experiments demonstrate some of
their properties.
Comment: 38 pages, 4 figures, 2 tables
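The parametric assumption takes roughly the following form (the constant names follow the paper's style but should be checked against the paper itself): there exist constants $A, B, C, D_1, D_2 \ge 0$, $\rho \in (0,1]$, and a sequence $\sigma_k^2$ such that the stochastic gradients $g^k$ satisfy

```latex
\[
  \mathbb{E}\bigl[\|g^k\|^2 \,\big|\, x^k\bigr]
    \;\le\; 2A\,\bigl(f(x^k) - f^\star\bigr) + B\,\sigma_k^2 + D_1 ,
\]
\[
  \mathbb{E}\bigl[\sigma_{k+1}^2 \,\big|\, x^k\bigr]
    \;\le\; (1-\rho)\,\sigma_k^2 + 2C\,\bigl(f(x^k) - f^\star\bigr) + D_2 .
\]
```

Each trick (variance reduction, importance or mini-batch sampling, quantization, coordinate sub-sampling) then corresponds to a particular choice of these constants, and the single linear-convergence theorem is proved once under the assumption.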
The proximal point method revisited
In this short survey, I revisit the role of the proximal point method in
large-scale optimization. I focus on three recent examples: a proximally guided
subgradient method for weakly convex stochastic approximation, the prox-linear
algorithm for minimizing compositions of convex functions and smooth maps, and
Catalyst generic acceleration for regularized Empirical Risk Minimization.
Comment: 11 pages, submitted to SIAG/OPT Views and News
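A minimal sketch of the proximal point iteration around which all three examples revolve; the inner argmin is solved here with a generic off-the-shelf solver, whereas the surveyed methods solve it approximately with cheap iterative schemes.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(f, x0, lam=1.0, iters=20):
    """Proximal point iteration:
    x_{k+1} = argmin_z f(z) + ||z - x_k||^2 / (2*lam).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # freeze the current center x_k via a default argument
        prox_obj = lambda z, xc=x: f(z) + np.sum((z - xc) ** 2) / (2 * lam)
        x = minimize(prox_obj, x).x   # inner solve (BFGS by default)
    return x
```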
Stochastic variance reduced multiplicative update for nonnegative matrix factorization
Nonnegative matrix factorization (NMF) is a dimensionality reduction and
factor analysis method in which the factor matrices are constrained to be
low-rank and nonnegative. Considering stochastic learning in NMF, we
specifically address the multiplicative update (MU) rule, which is the most
popular but converges slowly. This paper introduces a variance-reduced
stochastic gradient technique into the stochastic MU rule.
Numerical comparisons suggest that our proposed algorithms robustly outperform
state-of-the-art algorithms across different synthetic and real-world datasets.
Comment: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
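For reference, here is the classical full-batch MU rule (the Frobenius-norm Lee-Seung updates) that the paper makes stochastic and variance-reduced; the variance-reduced stochastic variant itself is paper-specific and omitted.

```python
import numpy as np

def nmf_mu(V, r, iters=200, eps=1e-10, seed=0):
    """Classical multiplicative updates for NMF: V ~ W @ H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        # elementwise updates preserve nonnegativity by construction
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```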
Scaling-up Distributed Processing of Data Streams for Machine Learning
Emerging applications of machine learning in numerous areas involve
continuous gathering of and learning from streams of data. Real-time
incorporation of streaming data into the learned models is essential for
improved inference in these applications. Further, these applications often
involve data that are either inherently gathered at geographically distributed
entities or that are intentionally distributed across multiple machines for
memory, computational, and/or privacy reasons. Training of models in this
distributed, streaming setting requires solving stochastic optimization
problems in a collaborative manner over communication links between the
physical entities. When the streaming data rate is high compared to the
processing capabilities of the compute nodes and/or the rate of the
communication links, this poses a challenging question: how can one best leverage the
incoming data for distributed training under constraints on computing
capabilities and/or communications rate? A large body of research has emerged
in recent decades to tackle this and related problems. This paper reviews
recently developed methods that focus on large-scale distributed stochastic
optimization in the compute- and bandwidth-limited regime, with an emphasis on
convergence analysis that explicitly accounts for the mismatch between
computation, communication and streaming rates. In particular, it focuses on
methods that solve: (i) distributed stochastic convex problems, and (ii)
distributed principal component analysis, which is a nonconvex problem with
geometric structure that permits global convergence. For such methods, the
paper discusses recent advances in terms of distributed algorithmic designs
when faced with high-rate streaming data. Further, it reviews guarantees
underlying these methods, which show there exist regimes in which systems can
learn from distributed, streaming data at order-optimal rates.
Comment: 45 pages, 9 figures; preprint of a journal paper published in Proceedings of the IEEE (Special Issue on Optimization for Data-driven Learning and Control)
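As a minimal illustration of the compute/communication trade-off discussed above, here is a local-SGD-style sketch in which each worker takes several local steps between communication rounds. This is a representative member of the reviewed family, not an algorithm from the paper, and sgrad is an assumed per-worker stochastic gradient oracle.

```python
import numpy as np

def local_sgd(sgrad, x0, workers=4, rounds=50, local_steps=10,
              step=0.01, seed=0):
    """Local SGD sketch for the compute/bandwidth-limited regime."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for w in range(workers):
            xw = x.copy()
            for _ in range(local_steps):        # local work, no communication
                xw -= step * sgrad(xw, w, rng)  # gradient from worker w's stream
            local_models.append(xw)
        x = np.mean(local_models, axis=0)       # one communication: averaging
    return x
```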