706 research outputs found
Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly-convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p \textless{} 1. Further, in many cases the convergence rate of the new method
is also faster than black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.Comment: Revision from January 2015 submission. Major changes: updated
literature follow and discussion of subsequent work, additional Lemma showing
the validity of one of the formulas, somewhat simplified presentation of
Lyapunov bound, included code needed for checking proofs rather than the
polynomials generated by the code, added error regions to the numerical
experiment
Rest-Katyusha: Exploiting the Solution's Structure via Scheduled Restart Schemes
We propose a structure-adaptive variant of the state-of-the-art stochastic
variance-reduced gradient algorithm Katyusha for regularized empirical risk
minimization. The proposed method is able to exploit the intrinsic
low-dimensional structure of the solution, such as sparsity or low rank which
is enforced by a non-smooth regularization, to achieve even faster convergence
rate. This provable algorithmic improvement is done by restarting the Katyusha
algorithm according to restricted strong-convexity constants. We demonstrate
the effectiveness of our approach via numerical experiments
Convergence Analysis of Accelerated Stochastic Gradient Descent under the Growth Condition
We study the convergence of accelerated stochastic gradient descent for
strongly convex objectives under the growth condition, which states that the
variance of stochastic gradient is bounded by a multiplicative part that grows
with the full gradient, and a constant additive part. Through the lens of the
growth condition, we investigate four widely used accelerated methods:
Nesterov's accelerated method (NAM), robust momentum method (RMM), accelerated
dual averaging method (ADAM), and implicit ADAM (iADAM). While these methods
are known to improve the convergence rate of SGD under the condition that the
stochastic gradient has bounded variance, it is not well understood how their
convergence rates are affected by the multiplicative noise. In this paper, we
show that these methods all converge to a neighborhood of the optimum with
accelerated convergence rates (compared to SGD) even under the growth
condition. In particular, NAM, RMM, iADAM enjoy acceleration only with a mild
multiplicative noise, while ADAM enjoys acceleration even with a large
multiplicative noise. Furthermore, we propose a generic tail-averaged scheme
that allows the accelerated rates of ADAM and iADAM to nearly attain the
theoretical lower bound (up to a logarithmic factor in the variance term)
Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization
We consider a generic convex optimization problem associated with regularized
empirical risk minimization of linear predictors. The problem structure allows
us to reformulate it as a convex-concave saddle point problem. We propose a
stochastic primal-dual coordinate (SPDC) method, which alternates between
maximizing over a randomly chosen dual variable and minimizing over the primal
variable. An extrapolation step on the primal variable is performed to obtain
accelerated convergence rate. We also develop a mini-batch version of the SPDC
method which facilitates parallel computing, and an extension with weighted
sampling probabilities on the dual variables, which has a better complexity
than uniform sampling on unnormalized data. Both theoretically and empirically,
we show that the SPDC method has comparable or better performance than several
state-of-the-art optimization methods
Distributed Algorithms in Large-scaled Empirical Risk Minimization: Non-convexity, Adaptive-sampling, and Matrix-free Second-order Methods
The rising amount of data has changed the classical approaches in statistical modeling significantly. Special methods are designed for inferring meaningful relationships and hidden patterns from these large datasets, which build the foundation of a study called Machine Learning (ML). Such ML techniques have already applied widely in various areas and achieved compelling success. In the meantime, the huge amount of data also requires a deep revolution of current techniques, like the availability of advanced data storage, new efficient large-scale algorithms, and their distributed/parallelized implementation.There is a broad class of ML methods can be interpreted as Empirical Risk Minimization (ERM) problems. When utilizing various loss functions and likely necessary regularization terms, one could approach their specific ML goals by solving ERMs as separable finite sum optimization problems. There are circumstances where the nonconvex component is introduced into the ERMs which usually makes the problems hard to optimize. Especially, in recent years, neural networks, a popular branch of ML, draw numerous attention from the community. Neural networks are powerful and highly flexible inspired by the structured functionality of the brain. Typically, neural networks could be treated as large-scale and highly nonconvex ERMs.While as nonconvex ERMs become more complex and larger in scales, optimization using stochastic gradient descent (SGD) type methods proceeds slowly regarding its convergence rate and incapability of being distributed efficiently. It motivates researchers to explore more advanced local optimization methods such as approximate-Newton/second-order methods.In this dissertation, first-order stochastic optimization for the regularized ERMs in Chapter1 is studied. Based on the development of stochastic dual coordinate accent (SDCA) method, a dual free SDCA with non-uniform mini-batch sampling strategy is investigated [30, 29]. We also introduce several efficient algorithms for training ERMs, including neural networks, using second-order optimization methods in a distributed environment. In Chapter 2, we propose a practical distributed implementation for Newton-CG methods. It makes training neural networks by second-order methods doable in the distributed environment [28]. In Chapter 3, we further build steps towards using second-order methods to train feed-forward neural networks with negative curvature direction utilization and momentum acceleration. In this Chapter, we also report numerical experiments for comparing second-order methods and first-order methods regarding training neural networks. The following Chapter 4 purpose an distributed accumulative sample-size second-order methods for solving large-scale convex ERMs and nonconvex neural networks [35]. In Chapter 5, a python library named UCLibrary is briefly introduced for solving unconstrained optimization problems. This dissertation is all concluded in the last Chapter 6
- …