Sharpened Lazy Incremental Quasi-Newton Method
We consider the finite sum minimization of $n$ strongly convex and smooth functions with Lipschitz continuous Hessians in $d$ dimensions. In many applications where such problems arise, including maximum likelihood estimation, empirical risk minimization, and unsupervised learning, the number of observations is large, and it becomes necessary to use incremental or stochastic algorithms whose per-iteration complexity is independent of $n$. Of these, the incremental/stochastic variants of the Newton method exhibit superlinear convergence, but incur a per-iteration complexity of $O(d^3)$, which may be prohibitive in large-scale settings. On the other hand, the incremental Quasi-Newton method incurs a per-iteration complexity of $O(d^2)$, but its superlinear convergence rate has only been characterized asymptotically. This work puts forth the Sharpened Lazy Incremental Quasi-Newton (SLIQN) method that achieves the best of both worlds: an explicit superlinear convergence rate with a per-iteration complexity of $O(d^2)$. Building upon the recently proposed Sharpened Quasi-Newton method, the proposed incremental variant employs a hybrid update strategy that combines classic and greedy BFGS updates. The proposed lazy update rule distributes the computational effort across iterations, so as to achieve a per-iteration complexity of $O(d^2)$. Numerical tests demonstrate the superiority of SLIQN over all other incremental and stochastic Quasi-Newton variants.
Comment: 39 pages, 3 figures
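As a minimal illustration of the two update rules that the hybrid strategy combines, the sketch below applies classic and greedy BFGS updates to a Hessian approximation $B$ of a fixed positive definite matrix $A$. The function names, test matrix, and iteration count are illustrative assumptions; this is not the SLIQN algorithm itself.

```python
# Minimal sketch (not the SLIQN implementation): classic vs. greedy BFGS
# updates of a Hessian approximation B toward a fixed true Hessian A.
import numpy as np

def bfgs_update(B, u, Au):
    """Standard BFGS formula applied along direction u with exact curvature Au = A @ u."""
    Bu = B @ u
    return B - np.outer(Bu, Bu) / (u @ Bu) + np.outer(Au, Au) / (u @ Au)

def bfgs_classic(B, A, x_old, x_new):
    """Classic update: use the iterate displacement s = x_new - x_old."""
    s = x_new - x_old
    return bfgs_update(B, s, A @ s)

def bfgs_greedy(B, A):
    """Greedy update: pick the coordinate direction maximizing e_i^T B e_i / e_i^T A e_i."""
    i = int(np.argmax(np.diag(B) / np.diag(A)))
    e = np.zeros(B.shape[0]); e[i] = 1.0
    return bfgs_update(B, e, A[:, i])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    M = rng.standard_normal((d, d))
    A = M @ M.T + d * np.eye(d)          # a fixed SPD "Hessian"
    B = np.trace(A) * np.eye(d)          # initialization satisfying B >= A
    for _ in range(200):                 # hybrid scheme: classic then greedy step
        s = rng.standard_normal(d)
        B = bfgs_classic(B, A, np.zeros(d), s)
        B = bfgs_greedy(B, A)
    print(np.linalg.norm(B - A))         # B approaches A as the updates accumulate
```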
Efficient Methods For Large-Scale Empirical Risk Minimization
Empirical risk minimization (ERM) problems express optimal classifiers as solutions of optimization problems in which the objective is the sum of a very large number of sample costs. An evident obstacle in using traditional descent algorithms for solving this class of problems is their prohibitive computational complexity when the number of component functions in the ERM problem is large. The main goal of this thesis is to study different approaches to solve these large-scale ERM problems.
We begin by focusing on incremental and stochastic methods, which split the training samples into smaller sets across time to lower the computational burden of traditional descent algorithms. We develop and analyze convergent stochastic variants of quasi-Newton methods which do not require computation of the objective Hessian and instead approximate the curvature using only gradient information. We show that the curvature approximation in stochastic quasi-Newton methods leads to faster convergence relative to first-order stochastic methods when the problem is ill-conditioned. We culminate with the introduction of an incremental method that exploits memory to achieve a superlinear convergence rate, the best known convergence rate for an incremental method.
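The gradient-only curvature approximation can be sketched as follows, assuming a toy least-squares problem; the regularization added to the curvature pair (in the spirit of regularized stochastic BFGS) and all constants are illustrative assumptions, not the thesis' exact algorithms.

```python
# Minimal sketch of a stochastic quasi-Newton step: curvature is approximated
# from gradient differences on the SAME sample, so no Hessian is ever formed.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.01 * rng.standard_normal(n)

def grad_i(w, i):
    """Gradient of the i-th sample cost 0.5*(x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
H = np.eye(d)                 # inverse-Hessian approximation
eta, delta = 0.1, 1.0         # step size and curvature regularization (assumed)
for k in range(3000):
    i = rng.integers(n)
    g = grad_i(w, i)
    w_new = w - eta * H @ g
    # curvature pair from gradients of the same sample, regularized so that
    # s^T y > 0 and H stays positive definite
    s = w_new - w
    yv = grad_i(w_new, i) - g + delta * s
    if s @ yv > 1e-12:        # skip degenerate steps
        rho = 1.0 / (s @ yv)
        V = np.eye(d) - rho * np.outer(s, yv)
        H = V @ H @ V.T + rho * np.outer(s, s)
    w = w_new
print(np.linalg.norm(w - w_star))   # distance to the generating parameter
```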
An alternative strategy for lowering the prohibitive cost of solving large-scale ERM problems is decentralized optimization, whereby samples are separated not across time but across multiple nodes of a network. In this regime, the main contribution of this thesis is to incorporate second-order information of the aggregate risk, corresponding to the samples of all nodes in the network, in a way that can be implemented in a distributed fashion. We also explore the separation of samples across both time and space to reduce the computational and communication cost of solving large-scale ERM problems. We study this path by introducing a decentralized stochastic method which incorporates the idea of stochastic averaging gradient, leading to a low-computational-complexity method with a fast linear convergence rate.
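The stochastic averaging gradient idea can be sketched on a single machine as follows; the decentralized method additionally mixes iterates over a network, which is omitted here, and the data and step-size rule are illustrative assumptions.

```python
# Single-machine sketch of a stochastic averaging gradient (SAGA-style) step.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star

def grad_i(w, i):
    return (X[i] @ w - y[i]) * X[i]       # gradient of 0.5*(x_i^T w - y_i)^2

w = np.zeros(d)
table = np.array([grad_i(w, i) for i in range(n)])   # stored per-sample gradients
avg = table.mean(axis=0)
L = (X ** 2).sum(axis=1).max()            # per-sample smoothness constant
eta = 1.0 / (3.0 * L)
for k in range(5000):
    i = rng.integers(n)
    g_new = grad_i(w, i)
    direction = g_new - table[i] + avg    # variance-reduced gradient estimate
    avg += (g_new - table[i]) / n         # keep the running average consistent
    table[i] = g_new
    w -= eta * direction
print(np.linalg.norm(w - w_star))         # converges linearly on this strongly convex problem
```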
We then introduce a rethinking of ERM in which we consider not a partition of the training set, as in the case of stochastic and distributed optimization, but a nested collection of subsets that we grow geometrically. The key insight is that the optimal argument associated with a training subset of a certain size is not far from the optimal argument associated with a larger training subset. Based on this insight, we present adaptive sample size schemes which start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically, and the solution of the previous ERM problem is used as a warm start for the new one. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. We further show that if we couple the adaptive sample size scheme with Newton's method, it is possible to double the training set at each stage and perform a single Newton iteration in between. This is possible because of the interplay between the statistical accuracy and the quadratic convergence region of these problems, and it yields a method that is guaranteed to solve an ERM problem by performing just two passes over the dataset.
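A hedged sketch of the adaptive sample size idea on a simple regularized least-squares instance follows, assuming a 1/sqrt(m) statistical-accuracy proxy for the regularizer; this is illustrative, not the thesis' exact schedule, and because the subproblems here are quadratic a single Newton step solves each one exactly.

```python
# Sketch: double the training subset and take one Newton step from the previous
# solution, used here as a warm start.
import numpy as np

rng = np.random.default_rng(3)
N, d = 4096, 8
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
labels = X @ w_true + 0.1 * rng.standard_normal(N)

def erm_grad_hess(w, m, lam):
    """Gradient and Hessian of the regularized least-squares ERM on the first m samples."""
    Xa, ya = X[:m], labels[:m]
    r = Xa @ w - ya
    g = Xa.T @ r / m + lam * w
    H = Xa.T @ Xa / m + lam * np.eye(d)
    return g, H

w = np.zeros(d)
m = 64                                    # initial sample size (assumed)
while m <= N:
    lam = 1.0 / np.sqrt(m)                # regularizer tied to a ~1/sqrt(m) accuracy proxy
    g, H = erm_grad_hess(w, m, lam)
    w = w - np.linalg.solve(H, g)         # one Newton step per subset, warm-started
    m *= 2                                # geometric growth of the training subset
print(np.linalg.norm(w - w_true))         # distance to the generating parameter
```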
Accelerating Incremental Gradient Optimization with Curvature Information
This paper studies an acceleration technique for the incremental aggregated
gradient ({\sf IAG}) method through the use of \emph{curvature} information for
solving strongly convex finite sum optimization problems. These optimization
problems of interest arise in large-scale learning applications. Our technique
utilizes a curvature-aided gradient tracking step to produce accurate gradient
estimates incrementally using Hessian information. We propose and analyze two
methods utilizing the new technique, the curvature-aided IAG ({\sf CIAG})
method and the accelerated CIAG ({\sf A-CIAG}) method, which are analogous to
the gradient method and Nesterov's accelerated gradient method, respectively. Setting $\kappa$ to be the condition number of the objective function, we prove linear convergence rates of $O((1 - c_0/\kappa)^k)$ for the {\sf CIAG} method and $O((1 - c_1/\sqrt{\kappa})^k)$ for the {\sf A-CIAG} method, where $c_0$ and $c_1$ are constants inversely proportional to the distance between the initial point and the optimal solution. When the initial iterate is close to the optimal solution, these linear convergence rates match those of the gradient and accelerated gradient methods, even though {\sf CIAG} and {\sf A-CIAG} operate in an incremental setting with strictly lower computational complexity. Numerical experiments confirm our findings. The source
codes used for this paper can be found on
\url{http://github.com/hoitowai/ciag/}.
Comment: 22 pages, 3 figures, 3 tables. Accepted by Computational Optimization and Applications, to appear.
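Below is a hedged sketch of a curvature-aided incremental aggregated gradient step in the spirit of the technique described above; the authors' actual implementation is at the linked repository. Each component keeps the gradient and Hessian from its last visit, and the full gradient is tracked through a first-order Taylor correction around those points. The quadratic test problem, step size, and cyclic sampling are illustrative assumptions.

```python
# Sketch of curvature-aided incremental aggregated gradient tracking.
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_hess_i(w, i):
    """Gradient and Hessian of the i-th cost 0.5*(x_i^T w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i], np.outer(X[i], X[i])

w = np.zeros(d)
# per-sample memory: gradient/Hessian and the iterate at which they were taken
g_mem = np.zeros((n, d)); H_mem = np.zeros((n, d, d)); w_mem = np.zeros((n, d))
# running sums so each iteration costs O(d^2) rather than O(n d^2)
sum_const = np.zeros(d)
sum_H = np.zeros((d, d))
eta = 0.5 / n
for k in range(20 * n):
    i = k % n                                 # cyclic sample selection
    g_i, H_i = grad_hess_i(w, i)
    # replace sample i's contribution in the running sums
    sum_const += (g_i - H_i @ w) - (g_mem[i] - H_mem[i] @ w_mem[i])
    sum_H += H_i - H_mem[i]
    g_mem[i], H_mem[i], w_mem[i] = g_i, H_i, w
    # curvature-aided (Taylor-corrected) estimate of the full gradient
    full_grad_est = sum_const + sum_H @ w
    w = w - eta * full_grad_est
print(np.linalg.norm(X.T @ (X @ w - y)))      # full-gradient norm at the final iterate
```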