
    Sharpened Lazy Incremental Quasi-Newton Method

    We consider the finite sum minimization of $n$ strongly convex and smooth functions with Lipschitz continuous Hessians in $d$ dimensions. In many applications where such problems arise, including maximum likelihood estimation, empirical risk minimization, and unsupervised learning, the number of observations $n$ is large, and it becomes necessary to use incremental or stochastic algorithms whose per-iteration complexity is independent of $n$. Of these, the incremental/stochastic variants of the Newton method exhibit superlinear convergence, but incur a per-iteration complexity of $O(d^3)$, which may be prohibitive in large-scale settings. On the other hand, the incremental Quasi-Newton method incurs a per-iteration complexity of $O(d^2)$, but its superlinear convergence rate has only been characterized asymptotically. This work puts forth the Sharpened Lazy Incremental Quasi-Newton (SLIQN) method that achieves the best of both worlds: an explicit superlinear convergence rate with a per-iteration complexity of $O(d^2)$. Building upon the recently proposed Sharpened Quasi-Newton method, the proposed incremental variant employs a hybrid update strategy that combines classic and greedy BFGS updates. The proposed lazy update rule distributes the computational work across iterations so as to enable a per-iteration complexity of $O(d^2)$. Numerical tests demonstrate the superiority of SLIQN over all other incremental and stochastic Quasi-Newton variants.
    Comment: 39 pages, 3 figures
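
    For readers unfamiliar with the two Hessian-approximation updates mentioned in the abstract, the following is a minimal sketch, not the authors' SLIQN implementation: it shows a classic BFGS update from a correction pair and a greedy BFGS update that picks the coordinate direction maximizing the ratio between the current approximation and the true Hessian, in the style of greedy quasi-Newton updates. The function names and the NumPy formulation are assumptions made for illustration.

        import numpy as np

        def classic_bfgs(B, s, y):
            """Classic BFGS update of the Hessian approximation B using the pair
            s = x_new - x_old, y = grad_new - grad_old."""
            Bs = B @ s
            return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

        def greedy_bfgs(B, H):
            """Greedy BFGS update: choose the coordinate direction u maximizing
            (u^T B u) / (u^T H u), where H is the exact Hessian, and apply a
            BFGS-type correction along u."""
            d = B.shape[0]
            i = int(np.argmax(np.diag(B) / np.diag(H)))  # best coordinate direction
            u = np.zeros(d)
            u[i] = 1.0
            Bu, Hu = B @ u, H @ u
            return B - np.outer(Bu, Bu) / (u @ Bu) + np.outer(Hu, Hu) / (u @ Hu)

    Given $B$ and access to the Hessian, either update costs at most $O(d^2)$ per call, which is the per-iteration budget the abstract refers to.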

    Efficient Methods For Large-Scale Empirical Risk Minimization

    Empirical risk minimization (ERM) problems express optimal classifiers as solutions of optimization problems in which the objective is the sum of a very large number of sample costs. An evident obstacle in using traditional descent algorithms for solving this class of problems is their prohibitive computational complexity when the number of component functions in the ERM problem is large. The main goal of this thesis is to study different approaches to solving these large-scale ERM problems. We begin by focusing on incremental and stochastic methods, which split the training samples into smaller sets across time to lower the computational burden of traditional descent algorithms. We develop and analyze convergent stochastic variants of quasi-Newton methods which do not require computation of the objective Hessian and approximate the curvature using only gradient information. We show that the curvature approximation in stochastic quasi-Newton methods leads to faster convergence relative to first-order stochastic methods when the problem is ill-conditioned. We culminate with the introduction of an incremental method that exploits memory to achieve a superlinear convergence rate, which is the best known convergence rate for an incremental method.
    An alternative strategy for lowering the prohibitive cost of solving large-scale ERM problems is decentralized optimization, whereby samples are separated not across time but across multiple nodes of a network. In this regime, the main contribution of this thesis is in incorporating second-order information of the aggregate risk corresponding to the samples of all nodes in the network in a way that can be implemented in a distributed fashion. We also explore the separation of samples across both time and space to reduce the computational and communication cost of solving large-scale ERM problems. We study this path by introducing a decentralized stochastic method which incorporates the idea of stochastic averaging gradient, leading to a method with low computational complexity and a fast linear convergence rate.
    We then introduce a rethinking of ERM in which we consider not a partition of the training set, as in the case of stochastic and distributed optimization, but a nested collection of subsets that we grow geometrically. The key insight is that the optimal argument associated with a training subset of a certain size is not that far from the optimal argument associated with a larger training subset. Based on this insight, we present adaptive sample size schemes which start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically, and the solution of the previous ERM problem is used as a warm start for the new one. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. We further show that if we couple the adaptive sample size scheme with Newton's method, it is possible to double the training set at each stage and perform a single Newton iteration in between. This is possible because of the interplay between the statistical accuracy and the quadratic convergence region of these problems, and it yields a method that is guaranteed to solve an ERM problem by performing just two passes over the dataset.
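
    As a concrete illustration of the adaptive sample size idea described at the end of the abstract, the sketch below solves a regularized logistic regression ERM by doubling the training subset and taking a single Newton step per stage, warm-started at the previous solution. This is a hedged sketch under assumed choices (the logistic loss, the initial subset size n0, the regularizer lam, and the assumption that the rows of X are already shuffled), not the thesis code.

        import numpy as np

        def logistic_grad_hess(w, X, y, lam):
            """Gradient and Hessian of the l2-regularized logistic loss on (X, y),
            with labels y in {-1, +1}."""
            p = 1.0 / (1.0 + np.exp(-y * (X @ w)))          # probability of the correct label
            g = X.T @ (-(1.0 - p) * y) / len(y) + lam * w
            H = (X.T * (p * (1.0 - p))) @ X / len(y) + lam * np.eye(len(w))
            return g, H

        def adaptive_sample_size_newton(X, y, lam=1e-3, n0=128):
            """Grow the training set geometrically; at each stage, warm start from
            the previous solution and take one Newton step on the current subset."""
            n, d = X.shape
            w = np.zeros(d)
            m = min(n0, n)
            while True:
                g, H = logistic_grad_hess(w, X[:m], y[:m], lam)
                w = w - np.linalg.solve(H, g)   # single Newton iteration per stage
                if m == n:
                    return w
                m = min(2 * m, n)               # double the sample size

    The warm start is what makes the single Newton step plausible: as the abstract notes, the minimizer over the smaller subset is expected to lie within the quadratic convergence region of the ERM problem over the doubled subset.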

    Accelerating Incremental Gradient Optimization with Curvature Information

    This paper studies an acceleration technique for the incremental aggregated gradient ({\sf IAG}) method through the use of \emph{curvature} information for solving strongly convex finite sum optimization problems. These optimization problems arise in large-scale learning applications. Our technique utilizes a curvature-aided gradient tracking step to produce accurate gradient estimates incrementally using Hessian information. We propose and analyze two methods utilizing the new technique, the curvature-aided IAG ({\sf CIAG}) method and the accelerated CIAG ({\sf A-CIAG}) method, which are analogous to the gradient method and Nesterov's accelerated gradient method, respectively. Setting $\kappa$ to be the condition number of the objective function, we prove R-linear convergence rates of $1 - \frac{4 c_0 \kappa}{(\kappa+1)^2}$ for the {\sf CIAG} method and $1 - \sqrt{\frac{c_1}{2\kappa}}$ for the {\sf A-CIAG} method, where $c_0, c_1 \leq 1$ are constants inversely proportional to the distance between the initial point and the optimal solution. When the initial iterate is close to the optimal solution, the R-linear convergence rates match those of the gradient and accelerated gradient methods, albeit {\sf CIAG} and {\sf A-CIAG} operate in an incremental setting with strictly lower computation complexity. Numerical experiments confirm our findings. The source codes used for this paper can be found at \url{http://github.com/hoitowai/ciag/}.
    Comment: 22 pages, 3 figures, 3 tables. Accepted by Computational Optimization and Applications, to appear
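
    To make the curvature-aided gradient tracking step concrete, here is a minimal sketch of a CIAG-style recursion (not the authors' released code at the URL above): each component function contributes a first-order Taylor model of its gradient around the point at which it was last visited, and the aggregate of these models is kept in closed form as b + H x. The interface grad_i/hess_i, the cyclic component order, and the constant step size are assumptions made for the example.

        import numpy as np

        def ciag(grad_i, hess_i, n, x0, step, iters):
            """grad_i(i, x) and hess_i(i, x) return the gradient and Hessian of the
            i-th component function. The estimate of the full gradient at x is
            tracked as b + H @ x, where b and H aggregate the Taylor expansions of
            all components around their last visit points."""
            x = x0.copy()
            thetas = [x0.copy() for _ in range(n)]        # last visit point of each component
            gs = [grad_i(i, x0) for i in range(n)]        # stored component gradients
            Hs = [hess_i(i, x0) for i in range(n)]        # stored component Hessians
            H = sum(Hs)
            b = sum(gs[i] - Hs[i] @ thetas[i] for i in range(n))
            for k in range(iters):
                i = k % n                                 # cyclic component selection
                # remove the stale Taylor model of component i ...
                b = b - (gs[i] - Hs[i] @ thetas[i])
                H = H - Hs[i]
                # ... and replace it with a fresh expansion around the current iterate
                thetas[i] = x.copy()
                gs[i], Hs[i] = grad_i(i, x), hess_i(i, x)
                b = b + (gs[i] - Hs[i] @ thetas[i])
                H = H + Hs[i]
                # curvature-aided estimate of the full gradient at x is b + H @ x
                x = x - step * (b + H @ x)
            return x

    The sketch stores dense Hessians for clarity; a practical implementation would exploit structure (e.g., rank-one Hessians in generalized linear models) to keep the per-iteration cost low.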