2,698 research outputs found

    Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

    The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning. It is written with an INFORMS audience in mind, specifically those readers who are familiar with the basics of optimization algorithms, but less familiar with machine learning. We begin by deriving a formulation of a supervised learning problem and show how it leads to various optimization problems, depending on the context and underlying assumptions. We then discuss some of the distinctive features of these optimization problems, focusing on the examples of logistic regression and the training of deep neural networks. The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss the use of first-order methods, the stochastic gradient method, variance-reducing stochastic methods, and second-order methods. Finally, we discuss how these approaches can be employed in the training of deep neural networks, emphasizing the difficulties that arise from the complex, nonconvex structure of these models.
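
    As a concrete instance of the kind of problem the tutorial derives, the sketch below (our illustration, not taken from the paper) sets up $\ell_2$-regularized logistic regression and minimizes it with a basic first-order method; the synthetic data, stepsize, and regularization constant are illustrative assumptions.

```python
import numpy as np

def logistic_loss(w, X, y, lam):
    """Average logistic loss with l2 regularization; labels y are in {-1, +1}."""
    z = y * (X @ w)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (w @ w)

def logistic_grad(w, X, y, lam):
    """Gradient of logistic_loss with respect to w."""
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))          # d/dz log(1 + exp(-z)) = -1/(1 + exp(z))
    return X.T @ s / len(y) + lam * w

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200))

# Plain gradient descent with a fixed stepsize
w, lam = np.zeros(5), 1e-3
for _ in range(500):
    w -= 1.0 * logistic_grad(w, X, y, lam)
print(logistic_loss(w, X, y, lam))
```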

    Primal-Dual Active-Set Methods for Isotonic Regression and Trend Filtering

    Isotonic regression (IR) is a non-parametric calibration method used in supervised learning. For performing large-scale IR, we propose a primal-dual active-set (PDAS) algorithm which, in contrast to the state-of-the-art Pool Adjacent Violators (PAV) algorithm, can be parallelized and is easily warm-started, making it well-suited to online settings. We prove that, like the PAV algorithm, our PDAS algorithm for IR is convergent and has a work complexity of O(n), though our numerical experiments suggest that our PDAS algorithm is often faster than PAV. In addition, we propose PDAS variants (with safeguarding to ensure convergence) for solving related trend filtering (TF) problems, and provide the results of experiments to illustrate their effectiveness.
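
    The PDAS algorithm itself is developed in the paper; as a baseline for comparison, here is a minimal sketch of the PAV algorithm for the unweighted least-squares IR problem (the unweighted case is our simplifying assumption).

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators for unweighted isotonic regression: returns
    the nondecreasing fit minimizing sum_i (f_i - y_i)^2 in O(n) time."""
    blocks = []                     # each block holds [mean, size]
    for v in y:
        blocks.append([float(v), 1])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2])
    return np.concatenate([[v] * w for v, w in blocks])

print(pav(np.array([1.0, 3.0, 2.0, 4.0])))  # -> [1.  2.5 2.5 4. ]
```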

    Regional Complexity Analysis of Algorithms for Nonconvex Smooth Optimization

    A strategy is proposed for characterizing the worst-case performance of algorithms for solving nonconvex smooth optimization problems. Contemporary analyses characterize worst-case performance by providing, under certain assumptions on an objective function, an upper bound on the number of iterations (or function or derivative evaluations) required until a pth-order stationarity condition is approximately satisfied. This arguably leads to conservative characterizations based on anomalous objectives rather than on ones that are typically encountered in practice. By contrast, the strategy proposed in this paper characterizes worst-case performance separately over regions comprising a search space. These regions are defined generically based on properties of derivative values. In this manner, one can analyze the worst-case performance of an algorithm independently from any particular class of objectives. Then, once given a class of objectives, one can obtain an informative, fine-tuned complexity analysis merely by delineating the types of regions that comprise the search spaces for functions in the class. Regions defined by first- and second-order derivatives are discussed in detail and example complexity analyses are provided for a few fundamental first- and second-order algorithms when employed to minimize convex and nonconvex objectives of interest. It is also explained how the strategy can be generalized to regions defined by higher-order derivatives and for analyzing the behavior of higher-order algorithms.
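
    To make the notion of derivative-defined regions concrete, the sketch below (our illustration; the thresholds are arbitrary placeholders) classifies a point by its gradient norm and the leftmost eigenvalue of its Hessian, the first- and second-order quantities on which such regions can be based.

```python
import numpy as np

def region(grad, hess, eps_g=1e-2, eps_h=1e-2):
    """Classify a point by first- and second-order derivative values.
    The thresholds eps_g and eps_h are illustrative, not from the paper."""
    gnorm = np.linalg.norm(grad)
    lmin = np.linalg.eigvalsh(hess)[0]   # leftmost Hessian eigenvalue
    if gnorm > eps_g:
        return "large-gradient region"
    if lmin < -eps_h:
        return "negative-curvature region"
    return "approximate second-order stationarity"

# Example on f(x) = x0^2 - x1^2, which has a saddle point at the origin
x = np.array([0.0, 0.0])
g = np.array([2 * x[0], -2 * x[1]])
H = np.diag([2.0, -2.0])
print(region(g, H))   # -> negative-curvature region
```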

    Exploiting Negative Curvature in Deterministic and Stochastic Optimization

    This paper addresses the question of whether it can be beneficial for an optimization algorithm to follow directions of negative curvature. Although prior work has established convergence results for algorithms that integrate both descent and negative curvature steps, there has not yet been extensive numerical evidence showing that such methods offer consistent performance improvements. In this paper, we present new frameworks for combining descent and negative curvature directions: alternating two-step approaches and dynamic step approaches. The aspect that distinguishes our approaches from ones previously proposed is that they make algorithmic decisions based on (estimated) upper-bounding models of the objective function. A consequence of this aspect is that our frameworks can, in theory, employ fixed stepsizes, which makes the methods readily translatable from deterministic to stochastic settings. For deterministic problems, we show that instances of our dynamic framework yield gains in performance compared to related methods that only follow descent steps. We also show that gains can be made in a stochastic setting in cases when a standard stochastic-gradient-type method might make slow progress.
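
    The sketch below (our simplified reading, not the paper's frameworks) shows the basic ingredient such approaches build on: alternating a descent step with a step along an eigenvector associated with the leftmost (negative) Hessian eigenvalue; the fixed stepsizes alpha and beta are placeholders for the values an upper-bounding model would prescribe.

```python
import numpy as np

def two_step(x, grad, hess, alpha=0.1, beta=0.1):
    """One alternating two-step iteration: a gradient descent step followed,
    when negative curvature is present, by a negative curvature step."""
    x = x - alpha * grad(x)                       # descent step
    vals, vecs = np.linalg.eigh(hess(x))
    if vals[0] < 0:                               # leftmost eigenvalue is negative
        d = vecs[:, 0]
        d = -d if grad(x) @ d > 0 else d          # orient as a non-ascent direction
        x = x + beta * abs(vals[0]) * d           # curvature-scaled step
    return x

# Example: escape the saddle of f(x) = x0^2 - x1^2 from a point near the origin
grad = lambda x: np.array([2 * x[0], -2 * x[1]])
hess = lambda x: np.diag([2.0, -2.0])
x = np.array([1e-3, 1e-6])
for _ in range(20):
    x = two_step(x, grad, hess)
print(x)   # x[1] has grown: the negative curvature direction was followed
```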

    Optimization Methods for Large-Scale Machine Learning

    This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
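
    For reference, here is a minimal sketch of the SG method at the center of the review (the least-squares data and the fixed stepsize are illustrative assumptions; with a constant stepsize the iterates converge only to a neighborhood of the solution, a behavior the paper's theory makes precise).

```python
import numpy as np

# Synthetic least-squares problem: f(x) = (1/2n) ||Ax - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.1 * rng.standard_normal(1000)

x, alpha = np.zeros(10), 0.01          # fixed stepsize (illustrative)
for _ in range(5000):
    i = rng.integers(len(b))           # sample one data point
    g = (A[i] @ x - b[i]) * A[i]       # unbiased estimate of the full gradient
    x -= alpha * g
print(np.linalg.norm(x - x_true))      # small, but limited by the noise floor
```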

    ADMM for Multiaffine Constrained Optimization

    We expand the scope of the alternating direction method of multipliers (ADMM). Specifically, we show that ADMM, when employed to solve problems with multiaffine constraints that satisfy certain verifiable assumptions, converges to the set of constrained stationary points if the penalty parameter in the augmented Lagrangian is sufficiently large. When the Kurdyka-Łojasiewicz (K-Ł) property holds, this is strengthened to convergence to a single constrained stationary point. Our analysis applies under assumptions that we have endeavored to make as weak as possible. It applies to problems that involve nonconvex and/or nonsmooth objective terms, in addition to the multiaffine constraints, which can involve multiple (three or more) blocks of variables. To illustrate the applicability of our results, we describe examples including nonnegative matrix factorization, sparse learning, risk parity portfolio selection, nonconvex formulations of convex problems, and neural network training. In each case, our ADMM approach encounters only subproblems that have closed-form solutions.
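
    For background, the sketch below shows classical two-block ADMM on a lasso problem (a sparse-learning problem of the kind mentioned above, but with a plain linear constraint rather than a multiaffine one); both subproblems have closed-form solutions, mirroring the property highlighted in the abstract. The values of lam and rho are illustrative.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """Two-block ADMM for min 0.5||Ax-b||^2 + lam||z||_1 subject to x = z."""
    n = A.shape[1]
    x = z = u = np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # cached for the x-subproblem
    for _ in range(iters):
        x = M @ (A.T @ b + rho * (z - u))          # x-subproblem (closed form)
        z = soft(x + u, lam / rho)                 # z-subproblem (closed form)
        u = u + x - z                              # scaled dual update
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ (rng.standard_normal(20) * (rng.random(20) < 0.3))  # sparse ground truth
print(np.count_nonzero(admm_lasso(A, b)))
```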

    A Stochastic Trust Region Algorithm Based on Careful Step Normalization

    An algorithm is proposed for solving stochastic and finite sum minimization problems. Based on a trust region methodology, the algorithm employs normalized steps, at least as long as the norms of the stochastic gradient estimates are within a specified interval. The complete algorithm, which dynamically chooses whether or not to employ normalized steps, is proved to have convergence guarantees that are similar to those possessed by a traditional stochastic gradient approach under various sets of conditions related to the accuracy of the stochastic gradient estimates and choice of stepsize sequence. The results of numerical experiments are presented when the method is employed to minimize convex and nonconvex machine learning test problems. These results illustrate that the method can outperform a traditional stochastic gradient approach.
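
    A minimal sketch of the step rule as we read it (the interval bounds, stepsize, and test problem are placeholders, and the full algorithm's dynamic safeguards are omitted): the step is normalized, and hence trust-region-like in length, whenever the stochastic gradient estimate's norm falls within the specified interval.

```python
import numpy as np

def normalized_sg_step(x, g, alpha, lo=1e-3, hi=1e3):
    """Take a normalized step of length alpha when the stochastic gradient
    norm lies in [lo, hi]; fall back to a plain SG step otherwise."""
    gnorm = np.linalg.norm(g)
    if lo <= gnorm <= hi:
        return x - alpha * (g / gnorm)   # trust-region-like normalized step
    return x - alpha * g                 # unnormalized stochastic gradient step

# Example with a noisy gradient of f(x) = 0.5 ||x||^2
rng = np.random.default_rng(0)
x = np.ones(4)
for _ in range(100):
    g = x + 0.1 * rng.standard_normal(4)     # stochastic gradient estimate
    x = normalized_sg_step(x, g, alpha=0.05)
print(np.linalg.norm(x))
```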

    A Reduced-Space Algorithm for Minimizing $\ell_1$-Regularized Convex Functions

    We present a new method for minimizing the sum of a differentiable convex function and an $\ell_1$-norm regularizer. The main features of the new method include: (i) an evolving set of indices corresponding to variables that are predicted to be nonzero at a solution (i.e., the support); (ii) a reduced-space subproblem defined in terms of the predicted support; (iii) conditions that determine how accurately each subproblem must be solved, which allow for Newton, Newton-CG, and coordinate-descent techniques to be employed; (iv) a computationally practical condition that determines when the predicted support should be updated; and (v) a reduced proximal gradient step that ensures sufficient decrease in the objective function when it is decided that variables should be added to the predicted support. We prove a convergence guarantee for our method and demonstrate its efficiency on a large set of model prediction problems.
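
    A minimal sketch of the two central ingredients as applied to an $\ell_1$-regularized least-squares problem (our illustration; the paper's conditions (i)-(v), including the inexactness and safeguarding tests, are not reproduced): a proximal gradient step predicts the support, and a Newton step is taken on the reduced problem.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def reduced_space_step(x, A, b, lam, t=5e-3):
    """One outer iteration for min 0.5||Ax-b||^2 + lam||x||_1: predict the
    support via a proximal gradient step, then take a Newton step on the
    reduced (support-only) problem with the signs of x fixed."""
    g = A.T @ (A @ x - b)                          # gradient of the smooth part
    x = soft(x - t * g, t * lam)                   # support prediction
    S = np.flatnonzero(x)
    if S.size:                                     # reduced Newton step
        AS = A[:, S]
        rhs = AS.T @ (AS @ x[S] - b) + lam * np.sign(x[S])
        x[S] -= np.linalg.solve(AS.T @ AS + 1e-8 * np.eye(S.size), rhs)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 15))
b = A @ (rng.standard_normal(15) * (rng.random(15) < 0.3))  # sparse ground truth
x = np.zeros(15)
for _ in range(100):
    x = reduced_space_step(x, A, b, lam=0.1)
print(np.flatnonzero(x))   # the predicted support
```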

    Complexity Analysis of a Trust Funnel Algorithm for Equality Constrained Optimization

    A method is proposed for solving equality constrained nonlinear optimization problems involving twice continuously differentiable functions. The method employs a trust funnel approach consisting of two phases: a first phase to locate an $\epsilon$-feasible point and a second phase to seek optimality while maintaining at least $\epsilon$-feasibility. A two-phase approach of this kind based on a cubic regularization methodology was recently proposed along with a supporting worst-case iteration complexity analysis. However, in that approach, the objective function is completely ignored in the first phase when $\epsilon$-feasibility is sought. The main contribution of the method proposed in this paper is that the same worst-case iteration complexity is achieved, but with a first phase that also accounts for improvements in the objective function. As such, the method typically requires fewer iterations in the second phase, as the results of numerical experiments demonstrate.
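
    The sketch below is only a toy rendering of the two-phase structure (our illustration: gradient steps on the constraint violation in phase 1, then tangential descent steps with a feasibility correction in phase 2; the trust funnel machinery and the complexity guarantees are not reproduced).

```python
import numpy as np

def two_phase(f_grad, c, c_jac, x, eps=1e-4, alpha=1e-2, iters=2000):
    """Toy two-phase scheme for min f(x) subject to c(x) = 0."""
    # Phase 1: drive ||c(x)|| below eps via gradient steps on 0.5||c(x)||^2.
    for _ in range(iters):
        if np.linalg.norm(c(x)) <= eps:
            break
        x = x - alpha * (c_jac(x).T @ c(x))
    # Phase 2: reduce f while correcting the residual constraint violation.
    for _ in range(iters):
        J, g = c_jac(x), f_grad(x)
        g_tan = g - J.T @ np.linalg.solve(J @ J.T, J @ g)  # tangential component
        x = x - alpha * (g_tan + J.T @ c(x))               # descent + correction
    return x

# Example: minimize x0 + x1 subject to x0^2 + x1^2 - 1 = 0
f_grad = lambda x: np.array([1.0, 1.0])
c = lambda x: np.array([x @ x - 1.0])
c_jac = lambda x: 2.0 * x.reshape(1, -1)
print(two_phase(f_grad, c, c_jac, np.array([2.0, 0.0])))  # ~ [-0.707, -0.707]
```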

    An Inexact Regularized Newton Framework with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization

    An algorithm for solving smooth nonconvex optimization problems is proposed that, in the worst case, takes $\mathcal{O}(\epsilon^{-3/2})$ iterations to drive the norm of the gradient of the objective function below a prescribed positive real number $\epsilon$ and can take $\mathcal{O}(\epsilon^{-3})$ iterations to drive the leftmost eigenvalue of the Hessian of the objective above $-\epsilon$. The proposed algorithm is a general framework that covers a wide range of techniques including quadratically and cubically regularized Newton methods, such as the Adaptive Regularisation using Cubics (ARC) method and the recently proposed Trust-Region Algorithm with Contractions and Expansions (TRACE). The generality of our method is achieved through the introduction of generic conditions that each trial step is required to satisfy, which in particular allow for inexact regularized Newton steps to be used. These conditions center around a new subproblem that can be approximately solved to obtain trial steps that satisfy the conditions. A new instance of the framework, distinct from ARC and TRACE, is described that may be viewed as a hybrid between quadratically and cubically regularized Newton methods. Numerical results demonstrate that our hybrid algorithm outperforms a cubically regularized Newton method.
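
    A minimal sketch of a quadratically regularized Newton iteration (our illustration; the framework's generic step conditions, inexact solves, and the complexity-ensuring regularizer updates are not reproduced): the regularization parameter sigma is increased until the trial step yields sufficient decrease.

```python
import numpy as np

def reg_newton(f, grad, hess, x, sigma=1.0, tol=1e-6, iters=100):
    """Quadratically regularized Newton: solve (H + sigma*I)s = -g and
    adapt sigma so that each accepted step gives sufficient decrease."""
    for _ in range(iters):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        while True:
            try:
                s = np.linalg.solve(H + sigma * np.eye(len(x)), -g)
            except np.linalg.LinAlgError:
                sigma *= 2.0
                continue
            if f(x + s) <= f(x) + 0.5 * (s @ g):   # sufficient decrease test
                break
            sigma *= 2.0                            # regularize more and retry
        x = x + s
        sigma = max(sigma / 2.0, 1e-8)              # relax the regularization
    return x

# Example: the nonconvex Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                           [-400 * x[0], 200.0]])
print(reg_newton(f, grad, hess, np.array([-1.2, 1.0])))  # ~ [1, 1]
```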