
    Optimization Methods for Large-Scale Machine Learning

    This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
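
    The basic SG iteration the paper analyzes is w_{k+1} = w_k - alpha_k * g(w_k, xi_k), where g is a stochastic gradient built from a sampled example or mini-batch. Below is a minimal sketch on a synthetic least-squares problem; the data, step size, and batch size are illustrative assumptions, not values from the paper.

    import numpy as np

    def sgd_least_squares(A, b, step=0.01, batch=32, iters=2000, seed=0):
        """Plain SG iteration w <- w - step * g_S(w), where g_S is the gradient of
        0.5 * ||A_S w - b_S||^2 / |S| on a randomly sampled mini-batch S."""
        rng = np.random.default_rng(seed)
        n, p = A.shape
        w = np.zeros(p)
        for _ in range(iters):
            S = rng.integers(0, n, size=batch)   # sample a mini-batch
            r = A[S] @ w - b[S]                  # residuals on the batch
            g = A[S].T @ r / batch               # stochastic gradient estimate
            w -= step * g                        # SG step
        return w

    # Illustrative usage on synthetic data
    rng = np.random.default_rng(1)
    A = rng.normal(size=(1000, 20))
    w_true = rng.normal(size=20)
    b = A @ w_true + 0.01 * rng.normal(size=1000)
    print(np.linalg.norm(sgd_least_squares(A, b) - w_true))   # small estimation error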

    Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks

    Progress in deep learning is slowed by the days or weeks it takes to train large models. The natural solution of using more hardware is limited by diminishing returns, and leads to inefficient use of additional resources. In this paper, we present a large batch, stochastic optimization algorithm that is both faster than widely used algorithms for fixed amounts of computation, and also scales up substantially better as more computational resources become available. Our algorithm implicitly computes the inverse Hessian of each mini-batch to produce descent directions; we do so without either an explicit approximation to the Hessian or Hessian-vector products. We demonstrate the effectiveness of our algorithm by successfully training large ImageNet models (Inception-V3, Resnet-50, Resnet-101 and Inception-Resnet-V2) with mini-batch sizes of up to 32000 with no loss in validation error relative to current baselines, and no increase in the total number of steps. At smaller mini-batch sizes, our optimizer improves the validation error in these models by 0.8-0.9%. Alternatively, we can trade off this accuracy to reduce the number of training steps needed by roughly 10-30%. Our work is practical and easily usable by others -- only one hyperparameter (learning rate) needs tuning, and furthermore, the algorithm is as computationally cheap as the commonly used Adam optimizer.
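
    The abstract does not spell out the update rule, but the optimizer's name points to the Neumann-series identity H^{-1} = eta * sum_{k>=0} (I - eta*H)^k, which holds when the eigenvalues of (I - eta*H) lie strictly inside the unit circle. The sketch below only verifies that identity numerically on a small positive definite matrix; it is not the paper's training algorithm, which avoids explicit Hessians and Hessian-vector products entirely.

    import numpy as np

    def neumann_inverse(H, eta, terms=500):
        """Approximate H^{-1} by the truncated Neumann series eta * sum_k (I - eta*H)^k,
        which converges when the spectral radius of (I - eta*H) is below one."""
        p = H.shape[0]
        M = np.eye(p) - eta * H
        acc, power = np.eye(p), np.eye(p)
        for _ in range(terms):
            power = power @ M
            acc += power
        return eta * acc

    # Small symmetric positive definite test matrix
    rng = np.random.default_rng(0)
    B = rng.normal(size=(5, 5))
    H = B @ B.T + np.eye(5)
    eta = 1.0 / np.linalg.eigvalsh(H).max()   # guarantees convergence of the series
    print(np.max(np.abs(neumann_inverse(H, eta) - np.linalg.inv(H))))   # close to zero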

    Convergence rates of sub-sampled Newton methods

    We consider the problem of minimizing a sum of n functions over a convex parameter set C ⊂ R^p, where n ≫ p ≫ 1. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton's method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.
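
    A rough sketch of the sub-sampled Hessian idea for this finite-sum setting, on a regularized least-squares objective: the Hessian is estimated from a random subset of the n terms while the full gradient drives the step. The objective, regularizer, and sample size are assumptions for illustration; the paper's algorithm additionally uses a low-rank approximation and an analyzed step-size rule.

    import numpy as np

    def subsampled_newton(A, b, sample=100, lam=1e-3, iters=20, seed=0):
        """Minimize f(x) = (1/2n) * ||A x - b||^2 + (lam/2) * ||x||^2 using Newton steps
        whose Hessian (1/|S|) * A_S^T A_S + lam*I is built from a random row subset S."""
        rng = np.random.default_rng(seed)
        n, p = A.shape
        x = np.zeros(p)
        for _ in range(iters):
            g = A.T @ (A @ x - b) / n + lam * x            # full gradient
            S = rng.choice(n, size=sample, replace=False)  # sub-sample the Hessian
            H = A[S].T @ A[S] / sample + lam * np.eye(p)
            x -= np.linalg.solve(H, g)                     # sub-sampled Newton step
        return x

    # Illustrative usage on synthetic data
    rng = np.random.default_rng(1)
    A = rng.normal(size=(5000, 20))
    x_true = rng.normal(size=20)
    b = A @ x_true + 0.01 * rng.normal(size=5000)
    print(np.linalg.norm(subsampled_newton(A, b) - x_true))   # small error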

    A Stochastic Quasi-Newton Method for Large-Scale Optimization

    The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.
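
    A compressed sketch of that recipe for binary logistic regression: stochastic gradient steps whose direction is shaped by L-BFGS curvature pairs (s, y), with y obtained every L iterations from a sub-sampled Hessian-vector product at the averaged iterate. The batch sizes, step size, and logistic model are assumptions of this sketch, not the authors' exact algorithm or its safeguards.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_loop(grad, pairs):
        """Standard L-BFGS two-loop recursion applied to the stored (s, y) pairs."""
        q = grad.copy()
        alphas = []
        for s, y in reversed(pairs):
            rho = 1.0 / (y @ s)
            a = rho * (s @ q)
            alphas.append((a, rho, s, y))
            q -= a * y
        if pairs:
            s, y = pairs[-1]
            q *= (s @ y) / (y @ y)            # initial Hessian scaling
        for a, rho, s, y in reversed(alphas):
            q += s * (a - rho * (y @ q))
        return q

    def sqn_logistic(X, t, step=0.5, batch=64, hess_batch=256, L=10, mem=10,
                     iters=2000, seed=0):
        """SQN-style loop: stochastic gradient steps for logistic regression (labels in
        {0,1}), with curvature pairs collected every L iterations via a sub-sampled
        Hessian-vector product at the averaged iterate."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        w = np.zeros(p)
        pairs, w_bar, w_bar_prev = [], np.zeros(p), None
        for k in range(1, iters + 1):
            S = rng.integers(0, n, size=batch)
            g = X[S].T @ (sigmoid(X[S] @ w) - t[S]) / batch   # stochastic gradient
            w -= step * two_loop(g, pairs)
            w_bar += w / L
            if k % L == 0:                                    # update curvature pairs
                if w_bar_prev is not None:
                    s = w_bar - w_bar_prev
                    SH = rng.integers(0, n, size=hess_batch)
                    mu = sigmoid(X[SH] @ w)
                    y = X[SH].T @ ((mu * (1 - mu)) * (X[SH] @ s)) / hess_batch
                    if s @ y > 1e-10:
                        pairs.append((s, y))
                        pairs = pairs[-mem:]                  # limited memory
                w_bar_prev, w_bar = w_bar, np.zeros(p)
        return w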

    Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods

    This paper presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval h based on the noise estimation techniques of Hamming (2012) and Moré and Wild (2011). This noise estimation procedure and the selection of h are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a function interpolating trust region method are presented. Comment: 26 pages, 9 figures.
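
    The trade-off behind choosing h can be made concrete: a forward difference with noise of size sigma_f in each function value and |f''| bounded by mu2 has error roughly h*mu2/2 + 2*sigma_f/h, which is minimized at h = 2*sqrt(sigma_f/mu2). The sketch below uses that textbook balance with sigma_f and mu2 assumed known; the paper instead estimates the noise from sampled function values and adds the recovery mechanism described above.

    import numpy as np

    def fd_gradient(f, x, sigma_f, mu2):
        """Forward-difference gradient of a noisy function, with the interval h chosen
        to balance truncation error (~ h*mu2/2) against noise error (~ 2*sigma_f/h)."""
        h = 2.0 * np.sqrt(sigma_f / mu2)
        fx = f(x)
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - fx) / h
        return g

    # Illustrative noisy quadratic: f(x) = 0.5*||x||^2 plus uniform noise of size ~1e-4
    rng = np.random.default_rng(0)
    noisy_f = lambda x: 0.5 * x @ x + 1e-4 * rng.uniform(-1, 1)
    x0 = np.ones(5)
    print(fd_gradient(noisy_f, x0, sigma_f=1e-4, mu2=1.0))   # approximately equal to x0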

    Preconditioned Stochastic Gradient Descent

    Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in significantly more complicated methods than SGD. This paper proposes a new method to estimate a preconditioner such that the amplitudes of perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to the Newton method for deterministic optimization. Unlike preconditioners based on secant equation fitting, as done in deterministic quasi-Newton methods, which assume a positive definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and non-convex optimization with exact or noisy gradients. When stochastic gradients are used, it can naturally damp the gradient noise to stabilize SGD. Efficient preconditioner estimation methods are developed, and with reasonable simplifications they are applicable to large-scale problems. Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems, such as training a deep neural network or a recurrent neural network requiring extremely long-term memories. Comment: 13 pages, 9 figures. To appear in IEEE Transactions on Neural Networks and Learning Systems. Supplemental materials on https://sites.google.com/site/lixilinx/home/psg
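
    One way to read the amplitude-matching criterion is a diagonal special case: choose p_i so that p_i * |dg_i| has roughly the magnitude of |dtheta_i|, where dtheta is a small parameter perturbation and dg is the gradient perturbation it induces, i.e. p_i ~ sqrt(E[dtheta_i^2] / E[dg_i^2]). The sketch below implements only that diagonal reading on a toy quadratic with exact gradients; the paper develops more general preconditioner estimators that also handle gradient noise.

    import numpy as np

    def psgd_diag(grad, theta0, step=0.1, probe=1e-4, beta=0.95, iters=500, seed=0):
        """Gradient descent with a diagonal preconditioner fitted online so that
        preconditioned gradient perturbations have roughly the same amplitude as the
        parameter perturbations that caused them (a Newton-like scaling)."""
        rng = np.random.default_rng(seed)
        theta = theta0.astype(float).copy()
        v_dt = np.zeros_like(theta)     # running average of dtheta_i^2
        v_dg = np.zeros_like(theta)     # running average of dg_i^2
        for _ in range(iters):
            dtheta = probe * rng.normal(size=theta.shape)   # small random probe
            dg = grad(theta + dtheta) - grad(theta)         # induced gradient change
            v_dt = beta * v_dt + (1 - beta) * dtheta ** 2
            v_dg = beta * v_dg + (1 - beta) * dg ** 2
            P = np.sqrt((v_dt + 1e-20) / (v_dg + 1e-20))    # diagonal preconditioner
            theta -= step * P * grad(theta)                 # preconditioned step
        return theta

    # Badly scaled quadratic f(theta) = 0.5 * sum(c_i * theta_i^2), gradient c * theta
    c = np.array([1.0, 100.0])
    print(psgd_diag(lambda th: c * th, np.array([1.0, 1.0])))   # both coordinates -> ~0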

    Variable Metric Stochastic Approximation Theory

    We provide a variable metric stochastic approximation theory. In doing so, we establish a convergence theory for a large class of online variable metric methods, including the recently introduced online versions of the BFGS algorithm and its limited-memory LBFGS variant. We also discuss the implications of our results for learning from expert advice. Comment: Correction of Theorem 3.4 from the AISTATS 2009 article.

    Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

    The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning. It is written with an INFORMS audience in mind, specifically those readers who are familiar with the basics of optimization algorithms, but less familiar with machine learning. We begin by deriving a formulation of a supervised learning problem and show how it leads to various optimization problems, depending on the context and underlying assumptions. We then discuss some of the distinctive features of these optimization problems, focusing on the examples of logistic regression and the training of deep neural networks. The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss the use of first-order methods, the stochastic gradient method, variance reducing stochastic methods, and second-order methods. Finally, we discuss how these approaches can be applied to the training of deep neural networks, emphasizing the difficulties that arise from the complex, nonconvex structure of these models.
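
    A minimal version of the convex logistic regression problem the tutorial uses as an example, with an L2 regularizer and labels in {-1, +1} assumed for this sketch, together with a plain first-order (gradient descent) loop.

    import numpy as np

    def logistic_loss_grad(w, X, y, lam=1e-2):
        """Regularized logistic regression with labels y_i in {-1, +1}:
        f(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (lam/2) * ||w||^2.
        Returns the loss value and its gradient."""
        n = X.shape[0]
        margins = y * (X @ w)
        loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)
        sigma = np.exp(-np.logaddexp(0.0, margins))   # numerically stable sigmoid(-margins)
        grad = -(X.T @ (y * sigma)) / n + lam * w
        return loss, grad

    def gradient_descent(X, y, step=1.0, iters=200):
        """Plain first-order method on the convex objective above."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            _, g = logistic_loss_grad(w, X, y)
            w -= step * g
        return w

    # Illustrative usage on synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200))
    print(logistic_loss_grad(gradient_descent(X, y), X, y)[0])   # final training loss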

    A fast quasi-Newton-type method for large-scale stochastic optimisation

    During recent years there has been an increased interest in stochastic adaptations of limited memory quasi-Newton methods, which compared to pure gradient-based routines can improve the convergence by incorporating second order information. In this work we propose a direct least-squares approach conceptually similar to the limited memory quasi-Newton methods, but that computes the search direction in a slightly different way. This is achieved in a fast and numerically robust manner by maintaining a Cholesky factor of low dimension. This is combined with a stochastic line search relying upon fulfilment of the Wolfe condition in a backtracking manner, where the step length is adaptively modified with respect to the optimisation progress. We support our new algorithm by providing several theoretical results guaranteeing its performance. The performance is demonstrated on real-world benchmark problems, where it shows improved results in comparison with already established methods. Comment: arXiv admin note: substantial text overlap with arXiv:1802.0431
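
    The line search component can be sketched independently of the quasi-Newton part: backtrack from an initial step length until the (weak) Wolfe conditions hold along a descent direction. The deterministic function and gradient below stand in for the stochastic estimates the paper actually uses, and the constants c1, c2, and the shrink factor are conventional choices, not the authors' adaptive scheme.

    import numpy as np

    def backtracking_wolfe(f, grad, x, d, alpha0=1.0, c1=1e-4, c2=0.9, shrink=0.5,
                           max_tries=30):
        """Backtracking line search that accepts the first step length satisfying the
        weak Wolfe conditions along the descent direction d:
          sufficient decrease: f(x + a d) <= f(x) + c1 * a * g^T d
          curvature:           grad(x + a d)^T d >= c2 * g^T d
        Returns the accepted step length (or the last trial if none is accepted)."""
        fx, gTd = f(x), grad(x) @ d
        alpha = alpha0
        for _ in range(max_tries):
            x_new = x + alpha * d
            if f(x_new) <= fx + c1 * alpha * gTd and grad(x_new) @ d >= c2 * gTd:
                return alpha
            alpha *= shrink        # backtrack
        return alpha

    # Illustrative use on a quadratic with a steepest-descent direction
    Q = np.diag([1.0, 10.0])
    f = lambda x: 0.5 * x @ Q @ x
    g = lambda x: Q @ x
    x0 = np.array([1.0, 1.0])
    print(backtracking_wolfe(f, g, x0, -g(x0)))   # accepted step length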

    Hybrid optimization and Bayesian inference techniques for a non-smooth radiation detection problem

    In this investigation, we propose several algorithms to recover the location and intensity of a radiation source located in a simulated 250 m x 180 m block in an urban center based on synthetic measurements. Radioactive decay and detection are Poisson random processes, so we employ likelihood functions based on this distribution. Due to the domain geometry and the proposed response model, the negative logarithm of the likelihood is only piecewise continuously differentiable, and it has multiple local minima. To address these difficulties, we investigate three hybrid algorithms composed of mixed optimization techniques. For global optimization, we consider Simulated Annealing (SA), Particle Swarm (PS) and Genetic Algorithm (GA), which rely solely on objective function evaluations; i.e., they do not evaluate the gradient of the objective function. By employing early stopping criteria for the global optimization methods, a pseudo-optimum point is obtained. This is subsequently utilized as the initial value by the deterministic Implicit Filtering method (IF), which is able to find local extrema in non-smooth functions, to finish the search in a narrow domain. These new hybrid techniques combining global optimization and Implicit Filtering address difficulties associated with the non-smooth response, and they are shown to significantly decrease the computational time relative to the global optimization methods alone. To quantify uncertainties associated with the source location and intensity, we employ the Delayed Rejection Adaptive Metropolis (DRAM) and DiffeRential Evolution Adaptive Metropolis (DREAM) algorithms. Marginal densities of the source properties are obtained, and the means of the chains compare accurately with the estimates produced by the hybrid algorithms. Comment: 36 pages, 14 figures.
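
    The Poisson likelihood at the heart of the objective can be written down directly: if detector j records counts c_j and the source model predicts a mean count lambda_j, the negative log-likelihood is sum_j (lambda_j - c_j * log(lambda_j)) up to a constant. The sketch below uses a simplified inverse-square response with hypothetical detector positions and counts; the paper's response model accounts for the urban geometry, which is what makes the objective non-smooth.

    import numpy as np

    def poisson_nll(source_xy, intensity, detectors_xy, counts, dwell=1.0):
        """Negative log-likelihood of observed detector counts under a Poisson model,
        with a simplified inverse-square response (no buildings or attenuation):
          lambda_j = dwell * intensity / ||source - detector_j||^2."""
        d2 = np.sum((detectors_xy - source_xy) ** 2, axis=1)
        lam = dwell * intensity / np.maximum(d2, 1e-6)
        # Drop the constant sum(log(counts!)), which does not affect the optimization
        return np.sum(lam - counts * np.log(np.maximum(lam, 1e-12)))

    # Hypothetical example: 3 detectors in a 250 m x 180 m domain, synthetic counts
    detectors = np.array([[50.0, 40.0], [200.0, 90.0], [120.0, 160.0]])
    counts = np.array([30, 12, 18])
    print(poisson_nll(np.array([100.0, 80.0]), 5e4, detectors, counts))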