1,097 research outputs found

    Stochastic Chebyshev Gradient Descent for Spectral Optimization

    A large class of machine learning techniques requires the solution of optimization problems involving spectral functions of parametric matrices, e.g. log-determinant and nuclear norm. Unfortunately, computing the gradient of a spectral function generally has cubic complexity, so gradient descent methods are rather expensive for optimizing objectives that involve spectral functions. Thus, one naturally turns to stochastic gradient methods in the hope that they will reduce or altogether avoid the computation of full gradients. However, a new challenge appears here: there is no straightforward way to compute unbiased stochastic gradients for spectral functions. In this paper, we develop unbiased stochastic gradients for spectral-sums, an important subclass of spectral functions. Our unbiased stochastic gradients are based on combining randomized trace estimators with stochastic truncation of the Chebyshev expansions. A careful design of the truncation distribution allows us to offer distributions that are variance-optimal, which is crucial for fast and stable convergence of stochastic gradient methods. We further leverage our proposed stochastic gradients to devise stochastic methods for objective functions involving spectral-sums, and rigorously analyze their convergence rate. The utility of our methods is demonstrated in numerical experiments.
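
    The randomized trace estimators mentioned in this abstract can be illustrated with a minimal sketch of the classical Hutchinson estimator, which needs only matrix-vector products. This is not the paper's gradient estimator (it omits the Chebyshev expansion and its stochastic truncation); the function name `hutchinson_trace` and its parameters are assumptions made for illustration.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes=64, seed=0):
    """Estimate tr(A) using only products v -> A v (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        v = rng.choice([-1.0, 1.0], size=dim)
        # E[v^T A v] = tr(A) because E[v v^T] is the identity.
        total += v @ matvec(v)
    return total / num_probes

if __name__ == "__main__":
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(hutchinson_trace(lambda v: A @ v, dim=2))
```

    For a diagonal matrix every probe returns the exact trace, since each squared probe entry equals one; off-diagonal entries are what introduce variance.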

    Nonlinear Acceleration of Momentum and Primal-Dual Algorithms

    We describe convergence acceleration schemes for multistep optimization algorithms. The extrapolated solution is written as a nonlinear average of the iterates produced by the original optimization method. Our analysis does not need the underlying fixed-point operator to be symmetric, hence handles e.g. algorithms with momentum terms such as Nesterov's accelerated method, or primal-dual methods. The weights are computed via a simple linear system and we analyze performance in both online and offline modes. We use Crouzeix's conjecture to show that acceleration performance is controlled by the solution of a Chebyshev problem on the numerical range of a non-symmetric operator modeling the behavior of iterates near the optimum. Numerical experiments are detailed on logistic regression problems.
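
    One way to picture "weights computed via a simple linear system" is the sketch below, which follows the general shape of regularized nonlinear acceleration: solve a small regularized system built from successive differences of the iterates, normalize the solution to sum to one, and average the iterates with those weights. The exact system, the regularization, and the name `nonlinear_average` are assumptions here, not details taken from the abstract.

```python
import numpy as np

def nonlinear_average(iterates, reg=1e-10):
    """Extrapolate from x_0..x_k with weights that sum to one (illustrative)."""
    X = np.asarray(iterates, dtype=float)   # shape (k + 1, d)
    R = np.diff(X, axis=0)                  # residuals r_i = x_{i+1} - x_i
    gram = R @ R.T                          # (k, k) Gram matrix of residuals
    # Small relative regularization keeps the linear system well posed.
    gram = gram + reg * np.linalg.norm(gram, 2) * np.eye(len(R))
    z = np.linalg.solve(gram, np.ones(len(R)))
    weights = z / z.sum()                   # enforce sum(weights) == 1
    return weights @ X[:-1]                 # weighted average of x_0..x_{k-1}

if __name__ == "__main__":
    # Linear fixed-point iteration x <- G x with fixed point 0.
    G = np.diag([0.9, 0.5])
    xs = [np.array([1.0, 1.0])]
    for _ in range(3):
        xs.append(G @ xs[-1])
    print("last iterate:", xs[-1])
    print("extrapolated:", nonlinear_average(xs))
```

    On this toy linear iteration the weights approximately cancel the residuals, so the average lands much closer to the fixed point than the last iterate does.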

    Faster randomized block Kaczmarz algorithms

    The Kaczmarz algorithm is a simple iterative scheme for solving consistent linear systems. At each step, the method projects the current iterate onto the solution space of a single constraint. Hence, it requires very low cost per iteration and storage, and it has a linear rate of convergence. Distributed implementations of Kaczmarz have become, in recent years, the de facto architectural choice for large-scale linear systems. Therefore, in this paper we develop a family of randomized block Kaczmarz algorithms that use a subset of the constraints and extrapolated stepsizes at each step, and that can be deployed on distributed computing units. Our approach is based on several new ideas and tools, including a stochastic selection rule for the blocks of rows, stochastic conditioning of the linear system, and novel strategies for designing extrapolated stepsizes. We prove that the randomized block Kaczmarz algorithm converges linearly in expectation, with a rate depending on the geometric properties of the matrix and its submatrices and on the size of the blocks. Our convergence analysis reveals that the algorithm is most effective when it is given a good sampling of the rows into well-conditioned blocks. Besides providing a general framework for the design and analysis of randomized block Kaczmarz methods, our results resolve an open problem in the literature related to the theoretical understanding of the observed practical efficiency of extrapolated block Kaczmarz methods. Comment: 20 pages
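
    The single-constraint projection step described at the start of this abstract can be sketched as follows. This is the classical single-row randomized Kaczmarz method, not the block variant with extrapolated stepsizes that the paper develops; sampling rows in proportion to their squared norms is a common choice assumed here, and `randomized_kaczmarz` is a hypothetical name.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=500, seed=0):
    """Solve a consistent system A x = b by random row projections."""
    rng = np.random.default_rng(seed)
    row_norms_sq = np.einsum("ij,ij->i", A, A)
    probs = row_norms_sq / row_norms_sq.sum()   # sample rows by squared norm
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        i = rng.choice(len(b), p=probs)
        # Project x onto the hyperplane {y : A[i] @ y = b[i]}.
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

if __name__ == "__main__":
    A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    x_true = np.array([1.0, -2.0])
    print(randomized_kaczmarz(A, A @ x_true))
```

    Each iteration touches one row of `A`, which is the low per-iteration cost and storage the abstract refers to.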

    Stability and Convergence Trade-off of Iterative Optimization Algorithms

    The overall performance or expected excess risk of an iterative machine learning algorithm can be decomposed into training error and generalization error. While the former is controlled by its convergence analysis, the latter can be tightly handled by algorithmic stability. The machine learning community has a rich history investigating convergence and stability separately. However, the question about the trade-off between these two quantities remains open. In this paper, we show that for any iterative algorithm at any iteration, the overall performance is lower bounded by the minimax statistical error over an appropriately chosen loss function class. This implies an important trade-off between convergence and stability of the algorithm: a faster converging algorithm has to be less stable, and vice versa. As a direct consequence of this fundamental trade-off, new convergence lower bounds can be derived for classes of algorithms constrained with different stability bounds. In particular, when the loss function is convex (or strongly convex) and smooth, we discuss the stability upper bounds of gradient descent (GD) and stochastic gradient descent and their variants with decreasing step sizes. For Nesterov's accelerated gradient descent (NAG) and heavy ball method (HB), we provide stability upper bounds for the quadratic loss function. Applying existing stability upper bounds for the gradient methods in our trade-off framework, we obtain lower bounds matching the well-established convergence upper bounds up to constants for these algorithms and conjecture similar lower bounds for NAG and HB. Finally, we numerically demonstrate the tightness of our stability bounds in terms of exponents in the rate and also illustrate via a simulated logistic regression problem that our stability bounds reflect the generalization errors better than the simple uniform convergence bounds for GD and NAG. Comment: 45 pages, 7 figures

    NEON+: Accelerated Gradient Methods for Extracting Negative Curvature for Non-Convex Optimization

    Accelerated gradient (AG) methods are breakthroughs in convex optimization, improving the convergence rate of the gradient descent method for optimization with smooth functions. However, the analysis of AG methods for non-convex optimization is still limited. It remains an open question whether AG methods from convex optimization can accelerate the convergence of the gradient descent method for finding a local minimum of non-convex optimization problems. This paper provides an affirmative answer to this question. In particular, we analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method and Nesterov's Accelerated Gradient method) for extracting the negative curvature from random noise, which is central to escaping from saddle points. By leveraging the proposed AG methods for extracting the negative curvature, we present a new AG algorithm with double loops for non-convex optimization, which converges to a second-order stationary point $x$ such that $\|\nabla f(x)\| \leq \epsilon$ and $\nabla^2 f(x) \geq -\sqrt{\epsilon} I$ with $\widetilde{O}(1/\epsilon^{1.75})$ iteration complexity, improving that of the gradient descent method by a factor of $\epsilon^{-0.25}$ and matching the best iteration complexity of second-order Hessian-free methods for non-convex optimization. (Footnote: this is in contrast to a single-loop AG algorithm proposed in a recent manuscript [AGNON], which directly analyzed Nesterov's AG method for non-convex optimization and appeared online on November 29, 2017. However, we emphasize that our work is independent; it is inspired by our earlier work [NEON17] and is based on a different, novel analysis.) Comment: The main result is merged into our manuscript "First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time" (arXiv:1711.01944).

    Exponential Family Estimation via Adversarial Dynamics Embedding

    We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. We exploit the primal-dual view of the MLE with a kinetics augmented model to obtain an estimate associated with an adversarial dual sampler. To represent this sampler, we introduce a novel neural architecture, dynamics embedding, that generalizes Hamiltonian Monte-Carlo (HMC). The proposed approach inherits the flexibility of HMC while enabling tractable entropy estimation for the augmented model. By learning both a dual sampler and the primal model simultaneously, and sharing parameters between them, we obviate the requirement to design a separate sampling procedure once the model has been trained, leading to more effective learning. We show that many existing estimators, such as contrastive divergence, pseudo/composite-likelihood, score matching, minimum Stein discrepancy estimator, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow, are special cases of the proposed approach, each expressed by a different (fixed) dual sampler. An empirical investigation shows that adapting the sampler during MLE can significantly improve on state-of-the-art estimators. Comment: Appearing in NeurIPS 2019 Vancouver, Canada; a preliminary version published in NeurIPS 2018 Bayesian Deep Learning Workshop

    Inexact Newton Methods for Stochastic Nonconvex Optimization with Applications to Neural Network Training

    We study stochastic inexact Newton methods and consider their application in nonconvex settings. Building on the work of [R. Bollapragada, R. H. Byrd, and J. Nocedal, IMA Journal of Numerical Analysis, 39 (2018), pp. 545--578] we derive bounds for convergence rates in expected value for stochastic low rank Newton methods, and stochastic inexact Newton Krylov methods. These bounds quantify the errors incurred in subsampling the Hessian and gradient, as well as in approximating the Newton linear solve, and in choosing regularization and step length parameters. We deploy these methods in training convolutional autoencoders for the MNIST and CIFAR10 data sets. Numerical results demonstrate that, relative to first order methods, these stochastic inexact Newton methods often converge faster, are more cost-effective, and generalize better.
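
    "Approximating the Newton linear solve" with a Krylov method can be pictured with a truncated conjugate gradient loop that only uses Hessian-vector products. The sketch below is a generic Newton-CG inner solve under assumed names (`truncated_cg`, `damping`, `max_iters`); it does not include the subsampling of the Hessian and gradient that the paper analyzes.

```python
import numpy as np

def truncated_cg(hess_vec, grad, damping=0.0, max_iters=10, tol=1e-8):
    """Approximately solve (H + damping * I) p = -grad with at most max_iters CG steps."""
    p = np.zeros_like(grad)
    r = -grad.copy()              # residual of the linear system at p = 0
    d = r.copy()                  # first search direction
    rs = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs) <= tol:
            break
        Hd = hess_vec(d) + damping * d
        curvature = d @ Hd
        if curvature <= 0:        # non-positive curvature: stop early
            break
        alpha = rs / curvature
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    # Fall back to steepest descent if no CG step was accepted.
    return p if p.any() else -grad

if __name__ == "__main__":
    H = np.array([[4.0, 1.0], [1.0, 3.0]])
    g = np.array([1.0, 2.0])
    print(truncated_cg(lambda v: H @ v, g, max_iters=2))
```

    Capping `max_iters` trades accuracy of the Newton step for cost, which is the kind of error the abstract says its bounds quantify.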

    Neon2: Finding Local Minima via First-Order Oracles

    We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into algorithms finding approximate local minima, outperforming some best known results. Comment: versions 2 and 3 improve writing
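
    The idea of replacing Hessian-vector products with gradient computations can be illustrated by a finite difference of two gradients, followed by a shifted power iteration that looks for a direction of negative curvature. This is a generic sketch under assumed names (`hvp_from_gradients`, `negative_curvature_direction`, `smoothness`, `step`), not the procedure analyzed in the paper.

```python
import numpy as np

def hvp_from_gradients(grad_fn, x, v, step=1e-5):
    """Approximate H(x) @ v with two gradient evaluations."""
    return (grad_fn(x + step * v) - grad_fn(x)) / step

def negative_curvature_direction(grad_fn, x, smoothness, num_iters=50, seed=0):
    """Power iteration on (smoothness * I - H) using gradients only."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        # The most negative eigenvalue of H is the largest of smoothness * I - H.
        v = smoothness * v - hvp_from_gradients(grad_fn, x, v)
        v /= np.linalg.norm(v)
    curvature = v @ hvp_from_gradients(grad_fn, x, v)
    return v, curvature

if __name__ == "__main__":
    A = np.diag([2.0, 1.0, -0.5])   # saddle of f(x) = 0.5 * x^T A x at x = 0
    v, curvature = negative_curvature_direction(lambda x: A @ x, np.zeros(3), smoothness=2.0)
    print(v, curvature)
```

    Here `smoothness` is assumed to upper-bound the largest eigenvalue of the Hessian, so the shifted matrix is positive semidefinite and the iteration is well behaved.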

    Noisy Accelerated Power Method for Eigenproblems with Applications

    This paper introduces an efficient algorithm for finding the dominant generalized eigenvectors of a pair of symmetric matrices. Combining tools from approximation theory and convex optimization, we develop a simple scalable algorithm with strong theoretical performance guarantees. More precisely, the algorithm retains the simplicity of the well-known power method but enjoys the asymptotic iteration complexity of the powerful Lanczos method. Unlike these classic techniques, our algorithm is designed to decompose the overall problem into a series of subproblems that only need to be solved approximately. The combination of good initializations, fast iterative solvers, and appropriate error control in solving the subproblems leads to a linear running time in the input sizes compared to the superlinear time for the traditional methods. The improved running time immediately offers acceleration for several applications. As an example, we demonstrate how the proposed algorithm can be used to accelerate canonical correlation analysis, which is a fundamental statistical tool for learning of a low-dimensional representation of high-dimensional objects. Numerical experiments on real-world data sets confirm that our approach yields significant improvements over the current state-of-the-art. Comment: Accepted for publication in the IEEE Transactions on Signal Processing
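
    For orientation, the plain (unaccelerated, exact-solve) power method for the generalized problem A w = lambda B w is sketched below; it is the baseline whose simplicity the abstract refers to, not the paper's accelerated noisy variant. The sketch assumes B is symmetric positive definite and that the dominant generalized eigenvalue is positive; `generalized_power_method` is a hypothetical name.

```python
import numpy as np

def generalized_power_method(A, B, num_iters=100, seed=0):
    """Dominant generalized eigenpair of (A, B) by power iteration on B^{-1} A."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        # Subproblem: solve B w_new = A w (done exactly here for simplicity).
        w = np.linalg.solve(B, A @ w)
        # Normalize in the B-norm so that w^T B w = 1.
        w /= np.sqrt(w @ B @ w)
    eigenvalue = (w @ A @ w) / (w @ B @ w)
    return eigenvalue, w

if __name__ == "__main__":
    A = np.diag([3.0, 1.0])
    B = np.diag([1.0, 2.0])
    print(generalized_power_method(A, B))
```

    The linear solve inside the loop is the subproblem that, per the abstract, only needs to be solved approximately in the proposed method.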

    Acceleration via Fractal Learning Rate Schedules

    In practical applications of iterative first-order optimization, the learning rate schedule remains notoriously difficult to understand and expensive to tune. We demonstrate the presence of these subtleties even in the innocuous case when the objective is a convex quadratic. We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and show that the problem of mitigating instability leads to a fractal ordering of step sizes. We provide some experiments to challenge conventional beliefs about stable learning rates in deep learning: the fractal schedule enables training to converge with locally unstable updates which make negative progress on the objective. Comment: v2: revisions for ICML 202
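
    A Chebyshev learning rate schedule on a convex quadratic can be sketched by taking step sizes equal to the inverses of the Chebyshev nodes on the eigenvalue interval [m, M]. The sketch below uses the natural ordering of the steps, not the fractal ordering the abstract describes, and the names `chebyshev_step_sizes` and `gradient_descent` are assumptions for illustration.

```python
import numpy as np

def chebyshev_step_sizes(m, M, T):
    """Step sizes 1 / gamma_t, with gamma_t the T Chebyshev nodes on [m, M]."""
    t = np.arange(1, T + 1)
    nodes = (M + m) / 2 - (M - m) / 2 * np.cos((t - 0.5) * np.pi / T)
    return 1.0 / nodes

def gradient_descent(A, x0, step_sizes):
    """Gradient descent on f(x) = 0.5 * x^T A x, whose gradient is A x."""
    x = x0.copy()
    for eta in step_sizes:
        x = x - eta * (A @ x)
    return x

if __name__ == "__main__":
    A = np.diag(np.linspace(1.0, 100.0, 20))
    x0 = np.ones(20)
    x_cheb = gradient_descent(A, x0, chebyshev_step_sizes(1.0, 100.0, 8))
    x_gd = gradient_descent(A, x0, np.full(8, 1.0 / 100.0))
    print("worst remaining error, Chebyshev:", np.max(np.abs(x_cheb)))
    print("worst remaining error, constant 1/M:", np.max(np.abs(x_gd)))
```

    Several of these steps exceed 2/M and are individually unstable along high-curvature directions, yet their product contracts every eigen-direction. With only 8 steps the natural ordering is numerically harmless; the abstract indicates that for longer schedules the ordering of the steps matters for stability.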