
    From Averaging to Acceleration, There is Only a Step-size

    We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant-parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n^2), where n is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.
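    The common template behind these methods is a constant-coefficient two-term recursion in the iterates. The snippet below is a minimal sketch of such a second-order difference equation on a toy quadratic, not the paper's analysis; the step size gamma = 1/L and momentum beta = 0.9 are illustrative choices, not values taken from the paper.

```python
import numpy as np

def second_order_iteration(grad, x0, gamma, beta, n_iters):
    """Constant-coefficient second-order recursion
    x_{k+1} = x_k - gamma * grad(x_k) + beta * (x_k - x_{k-1}).
    Heavy-ball, accelerated and averaged gradient descent all fit this
    template for particular choices of (gamma, beta)."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x_next = x - gamma * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Toy quadratic f(x) = 0.5 * x^T A x - b^T x (illustrative only).
A = np.diag([1.0, 0.1])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
L = np.max(np.linalg.eigvalsh(A))   # smoothness constant of f
x_final = second_order_iteration(grad, np.zeros(2), gamma=1.0 / L, beta=0.9, n_iters=200)
```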

    Local Component Analysis

    Kernel density estimation, a.k.a. Parzen windows, is a popular density estimation method, which can be used for outlier detection or clustering. With multivariate data, its performance is heavily reliant on the metric used within the kernel. Most earlier work has focused on learning only the bandwidth of the kernel (i.e., a scalar multiplicative factor). In this paper, we propose to learn a full Euclidean metric through an expectation-maximization (EM) procedure, which can be seen as an unsupervised counterpart to neighbourhood component analysis (NCA). In order to avoid overfitting with a fully nonparametric density estimator in high dimensions, we also consider a semi-parametric Gaussian-Parzen density model, where some of the variables are modelled through a jointly Gaussian density, while others are modelled through Parzen windows. For these two models, EM leads to simple closed-form updates based on matrix inversions and eigenvalue decompositions. We show empirically that our method leads to density estimators with higher test-likelihoods than natural competing methods, and that the metrics may be used within most unsupervised learning techniques that rely on such metrics, such as spectral clustering or manifold learning methods. Finally, we present a stochastic approximation scheme which allows for the use of this method in a large-scale setting.
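    As a rough illustration of the EM idea, the sketch below learns a full kernel covariance (equivalently, a Euclidean metric) for a Parzen-window density by alternating leave-one-out responsibilities and a closed-form covariance update. It is a reconstruction from the abstract, not the paper's exact updates; the iteration count and ridge term are arbitrary stabilizing choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def metric_parzen_em(X, n_iters=20, ridge=1e-6):
    """Learn a full covariance Sigma for a Parzen-window density
    p(x) = (1/n) sum_j N(x | x_j, Sigma) by EM on the leave-one-out
    log-likelihood (sketch only)."""
    n, d = X.shape
    Sigma = np.cov(X, rowvar=False) + ridge * np.eye(d)
    for _ in range(n_iters):
        # E-step: leave-one-out responsibilities r[i, j] proportional to N(x_i | x_j, Sigma).
        R = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    R[i, j] = multivariate_normal.pdf(X[i], mean=X[j], cov=Sigma)
            R[i] /= R[i].sum()
        # M-step: closed-form weighted covariance of pairwise differences.
        Sigma = ridge * np.eye(d)
        for i in range(n):
            for j in range(n):
                diff = X[i] - X[j]
                Sigma += R[i, j] * np.outer(diff, diff) / n
    return Sigma  # Sigma^{-1} then plays the role of the learned metric
```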

    Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in O(1/n^2), and in terms of dependence on the noise and dimension d of the problem, as O(d/n). Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
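    A hedged sketch of the general recipe (acceleration driven by stochastic gradients, combined with averaging of the iterates) is given below for a least-squares objective. It does not reproduce the paper's specific step-size and momentum schedule; the choice gamma = 1/R^2, with R^2 bounding the squared feature norms, is a hypothetical setting used only to make the example run.

```python
import numpy as np

def averaged_accelerated_sgd(X, y, gamma, n_passes=1):
    """Nesterov-style two-sequence recursion driven by single-sample
    least-squares gradients, returning the running (Polyak-Ruppert)
    average of the iterates (sketch only)."""
    n, d = X.shape
    theta = np.zeros(d)        # main iterate
    theta_prev = np.zeros(d)
    theta_bar = np.zeros(d)    # running average
    t = 0
    for _ in range(n_passes):
        for i in np.random.permutation(n):
            t += 1
            momentum = (t - 1) / (t + 2)             # standard accelerated weight
            z = theta + momentum * (theta - theta_prev)
            g = (X[i] @ z - y[i]) * X[i]             # stochastic gradient at z
            theta_prev, theta = theta, z - gamma * g
            theta_bar += (theta - theta_bar) / t     # online average of iterates
    return theta_bar

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)); w = rng.standard_normal(5)
y = X @ w + 0.1 * rng.standard_normal(1000)
R2 = np.max(np.sum(X ** 2, axis=1))                  # bound on squared feature norms
w_hat = averaged_accelerated_sgd(X, y, gamma=1.0 / R2)
```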

    Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

    We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. Comment: Neural Information Processing Systems (2011).
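    The basic inexact iteration can be sketched as follows, assuming only a gradient error e_k per step; the error schedule error_at(k) shrinking like 1/k^2 and the lasso-type problem are hypothetical choices for illustration, not the paper's experiments.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_proximal_gradient(grad, prox, x0, step, n_iters, error_at):
    """Basic inexact proximal-gradient sketch:
    x_{k+1} = prox_{step*g}(x_k - step * (grad(x_k) + e_k)),
    where e_k is the gradient error at iteration k."""
    x = x0.copy()
    for k in range(n_iters):
        e_k = error_at(k)                            # error assumed to decrease with k
        x = prox(x - step * (grad(x) + e_k), step)
    return x

# Illustrative lasso-type usage: f(x) = 0.5 * ||Ax - b||^2, g(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)); b = rng.standard_normal(50); lam = 0.1
L = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of grad f
grad = lambda x: A.T @ (A @ x - b)
prox = lambda v, s: soft_threshold(v, s * lam)
error = lambda k: rng.standard_normal(20) / (k + 1) ** 2   # summable error norms
x_hat = inexact_proximal_gradient(grad, prox, np.zeros(20), 1.0 / L, 300, error)
```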

    Minimizing Finite Sums with the Stochastic Average Gradient

    We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies. Comment: Revision from January 2015 submission. Major changes: updated literature review and discussion of subsequent work, additional Lemma showing the validity of one of the formulas, somewhat simplified presentation of Lyapunov bound, included code needed for checking proofs rather than the polynomials generated by the code, added error regions to the numerical experiments.
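    The core of the method is the gradient memory: one stored gradient per term, replaced one at a time, with the step taken along the average of the stored gradients. Below is a minimal sketch of that update; the step size 1/(16 * L_max) in the usage example is an illustrative constant, not a recommendation from the paper.

```python
import numpy as np

def sag(grad_i, n, x0, step, n_iters, rng=None):
    """Stochastic average gradient sketch: maintain the most recent
    gradient of every term and step along their average."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    d = x0.shape[0]
    memory = np.zeros((n, d))      # stored gradient for each term
    g_sum = np.zeros(d)            # running sum of stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = grad_i(i, x)
        g_sum += g_new - memory[i]  # refresh only term i
        memory[i] = g_new
        x = x - step * g_sum / n
    return x

# Illustrative usage: least-squares terms f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)); b = rng.standard_normal(200)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
L_max = np.max(np.sum(A ** 2, axis=1))              # max per-term Lipschitz constant
x_sag = sag(grad_i, 200, np.zeros(10), step=1.0 / (16 * L_max), n_iters=5000)
```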

    Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods

    Our goal is to improve variance reducing stochastic methods through better control variates. We first propose a modification of SVRG which uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximations to the Hessian, both using a diagonal and a low-rank matrix. Finally, we demonstrate the effectiveness of our method on a wide range of problems. Comment: 17 pages, 2 figures, 1 table.
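    The following is a schematic reading of the abstract, not the authors' algorithm as published: an SVRG-style loop whose control variate for term i is grad_i(x_snap) + H_i (x - x_snap) rather than grad_i(x_snap) alone, using a diagonal Hessian approximation for simplicity. Function names and arguments are hypothetical.

```python
import numpy as np

def svrg_hessian_tracking(grad_i, hess_diag_i, n, x0, step, n_outer, m_inner, rng=None):
    """SVRG-style variance reduction where the snapshot Hessian (diagonal
    approximation) is used to track how each stored gradient moves with x."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_outer):
        x_snap = x.copy()
        g_i_snap = [grad_i(i, x_snap) for i in range(n)]        # per-term gradients at snapshot
        h_i_snap = [hess_diag_i(i, x_snap) for i in range(n)]   # per-term diagonal Hessians
        g_snap = np.mean(g_i_snap, axis=0)
        h_snap = np.mean(h_i_snap, axis=0)
        for _ in range(m_inner):
            i = rng.integers(n)
            delta = x - x_snap
            variate = g_i_snap[i] + h_i_snap[i] * delta          # tracked control variate for term i
            mean_variate = g_snap + h_snap * delta               # its exact mean over terms
            x = x - step * (grad_i(i, x) - variate + mean_variate)
    return x

# For least-squares terms f_i(x) = 0.5 * (a_i^T x - b_i)^2 one could pass
# grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i] and
# hess_diag_i = lambda i, x: A[i] ** 2   (diagonal of a_i a_i^T).
```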

    EPRDF’s Nation-Building: tinkering with convictions and pragmatism

    The Ethio-Eritrean war (1998-2000) is often considered a turning point in the nationalist discourse of the Ethiopian People’s Revolutionary Democratic Front (EPRDF) and the main cause of the reactivation of a strong Pan-Ethiopian nationalism (here taken as synonymous with Ethiopianness), after the introduction of “ethnic federalism” in 1995. This paper argues that Pan-Ethiopian and “ethnic” nationalism coexisted in TPLF-EPRDF’s nationalism before the 1998-2000 war. As a political and pragmatic tool to grasp and keep power, the “multifaceted” nationalism of the EPRDF was adapted and adjusted to new circumstances. This explains the ease with which Pan-Ethiopianism was reactivated and reinvented from 1998 onwards. In this process, the 2005 general elections and the rise of opposition groups defending a Pan-Ethiopian nationalism also represented an important influence on EPRDF’s nationalist adjustment.

    The Power of Dynastic Commitment

    We study how, at times of CEO transitions, the identity of the CEO successor shapes labor contracts within family firms. We propose an alternative view of how family management might underperform relative to external management in family firms. The idea developed in this paper is that, in contrast to external professionals, CEOs promoted from within the family not only inherit control of the firm but also inherit a set of implicit contracts that affects their ability to restructure the firm. Consistent with our dynastic commitment hypothesis, we find that family-promoted CEOs are associated with lower turnover of the workforce, lower wage renegotiation, and greater loyalty from the incumbent workforce.

    A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

    We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. Comment: The notable changes over the current version: - worked example of convergence rates showing SAG can be faster than first-order methods - pointing out that the storage cost is O(n) for linear models - the more-stable line-search - comparison to additional optimal SG methods - comparison to rates of coordinate descent methods in quadratic case.
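    The O(n) storage point mentioned in the comment has a simple concrete form, sketched below under the assumption of a linear model f_i(x) = loss(a_i^T x, b_i): the gradient of term i is loss'(a_i^T x, b_i) * a_i, so only one scalar per example needs to be remembered instead of a full d-dimensional gradient. The function name, step size, and data are illustrative.

```python
import numpy as np

def sag_linear_model(loss_grad, A, b, step, n_iters, rng=None):
    """SAG-style memory for linear models: store one scalar loss'(.) per
    example and rebuild the gradient sum incrementally (sketch only)."""
    rng = rng or np.random.default_rng(0)
    n, d = A.shape
    x = np.zeros(d)
    scalars = np.zeros(n)          # one stored scalar per example
    g_sum = np.zeros(d)            # sum of stored gradients, updated incrementally
    for _ in range(n_iters):
        i = rng.integers(n)
        s_new = loss_grad(A[i] @ x, b[i])
        g_sum += (s_new - scalars[i]) * A[i]
        scalars[i] = s_new
        x = x - step * g_sum / n
    return x

# Illustrative usage with the logistic loss: d loss(z, y) / dz = -y / (1 + exp(y * z)).
logistic_grad = lambda z, y: -y / (1.0 + np.exp(y * z))
rng = np.random.default_rng(2)
A = rng.standard_normal((300, 8))
b = np.sign(rng.standard_normal(300))                 # +/- 1 labels
x_hat = sag_linear_model(logistic_grad, A, b, step=0.05, n_iters=3000)  # step chosen arbitrarily
```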