From Averaging to Acceleration, There is Only a Step-size
We show that accelerated gradient descent, averaged gradient descent and the
heavy-ball method for non-strongly-convex problems may be reformulated as
constant-parameter second-order difference equation algorithms, where
stability of the system is equivalent to convergence at rate O(1/n^2), where
n is the number of iterations. We provide a detailed analysis of the
eigenvalues of the corresponding linear dynamical system, showing various
oscillatory and non-oscillatory behaviors, together with a sharp stability
result with explicit constants. We also consider the situation where noisy
gradients are available, where we extend our general convergence result,
which suggests an alternative algorithm (i.e., with different step sizes)
that exhibits the good aspects of both averaging and acceleration.
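The second-order difference equation view can be sketched on a least-squares objective, with the heavy-ball recursion x_{n+1} = x_n - step * grad(x_n) + momentum * (x_n - x_{n-1}); the matrix, step size and momentum value below are illustrative choices, not the paper's tuned parameters:

```python
import numpy as np

def heavy_ball(grad, x0, step, momentum, n_iters):
    """Heavy-ball method, written as a second-order difference equation:
    x_{n+1} = x_n - step * grad(x_n) + momentum * (x_n - x_{n-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x_next = x - step * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative least-squares objective f(x) = 0.5 * ||A x - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of the gradient
x = heavy_ball(grad, np.zeros(5), step=1.0 / L, momentum=0.9, n_iters=2000)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))   # distance to the least-squares solution
```

With a step size and momentum inside the stability region, the recursion contracts linearly toward the least-squares solution, matching the stability-implies-convergence picture of the abstract.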
Local Component Analysis
Kernel density estimation, a.k.a. Parzen windows, is a popular density
estimation method, which can be used for outlier detection or clustering. With
multivariate data, its performance is heavily reliant on the metric used within
the kernel. Most earlier work has focused on learning only the bandwidth of the
kernel (i.e., a scalar multiplicative factor). In this paper, we propose to
learn a full Euclidean metric through an expectation-maximization (EM)
procedure, which can be seen as an unsupervised counterpart to neighbourhood
component analysis (NCA). In order to avoid overfitting with a fully
nonparametric density estimator in high dimensions, we also consider a
semi-parametric Gaussian-Parzen density model, where some of the variables are
modelled through a jointly Gaussian density, while others are modelled through
Parzen windows. For these two models, EM leads to simple closed-form updates
based on matrix inversions and eigenvalue decompositions. We show empirically
that our method leads to density estimators with higher test-likelihoods than
natural competing methods, and that the metrics may be used within most
unsupervised learning techniques that rely on such metrics, such as spectral
clustering or manifold learning methods. Finally, we present a stochastic
approximation scheme which allows for the use of this method in a large-scale
setting.
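The starting point can be sketched as a Parzen-window density whose Gaussian kernel uses a full metric M (a precision matrix) rather than a scalar bandwidth; here M is a fixed illustrative matrix, not the metric learned by the paper's EM procedure:

```python
import numpy as np

def parzen_log_density(x, data, M):
    """Log of the Parzen estimate p(x) = (1/n) sum_i N(x; x_i, M^{-1}),
    where the precision matrix M plays the role of a full metric
    generalizing the usual scalar bandwidth."""
    n, d = data.shape
    diffs = data - x                                   # (n, d)
    quad = np.einsum('ni,ij,nj->n', diffs, M, diffs)   # Mahalanobis distances
    _, logdet = np.linalg.slogdet(M)
    log_kernels = 0.5 * logdet - 0.5 * d * np.log(2 * np.pi) - 0.5 * quad
    m = log_kernels.max()                              # log-sum-exp trick
    return m + np.log(np.exp(log_kernels - m).sum()) - np.log(n)

rng = np.random.default_rng(1)
data = rng.standard_normal((200, 2))
M = np.array([[4.0, 0.0], [0.0, 1.0]])   # illustrative fixed metric
logp = parzen_log_density(np.zeros(2), data, M)
print(logp)
```

Learning M (instead of fixing it, as above) is exactly what the proposed EM procedure does; the log-sum-exp computation keeps the estimate numerically stable in higher dimensions.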
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose
gradients are only accessible through a stochastic oracle that returns the
gradient at any given point plus a zero-mean finite variance random error. We
present the first algorithm that achieves jointly the optimal prediction error
rates for least-squares regression, both in terms of forgetting of initial
conditions in O(1/n^2), and in terms of dependence on the noise and dimension d
of the problem, as O(d/n). Our new algorithm is based on averaged accelerated
regularized gradient descent, and may also be analyzed through finer
assumptions on initial conditions and the Hessian matrix, leading to
dimension-free quantities that may still be small while the "optimal" terms
above are large. In order to characterize the tightness of these new bounds, we
consider an application to non-parametric regression and use the known lower
bounds on the statistical performance (without computational limits), which
happen to match our bounds obtained from a single pass on the data and thus
show optimality of our algorithm in a wide variety of particular trade-offs
between bias and variance.
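The averaging component can be sketched as Polyak-Ruppert averaging of constant-step-size stochastic gradient iterates on a least-squares model; this omits the acceleration and regularization that the paper's algorithm combines with it, and all constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20000
theta_star = rng.standard_normal(d)

def noisy_gradient(theta):
    """Stochastic oracle: the gradient of one random sample's squared loss,
    an unbiased estimate of the gradient of 0.5 * E[(x^T theta - y)^2]."""
    x = rng.standard_normal(d)
    y = x @ theta_star + 0.1 * rng.standard_normal()
    return (x @ theta - y) * x

theta = np.zeros(d)
theta_bar = np.zeros(d)
step = 0.02                 # constant step size; averaging absorbs the noise
for k in range(n):
    theta -= step * noisy_gradient(theta)
    theta_bar += (theta - theta_bar) / (k + 1)   # running average of iterates

print(np.linalg.norm(theta_bar - theta_star))
```

For quadratic objectives the averaged iterate converges to the optimum even with a constant step; the abstract's contribution is combining this variance behavior with accelerated forgetting of the initial conditions.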
Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
We consider the problem of optimizing the sum of a smooth convex function and
a non-smooth convex function using proximal-gradient methods, where an error is
present in the calculation of the gradient of the smooth term or in the
proximity operator with respect to the non-smooth term. We show that both the
basic proximal-gradient method and the accelerated proximal-gradient method
achieve the same convergence rate as in the error-free case, provided that the
errors decrease at appropriate rates. Using these rates, we perform as well as
or better than a carefully chosen fixed error level on a set of structured
sparsity problems.
Comment: Neural Information Processing Systems (2011)
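In the error-free case, the basic proximal-gradient iteration for an l1-regularized least-squares instance can be sketched as follows; the exact soft-thresholding prox stands in for the possibly inexact proximity operator the paper analyzes, and the problem sizes and regularization weight are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, n_iters=500):
    """Basic proximal-gradient for 0.5 * ||A x - b||^2 + lam * ||x||_1:
    a gradient step on the smooth term, then the prox of the non-smooth term."""
    step = 1.0 / np.linalg.norm(A.T @ A, 2)   # 1/L for the smooth term
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = proximal_gradient(A, b, lam=0.5)
print(np.count_nonzero(np.abs(x_hat) > 1e-6))   # number of nonzero coefficients
```

The paper's question is what happens when the gradient step or the prox in the loop above is computed only approximately; its result is that the error-free rates survive provided those errors decay fast enough.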
Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly-convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p < 1. Further, in many cases the convergence rate of the new method
is also faster than black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.
Comment: Revision from January 2015 submission. Major changes: updated
literature review and discussion of subsequent work, an additional lemma
showing the validity of one of the formulas, a somewhat simplified
presentation of the Lyapunov bound, inclusion of the code needed for checking
proofs rather than the polynomials generated by the code, and error regions
added to the numerical experiments.
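The memory of previous gradient values can be sketched as follows for a finite sum of least-squares terms (a hypothetical instance; the stored-gradient table and the O(d) update of the running sum are the essential ingredients):

```python
import numpy as np

def sag(X, y, step, n_epochs, seed=4):
    """SAG for f(w) = 0.5 * mean_i (x_i^T w - y_i)^2: keep the last gradient
    seen for every term, refresh one entry per iteration, and step along the
    average of all stored gradients."""
    n, d = X.shape
    grads = np.zeros((n, d))   # memory of the last gradient of each term
    g_sum = np.zeros(d)        # running sum of the stored gradients
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i]
        g_sum += g_new - grads[i]          # O(d) update of the sum
        grads[i] = g_new
        w -= step * g_sum / n
    return w

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = X @ w_true                  # noiseless, so the optimum is w_true
w = sag(X, y, step=0.01, n_epochs=300)
print(np.linalg.norm(w - w_true))
```

Each iteration touches one data point, like SG, yet the direction averages gradients over all terms, which is what enables the linear rate on strongly convex sums.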
Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods
Our goal is to improve variance reducing stochastic methods through better
control variates. We first propose a modification of SVRG which uses the
Hessian to track gradients over time, rather than to recondition, increasing
the correlation of the control variates and leading to faster theoretical
convergence close to the optimum. We then propose accurate and computationally
efficient approximations to the Hessian, both using a diagonal and a low-rank
matrix. Finally, we demonstrate the effectiveness of our method on a wide range
of problems.
Comment: 17 pages, 2 figures, 1 table
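For reference, the baseline being modified is SVRG, whose control variate is the stored gradient at a snapshot point; a plain sketch on a least-squares sum follows (the paper's variant replaces this control variate with one that tracks gradients using Hessian information, which is not implemented here):

```python
import numpy as np

def svrg(X, y, step, n_outer, m, seed=6):
    """Plain SVRG for f(w) = 0.5 * mean_i (x_i^T w - y_i)^2. Each inner step
    uses the control variate g_i(w_snap) - full_grad(w_snap), which keeps the
    update unbiased while shrinking its variance near the snapshot."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_outer):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n        # anchor gradient
        for _ in range(m):
            i = rng.integers(n)
            g = (X[i] @ w - y[i]) * X[i]
            g_snap = (X[i] @ w_snap - y[i]) * X[i]
            w -= step * (g - g_snap + full_grad)      # variance-reduced step
    return w

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = X @ w_true                  # noiseless, so the optimum is w_true
w = svrg(X, y, step=0.005, n_outer=30, m=2000)
print(np.linalg.norm(w - w_true))
```

The correlation between g and g_snap decays as the iterate drifts from the snapshot; the paper's Hessian-tracked control variate is designed to keep that correlation high between snapshots.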
EPRDF’s Nation-Building: tinkering with convictions and pragmatism
The Ethio-Eritrean war (1998-2000) is often considered a turning point in the nationalist discourse of the Ethiopian People’s Revolutionary Democratic Front (EPRDF) and the main cause of the reactivation of a strong Pan-Ethiopian nationalism (here taken as synonymous with Ethiopianness), after the introduction of “ethnic federalism” in 1995. This paper argues that Pan-Ethiopian and “ethnic” nationalism coexisted in TPLF-EPRDF’s nationalism before the 1998-2000 war. As a political and pragmatic tool to grasp and keep power, the “multifaceted” nationalism of the EPRDF was adapted and adjusted to new circumstances. This explains the ease with which Pan-Ethiopianism was reactivated and reinvented from 1998 onwards. In this process, the 2005 general elections and the rise of opposition groups defending a Pan-Ethiopian nationalism also represented an important influence in EPRDF’s nationalist adjustment.
The Power of Dynastic Commitment
We study how, at times of CEO transitions, the identity of the CEO successor shapes labor contracts within family firms. We propose an alternative view of how family management might underperform relative to external management in family firms. The idea developed in this paper is that, in contrast to external professionals, CEOs promoted from within the family not only inherit control of the firm but also inherit a set of implicit contracts that affects their ability to restructure the firm. Consistent with our dynastic commitment hypothesis, we find that family-promoted CEOs are associated with lower turnover of the workforce, lower wage renegotiation, and greater loyalty from the incumbent workforce.
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
We propose a new stochastic gradient method for optimizing the sum of a
finite set of smooth functions, where the sum is strongly convex. While
standard stochastic gradient methods converge at sublinear rates for this
problem, the proposed method incorporates a memory of previous gradient values
in order to achieve a linear convergence rate. In a machine learning context,
numerical experiments indicate that the new algorithm can dramatically
outperform standard algorithms, both in terms of optimizing the training error
and reducing the test error quickly.
Comment: The notable changes over the current version: a worked example of
convergence rates showing SAG can be faster than first-order methods;
pointing out that the storage cost is O(n) for linear models; the more-stable
line search; comparison to additional optimal SG methods; and comparison to
rates of coordinate descent methods in the quadratic case.
- …