7,698 research outputs found
A Variational Perspective on Accelerated Methods in Optimization
Accelerated gradient methods play a central role in optimization, achieving
optimal rates in many settings. While many generalizations and extensions of
Nesterov's original acceleration method have been proposed, it is not yet clear
what the natural scope of the acceleration concept is. In this paper, we study
accelerated methods from a continuous-time perspective. We show that there is a
Lagrangian functional that we call the \emph{Bregman Lagrangian} which
generates a large class of accelerated methods in continuous time, including
(but not limited to) accelerated gradient descent, its non-Euclidean extension,
and accelerated higher-order gradient methods. We show that the continuous-time
limits of all of these methods correspond to traveling the same curve in
spacetime at different speeds. From this perspective, Nesterov's technique and
many of its generalizations can be viewed as a systematic way to go from the
continuous-time curves generated by the Bregman Lagrangian to a family of
discrete-time accelerated algorithms.
Comment: 38 pages. Subsumes an earlier working draft arXiv:1509.0361
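To make the "family of discrete-time accelerated algorithms" concrete, here is a minimal Python sketch of Nesterov's accelerated gradient method on a quadratic, compared against plain gradient descent; the objective, step size, and momentum schedule (k-1)/(k+2) are standard illustrative choices, not constructions taken from the paper:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x with condition number 100
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
L = 100.0  # Lipschitz constant of the gradient
x0 = np.array([1.0, 1.0])

def gradient_descent(x, steps, lr=1.0 / L):
    """Plain gradient descent with step size 1/L."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def nesterov(x, steps, lr=1.0 / L):
    """Nesterov's accelerated gradient with the (k-1)/(k+2) momentum schedule."""
    y, x_prev = x.copy(), x.copy()
    for k in range(1, steps + 1):
        x_next = y - lr * grad(y)                            # gradient step at y
        y = x_next + (k - 1) / (k + 2) * (x_next - x_prev)   # momentum step
        x_prev = x_next
    return x_prev
```

On ill-conditioned problems like this one, the accelerated iterate reaches a markedly lower function value than gradient descent after the same number of steps, reflecting the O(1/k^2) versus O(1/k) rates.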
Discrete variational calculus for accelerated optimization
Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive different methods for optimization. Using both Hamilton’s and Lagrange-d’Alembert’s principles, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter, reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the fibers. Several experiments exemplify the results.
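One way to see the geometric-integration idea in action is to discretize the damped "heavy ball" ODE x'' + γx' + ∇f(x) = 0 with semi-implicit (symplectic-style) Euler, which recovers a Polyak-type momentum method. The following Python sketch uses an illustrative quadratic, step size, and damping; it is a generic discretization of this kind, not the paper's variational integrators:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x; gradient is A @ x
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

def symplectic_euler_heavy_ball(x0, steps, h=0.1, gamma=3.0):
    """Semi-implicit Euler for x'' + gamma x' + grad f(x) = 0:
    update the velocity first, then the position with the *new* velocity.
    Unrolling the two updates gives
        x_{k+1} = x_k + beta (x_k - x_{k-1}) - alpha grad f(x_k)
    with beta = 1 - h*gamma and alpha = h^2, i.e. a heavy-ball step."""
    x = x0.copy()
    v = np.zeros_like(x0)
    for _ in range(steps):
        v = v - h * (gamma * v + grad(x))  # velocity (momentum) update
        x = x + h * v                      # position update uses the new v
    return x
```

The momentum-first ordering is what distinguishes this from explicit Euler: each position update uses the freshly updated velocity, the structure-preserving pattern that geometric integrators build on.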
A Discrete Variational Derivation of Accelerated Methods in Optimization
Many of the new developments in machine learning are connected with
gradient-based optimization methods. Recently, these methods have been studied
using a variational perspective. This has opened up the possibility of
introducing variational and symplectic methods using geometric integration. In
particular, in this paper, we introduce variational integrators which allow us
to derive different methods for optimization. Using both Hamilton's and
Lagrange-d'Alembert's principles, we derive two families of optimization
methods in one-to-one correspondence that generalize Polyak's heavy ball and
the well-known Nesterov accelerated gradient method, the second of which
mimics the behavior of the first, reducing the oscillations of
classical momentum methods. However, since the systems considered are
explicitly time-dependent, the preservation of symplecticity of autonomous
systems occurs here solely on the fibers. Several experiments exemplify the
results.
Comment: 28 pages, 11 figures
On the Convergence of (Stochastic) Gradient Descent with Extrapolation for Non-Convex Optimization
Extrapolation is a well-known technique for solving convex optimization
problems and variational inequalities, and it has recently attracted attention
for non-convex optimization. Several recent works have empirically shown its
success in some machine learning tasks. However, it has not been analyzed for
non-convex minimization, and a gap remains between the theory and the practice.
In this paper, we analyze gradient descent and stochastic gradient descent with
extrapolation for finding an approximate first-order stationary point in smooth
non-convex optimization problems. Our convergence upper bounds show that the
algorithms with extrapolation converge faster than their counterparts without
extrapolation …
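A generic sketch of the extrapolation idea for smooth minimization: evaluate the gradient at an extrapolated point y_k = x_k + β(x_k − x_{k−1}) rather than at x_k itself. The objective, step size, and extrapolation weight below are illustrative assumptions, not the constants analyzed in the paper:

```python
import numpy as np

# Illustrative smooth objective f(x) = 0.5 x^T A x; gradient is A @ x
A = np.diag([1.0, 5.0])
grad = lambda x: A @ x

def gd_with_extrapolation(x0, steps, lr=0.1, beta=0.3):
    """Gradient descent with extrapolation: the gradient is taken at the
    extrapolated point y = x + beta * (x - x_prev) instead of at x."""
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(steps):
        y = x + beta * (x - x_prev)        # extrapolation step
        x_prev, x = x, x - lr * grad(y)    # gradient step at the extrapolated point
    return x
```

The stochastic variant analyzed in such works replaces grad(y) with a noisy estimate; the deterministic loop above only illustrates the shape of the update.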