Stochastic Chebyshev Gradient Descent for Spectral Optimization
A large class of machine learning techniques requires the solution of
optimization problems involving spectral functions of parametric matrices, e.g.
log-determinant and nuclear norm. Unfortunately, computing the gradient of a
spectral function generally has cubic complexity, so gradient descent methods
are rather expensive for optimizing objectives involving spectral functions.
Thus, one naturally turns to stochastic gradient methods in the hope that
they will provide a way to reduce or altogether avoid the computation of full
gradients. However, here a new challenge appears: there is no straightforward
way to compute unbiased stochastic gradients for spectral functions. In this
paper, we develop unbiased stochastic gradients for spectral-sums, an important
subclass of spectral functions. Our unbiased stochastic gradients are based on
combining randomized trace estimators with stochastic truncation of the
Chebyshev expansions. A careful design of the truncation distribution allows us
to offer distributions that are variance-optimal, which is crucial for fast and
stable convergence of stochastic gradient methods. We further leverage our
proposed stochastic gradients to devise stochastic methods for objective
functions involving spectral-sums, and rigorously analyze their convergence
rate. The utility of our methods is demonstrated in numerical experiments
Nonlinear Acceleration of Momentum and Primal-Dual Algorithms
We describe convergence acceleration schemes for multistep optimization
algorithms. The extrapolated solution is written as a nonlinear average of the
iterates produced by the original optimization method. Our analysis does not
need the underlying fixed-point operator to be symmetric, hence handles e.g.
algorithms with momentum terms such as Nesterov's accelerated method, or
primal-dual methods. The weights are computed via a simple linear system and we
analyze performance in both online and offline modes. We use Crouzeix's
conjecture to show that acceleration performance is controlled by the solution
of a Chebyshev problem on the numerical range of a non-symmetric operator
modeling the behavior of iterates near the optimum. Numerical experiments are
detailed on logistic regression problems.
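A minimal version of the extrapolation step described above (weights obtained from one small linear system, applied to iterates produced with momentum) might look like the following. The regularization constant, the heavy-ball parameters in the usage snippet, and the function name are assumptions, and none of the Crouzeix-based analysis is reflected here.

```python
import numpy as np

def nonlinear_acceleration(iterates, reg=1e-8):
    """Extrapolate a sequence of iterates as a weighted (nonlinear) average:
    the weights sum to one and minimize the norm of the combined residuals,
    which reduces to a single small regularized linear solve."""
    X = np.asarray(iterates)                         # rows x_0, ..., x_k
    R = np.diff(X, axis=0)                           # residuals r_i = x_{i+1} - x_i
    G = R @ R.T
    G = G + reg * np.trace(G) * np.eye(G.shape[0])   # regularized Gram matrix
    w = np.linalg.solve(G, np.ones(G.shape[0]))
    c = w / w.sum()                                  # weights constrained to sum to one
    return c @ X[1:]

# Hypothetical usage: extrapolating heavy-ball (momentum) iterates on a quadratic.
rng = np.random.default_rng(0)
H = np.diag(np.linspace(0.1, 10.0, 50))
x = rng.normal(size=50)
x_prev = x.copy()
history = []
for _ in range(10):
    x, x_prev = x - 0.05 * (H @ x) + 0.5 * (x - x_prev), x
    history.append(x.copy())
# The extrapolated point is typically closer to the optimum (the origin).
print(np.linalg.norm(history[-1]), np.linalg.norm(nonlinear_acceleration(history)))
```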
Faster randomized block Kaczmarz algorithms
The Kaczmarz algorithm is a simple iterative scheme for solving consistent
linear systems. At each step, the method projects the current iterate onto the
solution space of a single constraint. Hence, it has a very low per-iteration
cost and storage requirement, and it has a linear rate of convergence. Distributed
implementations of Kaczmarz have become, in recent years, the de facto
architectural choice for large-scale linear systems. Therefore, in this paper
we develop a family of randomized block Kaczmarz algorithms that use at each
step a subset of the constraints and extrapolated stepsizes, and can be
deployed on distributed computing units. Our approach is based on several new
ideas and tools, including a stochastic selection rule for the blocks of rows,
stochastic conditioning of the linear system, and novel strategies for
designing extrapolated stepsizes. We prove that the randomized block Kaczmarz
algorithm converges linearly in expectation, with a rate depending on the
geometric properties of the matrix and its submatrices and on the size of the
blocks. Our convergence analysis reveals that the algorithm is most effective
when it is given a good sampling of the rows into well-conditioned blocks.
Besides providing a general framework for the design and analysis of randomized
block Kaczmarz methods, our results resolve an open problem in the literature
related to the theoretical understanding of observed practical efficiency of
extrapolated block Kaczmarz methods. Comment: 20 pages.
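For concreteness, a bare-bones randomized block Kaczmarz iteration with a fixed (possibly extrapolated, i.e. larger than one) stepsize is sketched below. The block sampling, the stepsize value, and the function name are placeholders; the paper's stochastic conditioning and adaptive extrapolation rules are not reproduced.

```python
import numpy as np

def randomized_block_kaczmarz(A, b, block_size=10, n_iters=300, stepsize=1.0, rng=None):
    """At each step, project the iterate onto the solution space of a random
    block of rows; stepsize > 1 corresponds to an extrapolated (over-relaxed)
    projection."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iters):
        S = rng.choice(m, size=block_size, replace=False)
        A_S, b_S = A[S], b[S]
        x = x + stepsize * np.linalg.pinv(A_S) @ (b_S - A_S @ x)
    return x

# Hypothetical usage on a consistent system A x* = b.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 50))
x_true = rng.normal(size=50)
x_hat = randomized_block_kaczmarz(A, A @ x_true, stepsize=1.2, rng=rng)
print(np.linalg.norm(x_hat - x_true))     # small error on this consistent system
```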
Stability and Convergence Trade-off of Iterative Optimization Algorithms
The overall performance or expected excess risk of an iterative machine
learning algorithm can be decomposed into training error and generalization
error. While the former is controlled by its convergence analysis, the latter
can be tightly handled by algorithmic stability. The machine learning community
has a rich history investigating convergence and stability separately. However,
the question about the trade-off between these two quantities remains open. In
this paper, we show that for any iterative algorithm at any iteration, the
overall performance is lower bounded by the minimax statistical error over an
appropriately chosen loss function class. This implies an important trade-off
between convergence and stability of the algorithm -- a faster converging
algorithm has to be less stable, and vice versa. As a direct consequence of
this fundamental trade-off, new convergence lower bounds can be derived for
classes of algorithms constrained with different stability bounds. In
particular, when the loss function is convex (or strongly convex) and smooth,
we discuss the stability upper bounds of gradient descent (GD) and stochastic
gradient descent and their variants with decreasing step sizes. For Nesterov's
accelerated gradient descent (NAG) and heavy ball method (HB), we provide
stability upper bounds for the quadratic loss function. Applying existing
stability upper bounds for the gradient methods in our trade-off framework, we
obtain lower bounds matching the well-established convergence upper bounds up
to constants for these algorithms and conjecture similar lower bounds for NAG
and HB. Finally, we numerically demonstrate the tightness of our stability
bounds in terms of exponents in the rate and also illustrate via a simulated
logistic regression problem that our stability bounds reflect the
generalization errors better than the simple uniform convergence bounds for GD
and NAG. Comment: 45 pages, 7 figures.
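The stability side of this trade-off can be probed empirically: run the same algorithm on two datasets that differ in a single example and track how far the two trajectories drift apart. The sketch below does this for gradient descent on logistic regression; the synthetic data, step size, and function names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def gd_stability_gap(X, y, swap_idx, lr=0.1, n_iters=200):
    """Run GD for logistic regression on two datasets differing in one example
    and return the parameter gap ||w - w'|| after each iteration.  The gap
    generally grows with training time, which is the stability cost of running
    the algorithm longer (i.e. converging further)."""
    def grad(w, X, y):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)
    X2, y2 = X.copy(), y.copy()
    X2[swap_idx], y2[swap_idx] = -X2[swap_idx], 1.0 - y2[swap_idx]   # perturb one example
    w1, w2, gaps = np.zeros(X.shape[1]), np.zeros(X.shape[1]), []
    for _ in range(n_iters):
        w1 -= lr * grad(w1, X, y)
        w2 -= lr * grad(w2, X2, y2)
        gaps.append(np.linalg.norm(w1 - w2))
    return gaps

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) + 0.1 * rng.normal(size=200) > 0).astype(float)
gaps = gd_stability_gap(X, y, swap_idx=0)
print(gaps[10], gaps[-1])      # typically the gap after 200 steps exceeds the gap after 10
```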
NEON+: Accelerated Gradient Methods for Extracting Negative Curvature for Non-Convex Optimization
Accelerated gradient (AG) methods are breakthroughs in convex optimization,
improving the convergence rate of the gradient descent method for optimization
with smooth functions. However, the analysis of AG methods for non-convex
optimization is still limited. It remains an open question whether AG methods
from convex optimization can accelerate the convergence of the gradient descent
method for finding a local minimum of non-convex optimization problems. This
paper provides an affirmative answer to this question. In particular, we
analyze two renowned variants of AG methods (namely Polyak's Heavy Ball method
and Nesterov's Accelerated Gradient method) for extracting the negative
curvature from random noise, which is central to escaping from saddle points.
By leveraging the proposed AG methods for extracting the negative curvature, we
present a new AG algorithm with double loops for non-convex
optimization~\footnote{This is in contrast to a single-loop AG algorithm
proposed in a recent manuscript~\citep{AGNON}, which directly analyzed
Nesterov's AG method for non-convex optimization and appeared online on
November 29, 2017. However, we emphasize that our work is independent, inspired
by our earlier work~\citep{NEON17} and based on a different novel analysis.},
which converges to a second-order stationary point $\mathbf{x}$ such that
$\|\nabla f(\mathbf{x})\|\leq \epsilon$ and $\nabla^2 f(\mathbf{x})\geq -\sqrt{\epsilon}\, I$
with $\widetilde{O}(1/\epsilon^{1.75})$ iteration complexity, improving that of
the gradient descent method by a factor of $\epsilon^{-0.25}$ and matching the
best iteration complexity of second-order
Hessian-free methods for non-convex optimization. Comment: The main result is
merged into our manuscript "First-order Stochastic Algorithms for Escaping From
Saddle Points in Almost Linear Time" (arXiv:1711.01944).
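The core primitive described here, extracting a negative-curvature direction using only gradient evaluations, can be sketched as a momentum (heavy-ball) iteration on the local quadratic, with Hessian-vector products replaced by finite differences of gradients. The step size, momentum, noise radius, and function names below are illustrative assumptions, not the paper's constants or guarantees.

```python
import numpy as np

def negative_curvature_direction(grad_f, x, n_iters=100, step=0.01,
                                 momentum=0.9, radius=1e-3, fd_eps=1e-5, rng=None):
    """Heavy-ball iteration on u -> 0.5 * u^T (Hessian f)(x) u starting from
    small random noise.  Hessian-vector products are approximated by gradient
    differences, so only first-order access to f is needed; if f has negative
    curvature at x, the iterate aligns with such a direction."""
    rng = np.random.default_rng() if rng is None else rng
    g0 = grad_f(x)
    hvp = lambda v: (grad_f(x + fd_eps * v) - g0) / fd_eps   # H v via gradients only
    u = rng.normal(size=x.shape)
    u *= radius / np.linalg.norm(u)
    u_prev = u.copy()
    for _ in range(n_iters):
        u, u_prev = u - step * hvp(u) + momentum * (u - u_prev), u
        nrm = np.linalg.norm(u)
        if nrm > 1.0:                       # rescale both iterates to stay bounded
            u, u_prev = u / nrm, u_prev / nrm
    u /= np.linalg.norm(u)
    return u, u @ hvp(u)                    # direction and its (approximate) curvature

# Hypothetical usage at the strict saddle of f(x) = x_0^2 - x_1^2 at the origin.
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
u, curv = negative_curvature_direction(grad, np.zeros(2))
print(u, curv)                              # u close to (0, +-1), curv close to -2
```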
Exponential Family Estimation via Adversarial Dynamics Embedding
We present an efficient algorithm for maximum likelihood estimation (MLE) of
exponential family models, with a general parametrization of the energy
function that includes neural networks. We exploit the primal-dual view of the
MLE with a kinetics augmented model to obtain an estimate associated with an
adversarial dual sampler. To represent this sampler, we introduce a novel
neural architecture, dynamics embedding, that generalizes Hamiltonian
Monte-Carlo (HMC). The proposed approach inherits the flexibility of HMC while
enabling tractable entropy estimation for the augmented model. By learning both
a dual sampler and the primal model simultaneously, and sharing parameters
between them, we obviate the requirement to design a separate sampling
procedure once the model has been trained, leading to more effective learning.
We show that many existing estimators, such as contrastive divergence,
pseudo/composite-likelihood, score matching, minimum Stein discrepancy
estimator, non-local contrastive objectives, noise-contrastive estimation, and
minimum probability flow, are special cases of the proposed approach, each
expressed by a different (fixed) dual sampler. An empirical investigation shows
that adapting the sampler during MLE can significantly improve on
state-of-the-art estimators. Comment: Appearing in NeurIPS 2019, Vancouver,
Canada; a preliminary version published in the NeurIPS 2018 Bayesian Deep
Learning Workshop.
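To make the HMC connection concrete, here is a bare leapfrog integrator written as an unrolled layer whose step sizes are free parameters; in the paper it is dynamics of this kind (and their parameters) that are learned as the dual sampler, whereas this sketch omits the entropy estimation and the joint training, and all names are assumptions.

```python
import numpy as np

def leapfrog_embedding(energy_grad, x0, v0, step_sizes):
    """Unrolled leapfrog (HMC) dynamics for the Hamiltonian E(x) + 0.5 ||v||^2.
    The list of step sizes plays the role of learnable sampler parameters."""
    x, v = x0.copy(), v0.copy()
    for eps in step_sizes:
        v = v - 0.5 * eps * energy_grad(x)   # half step on the momentum
        x = x + eps * v                      # full step on the position
        v = v - 0.5 * eps * energy_grad(x)   # half step on the momentum
    return x, v

# Hypothetical usage with a Gaussian energy E(x) = 0.5 * ||x||^2 (so grad E = x).
rng = np.random.default_rng(0)
x, v = leapfrog_embedding(lambda z: z, rng.normal(size=2), rng.normal(size=2),
                          step_sizes=[0.1] * 20)
print(x, v)
```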
Inexact Newton Methods for Stochastic Nonconvex Optimization with Applications to Neural Network Training
We study stochastic inexact Newton methods and consider their application in
nonconvex settings. Building on the work of [R. Bollapragada, R. H. Byrd, and
J. Nocedal, IMA Journal of Numerical
Analysis, 39 (2018), pp. 545--578], we derive bounds for convergence rates in
expected value for stochastic low-rank Newton methods and stochastic inexact
Newton-Krylov methods. These bounds quantify the errors incurred in subsampling
the Hessian and gradient, as well as in approximating the Newton linear solve,
and in choosing regularization and step length parameters. We deploy these
methods in training convolutional autoencoders for the MNIST and CIFAR10 data
sets. Numerical results demonstrate that, relative to first order methods,
these stochastic inexact Newton methods often converge faster, are more
cost-effective, and generalize better.
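A single iteration of a subsampled inexact Newton-CG method of the kind analyzed here might be organized as below: gradient and Hessian information come from mini-batches, and the damped Newton system is solved only approximately by a few conjugate-gradient steps. The callables `grad_fn` and `hvp_fn`, the batch sizes, and the damping are assumptions for illustration.

```python
import numpy as np

def subsampled_newton_cg_step(x, grad_fn, hvp_fn, data, grad_batch=256, hess_batch=64,
                              cg_iters=10, damping=1e-3, step=1.0, rng=None):
    """One inexact Newton step: subsample the gradient and the Hessian-vector
    products, then run a few CG iterations on (H + damping * I) p = -g."""
    rng = np.random.default_rng() if rng is None else rng
    g_idx = rng.choice(len(data), size=grad_batch, replace=False)
    h_idx = rng.choice(len(data), size=hess_batch, replace=False)
    g = grad_fn(x, data[g_idx])                # subsampled gradient
    p, r = np.zeros_like(x), -g.copy()         # CG for the damped Newton system
    d = r.copy()
    for _ in range(cg_iters):
        Hd = hvp_fn(x, d, data[h_idx]) + damping * d
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        d = r_new + (r_new @ r_new) / (r @ r) * d
        r = r_new
    return x + step * p                        # a line search would choose `step` in practice
```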
Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a
stationary-point finding algorithm into a local-minimum finding one, and (2)
replace the Hessian-vector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a first-order method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some of the best
known results. Comment: Versions 2 and 3 improve the writing.
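The reduction's control flow can be pictured as the loop below: run any black-box stationary-point finder, then call a purely gradient-based negative-curvature search (for instance, a finite-difference routine like the one sketched for NEON+ above) and either escape the saddle or stop. All thresholds, names, and callables are illustrative assumptions; the actual reduction comes with careful parameter settings and guarantees not captured here.

```python
import numpy as np

def find_approx_local_minimum(x0, f, grad_f, stationary_point_finder, neg_curv_search,
                              grad_tol=1e-3, curv_tol=1e-2, max_rounds=50, step=0.1):
    """Alternate a stationary-point finder (e.g. SGD/GD/SCSG/SVRG) with a
    first-order negative-curvature search; stop when the gradient is small and
    no sufficiently negative curvature direction is found."""
    x = x0
    for _ in range(max_rounds):
        x = stationary_point_finder(x)
        if np.linalg.norm(grad_f(x)) > grad_tol:
            continue                                   # not yet near a stationary point
        u, curvature = neg_curv_search(x)              # uses only gradient queries
        if curvature >= -curv_tol:
            return x                                   # approximate local minimum
        # Escape the saddle along whichever sign of u decreases the objective.
        x = min([x + step * u, x - step * u], key=f)
    return x
```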
Noisy Accelerated Power Method for Eigenproblems with Applications
This paper introduces an efficient algorithm for finding the dominant
generalized eigenvectors of a pair of symmetric matrices. Combining tools from
approximation theory and convex optimization, we develop a simple scalable
algorithm with strong theoretical performance guarantees. More precisely, the
algorithm retains the simplicity of the well-known power method but enjoys the
asymptotic iteration complexity of the powerful Lanczos method. Unlike these
classic techniques, our algorithm is designed to decompose the overall problem
into a series of subproblems that only need to be solved approximately. The
combination of good initializations, fast iterative solvers, and appropriate
error control in solving the subproblems leads to a linear running time in the
input sizes compared to the superlinear time for the traditional methods. The
improved running time immediately offers acceleration for several applications.
As an example, we demonstrate how the proposed algorithm can be used to
accelerate canonical correlation analysis, which is a fundamental statistical
tool for learning a low-dimensional representation of high-dimensional
objects. Numerical experiments on real-world data sets confirm that our
approach yields significant improvements over the current state-of-the-art.
Comment: Accepted for publication in the IEEE Transactions on Signal
Processing.
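The structure described here, a power-method outer loop whose inner linear systems are solved only approximately, plus a momentum term for acceleration, can be sketched as follows for the generalized eigenproblem A v = lambda B v. The momentum value, the CG inner solver, and the construction in the usage snippet are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def cg_solve(M, b, n_iters=8):
    """A few conjugate-gradient iterations: an inexact solve of M x = b (M SPD)."""
    x = np.zeros_like(b)
    r, d = b.copy(), b.copy()
    for _ in range(n_iters):
        Md = M @ d
        alpha = (r @ r) / (d @ Md)
        x = x + alpha * d
        r_new = r - alpha * Md
        d = r_new + (r_new @ r_new) / (r @ r) * d
        r = r_new
    return x

def accelerated_inexact_power_method(A, B, n_iters=100, solver_iters=8,
                                     momentum=0.3, rng=None):
    """Power iteration for the dominant generalized eigenpair of (A, B): each
    application of B^{-1} A is an approximate CG solve, and a heavy-ball
    momentum term is subtracted before renormalization."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=A.shape[0])
    w /= np.linalg.norm(w)
    w_prev = np.zeros_like(w)
    for _ in range(n_iters):
        y = cg_solve(B, A @ w, n_iters=solver_iters) - momentum * w_prev
        nrm = np.linalg.norm(y)
        w_prev, w = w / nrm, y / nrm               # rescale both iterates consistently
    return w, (w @ A @ w) / (w @ B @ w)            # generalized Rayleigh quotient

# Hypothetical usage: a pencil (A, B) whose dominant generalized eigenvalue is 2.0.
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(50, 50)))[0]
A = Q @ np.diag(np.linspace(0.1, 4.0, 50)) @ Q.T
B = Q @ np.diag(np.linspace(1.0, 2.0, 50)) @ Q.T
w, lam = accelerated_inexact_power_method(A, B, rng=rng)
print(lam)                                         # close to 4.0 / 2.0 = 2.0
```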
Acceleration via Fractal Learning Rate Schedules
In practical applications of iterative first-order optimization, the learning
rate schedule remains notoriously difficult to understand and expensive to
tune. We demonstrate the presence of these subtleties even in the innocuous
case when the objective is a convex quadratic. We reinterpret an iterative
algorithm from the numerical analysis literature as what we call the Chebyshev
learning rate schedule for accelerating vanilla gradient descent, and show that
the problem of mitigating instability leads to a fractal ordering of step
sizes. We provide some experiments to challenge conventional beliefs about
stable learning rates in deep learning: the fractal schedule enables training
to converge with locally unstable updates which make negative progress on the
objective. Comment: v2: revisions for ICML 2021.
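The Chebyshev schedule itself is easy to write down for a quadratic whose Hessian eigenvalues lie in [mu, L]: the step sizes are the reciprocals of the Chebyshev nodes on that interval, and the ordering is what requires care. The sketch below uses a classical bit-reversal ordering as a stand-in for the paper's fractal ordering (they are related but not identical), and all names and constants are illustrative.

```python
import numpy as np

def chebyshev_steps(mu, L, n):
    """Step sizes 1/lambda_k, where lambda_k are the Chebyshev nodes on [mu, L];
    after all n steps the residual polynomial is the minimax-optimal one."""
    k = np.arange(1, n + 1)
    nodes = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos(np.pi * (2 * k - 1) / (2 * n))
    return 1.0 / nodes

def bit_reversal_permutation(n):
    """A classical stabilizing ordering of the steps (n must be a power of two);
    the fractal ordering in the paper is a related recursive construction."""
    bits = int(np.log2(n))
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])

# Hypothetical usage: GD on a quadratic 0.5 x^T H x with spectrum inside [mu, L].
rng = np.random.default_rng(0)
mu, L, n = 0.1, 10.0, 16
H = np.diag(np.linspace(mu, L, 50))
x = rng.normal(size=50)
for eta in chebyshev_steps(mu, L, n)[bit_reversal_permutation(n)]:
    x = x - eta * (H @ x)          # several steps exceed the "stable" threshold 2/L
print(np.linalg.norm(x))           # yet the full sweep contracts the error substantially
```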