Stochastic Frank-Wolfe Methods for Nonconvex Optimization
We study Frank-Wolfe methods for nonconvex stochastic and finite-sum
optimization problems. Frank-Wolfe methods (in the convex case) have gained
tremendous recent interest in machine learning and optimization communities due
to their projection-free property and their ability to exploit structured
constraints. However, our understanding of these algorithms in the nonconvex
setting is fairly limited. In this paper, we propose nonconvex stochastic
Frank-Wolfe methods and analyze their convergence properties. For objective
functions that decompose into a finite-sum, we leverage ideas from variance
reduction techniques for convex optimization to obtain new variance reduced
nonconvex Frank-Wolfe methods that have provably faster convergence than the
classical Frank-Wolfe method. Finally, we show that the faster convergence
rates of our variance reduced methods also translate into improved convergence
rates for the stochastic setting.
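To make the projection-free mechanism concrete, here is a minimal Python sketch of a generic nonconvex stochastic Frank-Wolfe loop (not the specific variance-reduced variants proposed in the paper): a minibatch gradient estimate feeds a linear minimization oracle, illustrated for an L1-ball constraint; the toy objective, batch size, and step-size schedule are illustrative assumptions.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle over the L1 ball: argmin_{||s||_1 <= radius} <grad, s>."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def stochastic_frank_wolfe(grad_fn, x0, n_samples, T=200, batch=32, radius=1.0, seed=0):
    """Sketch of a stochastic Frank-Wolfe loop: estimate the gradient on a minibatch,
    call the LMO, and move toward the returned vertex with a decaying step size."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(T):
        idx = rng.choice(n_samples, size=batch, replace=False)
        g = grad_fn(x, idx)              # minibatch gradient estimate
        s = lmo_l1_ball(g, radius)       # projection-free direction from the LMO
        gamma = 2.0 / (t + 2)            # classical Frank-Wolfe step-size schedule
        x = (1 - gamma) * x + gamma * s  # convex combination keeps x feasible
    return x

# Toy nonconvex finite-sum example (illustrative): f_i(x) = sin(a_i . x)
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
grad_fn = lambda x, idx: (np.cos(A[idx] @ x)[:, None] * A[idx]).mean(axis=0)
x_hat = stochastic_frank_wolfe(grad_fn, np.zeros(20), n_samples=1000)
```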
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
We introduce a generic scheme to solve nonconvex optimization problems using
gradient-based algorithms originally designed for minimizing convex functions.
Even though these methods may originally require convexity to operate, the
proposed approach allows one to use them on weakly convex objectives, which
covers a large class of non-convex functions typically appearing in machine
learning and signal processing. In general, the scheme is guaranteed to produce
a stationary point with a worst-case efficiency typical of first-order methods,
and when the objective turns out to be convex, it automatically accelerates in
the sense of Nesterov and achieves near-optimal convergence rate in function
values. These properties are achieved without assuming any knowledge about the
convexity of the objective, by automatically adapting to the unknown weak
convexity constant. We conclude the paper by showing promising experimental
results obtained by applying our approach to incremental algorithms such as
SVRG and SAGA for sparse matrix factorization and for learning neural networks.
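As a rough illustration of such a generic wrapper (a simplified sketch, not the paper's Catalyst algorithm or its parameter rules), the snippet below adds a quadratic proximal term (kappa/2)||x - y||^2 around the objective and hands the better-conditioned subproblem to an arbitrary inner first-order solver, then extrapolates; the fixed momentum weight and the plain gradient-descent inner solver are stand-in assumptions.

```python
import numpy as np

def catalyst_outer_loop(grad_f, x0, inner_solver, kappa=1.0, n_outer=50):
    """Sketch of a Catalyst-style wrapper: each outer iteration approximately minimizes
    f(x) + (kappa/2)||x - y_k||^2 with a user-supplied inner solver, then extrapolates."""
    x_prev = x0.copy()
    x = x0.copy()
    y = x0.copy()
    alpha = 0.5                                        # simplified, fixed momentum parameter
    for _ in range(n_outer):
        # Gradient of the regularized (better-conditioned) subproblem.
        grad_sub = lambda z: grad_f(z) + kappa * (z - y)
        x_prev, x = x, inner_solver(grad_sub, x)       # warm-start from current iterate
        beta = alpha * (1 - alpha) / (alpha**2 + alpha)  # illustrative extrapolation weight
        y = x + beta * (x - x_prev)                    # Nesterov-style extrapolation
    return x

def gradient_descent_inner(grad, x, steps=100, lr=0.05):
    """Plain gradient descent standing in for incremental solvers such as SVRG or SAGA."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy usage on a smooth objective (illustrative): f(x) = sum_i x_i^2 - cos(x_i).
grad_f = lambda x: 2 * x + np.sin(x)
x_hat = catalyst_outer_loop(grad_f, np.ones(5), gradient_descent_inner)
```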
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to
work favorably in solving large-scale learning problems. Despite the remarkable
success, the stochastic gradient complexity of SVRG-type algorithms usually
scales linearly with data size and thus could still be expensive for huge data.
To address this deficiency, we propose a hybrid stochastic-deterministic
minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems
that enjoys provably improved data-size-independent complexity guarantees. More
precisely, for a quadratic loss over $n$ components, we prove that HSDMPG can attain an
$\epsilon$-optimization error within a data-size-independent number of stochastic gradient
evaluations that depends on the condition number $\kappa$ and the target accuracy $\epsilon$. For
generic strongly convex loss functions, we prove a nearly identical complexity
bound though at the cost of slightly increased logarithmic factors. For
large-scale learning problems, our complexity bounds are superior to those of
the prior state-of-the-art SVRG algorithms with or without dependence on data
size. Particularly, when $\epsilon$ is at the order of the intrinsic excess-error bound of a
learning model, and thus sufficient for generalization, the stochastic gradient complexity
bounds of HSDMPG for quadratic and generic loss functions are both sub-linear in the data
size, which, to our best knowledge, for the first time
achieve optimal generalization in less than a single pass over data. Extensive
numerical results demonstrate the computational advantages of our algorithm
over the prior ones.
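Since the exact HSDMPG recursion is not reproduced above, the following sketch only illustrates the common minibatch proximal gradient building block on an L1-regularized least-squares problem (a generic baseline, not HSDMPG itself); the loss, step size, and batch size are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def minibatch_proximal_gradient(A, b, lam=0.1, lr=0.01, batch=64, epochs=5, seed=0):
    """Sketch of a plain minibatch proximal gradient loop for
    (1/2n)||Ax - b||^2 + lam * ||x||_1: stochastic gradient step, then a prox step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            g = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)  # minibatch gradient
            x = soft_threshold(x - lr * g, lr * lam)         # proximal (soft-threshold) step
    return x

# Toy usage on illustrative sparse-regression data.
rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 50))
b = A @ (rng.normal(size=50) * (rng.random(50) < 0.2)) + 0.01 * rng.normal(size=2000)
x_hat = minibatch_proximal_gradient(A, b)
```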
On the fast convergence of minibatch heavy ball momentum
Simple stochastic momentum methods are widely used in machine learning
optimization, but their good practical performance is at odds with an absence
of theoretical guarantees of acceleration in the literature. In this work, we
aim to close the gap between theory and practice by showing that stochastic
heavy ball momentum retains the fast linear rate of (deterministic) heavy ball
momentum on quadratic optimization problems, at least when minibatching with a
sufficiently large batch size. The algorithm we study can be interpreted as an
accelerated randomized Kaczmarz algorithm with minibatching and heavy ball
momentum. The analysis relies on carefully decomposing the momentum transition
matrix, and using new spectral norm concentration bounds for products of
independent random matrices. We provide numerical illustrations demonstrating
that our bounds are reasonably sharp.
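For concreteness, here is a simplified sketch of the algorithm family studied above: minibatch stochastic heavy ball momentum on a consistent least-squares system, which can also be read as a minibatched, momentum-augmented Kaczmarz-type iteration; the step size, momentum coefficient, and batch size below are ad hoc choices rather than the tuned values from the analysis.

```python
import numpy as np

def minibatch_heavy_ball(A, b, lr=0.5, beta=0.7, batch=100, iters=500, seed=0):
    """Sketch of minibatch heavy ball momentum on the quadratic (1/2)||Ax - b||^2:
    x_{k+1} = x_k - lr * g_k + beta * (x_k - x_{k-1}), with g_k a minibatch gradient
    built from a random block of rows of A."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(iters):
        idx = rng.choice(n, size=batch, replace=False)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch       # minibatch gradient
        x, x_prev = x - lr * g + beta * (x - x_prev), x    # heavy ball update
    return x

# Toy usage on a consistent linear system (illustrative).
rng = np.random.default_rng(2)
A = rng.normal(size=(5000, 40))
x_true = rng.normal(size=40)
x_hat = minibatch_heavy_ball(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))
```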
The Practicality of Stochastic Optimization in Imaging Inverse Problems
In this work we investigate the practicality of stochastic gradient descent
and recently introduced variants with variance-reduction techniques in imaging
inverse problems. Such algorithms have been shown in the machine learning
literature to have optimal complexities in theory and to provide great empirical improvements
over deterministic gradient methods. Surprisingly, in some tasks such as image deblurring, many
such methods fail to converge faster than the accelerated deterministic gradient methods, even
in terms of epoch counts. We investigate this phenomenon and propose a theory-inspired
mechanism for practitioners to efficiently characterize whether it is
beneficial for an inverse problem to be solved by stochastic optimization
techniques or not. Using standard tools in numerical linear algebra, we derive conditions on
the spectral structure of the inverse problem under which it is a suitable application for
stochastic gradient methods. In particular, we show that, for an imaging inverse problem,
stochastic gradient methods can be more advantageous than deterministic methods if and only if
its Hessian matrix has a fast-decaying eigenspectrum. Our results also provide guidance on
appropriately choosing minibatch partition schemes, showing that a good minibatch scheme
typically has relatively low correlation within each of the minibatches. Finally, we propose an
accelerated primal-dual SGD algorithm to tackle another key bottleneck of stochastic
optimization: the heavy computation of proximal operators. The proposed method has a fast
convergence rate in practice and is able to efficiently handle non-smooth regularization terms
that are coupled with linear operators.
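A rough way to act on the spectral criterion above (a simplification of the proposed mechanism, with an ad hoc threshold rather than the paper's exact test) is to inspect the eigenspectrum of the Hessian A^T A of a linear inverse problem and check how much of its mass sits in the leading eigenvalues:

```python
import numpy as np

def hessian_spectrum_decay(A, top_fraction=0.1):
    """Compute the eigenspectrum of the Hessian A^T A of the least-squares data term
    and report how much spectral mass sits in the leading eigenvalues. A fast-decaying
    spectrum (most mass in a few leading eigenvalues) is the regime where stochastic
    gradient methods are reported to pay off over accelerated deterministic ones."""
    eigvals = np.linalg.svd(A, compute_uv=False) ** 2   # eigenvalues of A^T A
    eigvals = np.sort(eigvals)[::-1]
    k = max(1, int(top_fraction * len(eigvals)))
    mass_in_top = eigvals[:k].sum() / eigvals.sum()
    return eigvals, mass_in_top

# Illustrative comparison: a fast-decaying spectrum vs. a flat one.
rng = np.random.default_rng(3)
U = np.linalg.qr(rng.normal(size=(300, 300)))[0]
V = np.linalg.qr(rng.normal(size=(200, 200)))[0]
fast = U[:, :200] @ np.diag(1.0 / (1 + np.arange(200)) ** 2) @ V.T  # blur-like operator
flat = U[:, :200] @ np.diag(np.ones(200)) @ V.T                     # well-spread spectrum
for name, A in [("fast-decaying", fast), ("flat", flat)]:
    _, mass = hessian_spectrum_decay(A)
    print(f"{name}: fraction of spectral mass in top 10% eigenvalues = {mass:.2f}")
```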