289 research outputs found
Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition
In this paper, a new theory is developed for first-order stochastic convex
optimization, showing that the global convergence rate is sufficiently
quantified by a local growth rate of the objective function in a neighborhood
of the optimal solutions. In particular, if the objective function in the $\epsilon$-sublevel set grows as fast as $\|w - w_*\|_2^{1/\theta}$, where $w_*$ represents the closest optimal solution to $w$ and $\theta \in (0, 1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $\epsilon$-optimal solution can be $\widetilde{O}(1/\epsilon^{2(1-\theta)})$, which is optimal at most up to a logarithmic factor. To achieve the faster
global convergence, we develop two different accelerated stochastic subgradient
methods by iteratively solving the original problem approximately in a local
region around a historical solution with the size of the local region gradually
decreasing as the solution approaches the optimal set. Besides the theoretical
improvements, this work also includes new contributions towards making the
proposed algorithms practical: (i) we present practical variants of accelerated stochastic subgradient methods that can run without knowledge of the multiplicative growth constant or even the growth rate; (ii) we
consider a broad family of problems in machine learning to demonstrate that the
proposed algorithms enjoy faster convergence than the traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring that the gradient is small, without the smoothness assumption.
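The restart idea described in this abstract, solving the problem approximately in a local region around the last solution and shrinking that region stage by stage, can be sketched as a toy loop. Everything below is an illustrative assumption for a one-dimensional problem, not the paper's algorithm: the stage counts, the halving schedule, and the noisy sign oracle are made up for the sketch.

```python
import random

def restarted_stochastic_subgradient(subgrad, x0, radius0, eta0,
                                     stages=6, iters_per_stage=200):
    """Toy sketch of the restart idea: each stage runs projected stochastic
    subgradient steps inside an interval around the previous stage's output,
    then halves both the stepsize and the interval radius. Stage counts,
    halving schedule, and 1-D projection are illustrative assumptions."""
    x, radius, eta = x0, radius0, eta0
    for _ in range(stages):
        center, acc = x, 0.0
        for _ in range(iters_per_stage):
            x = x - eta * subgrad(x)                           # noisy subgradient step
            x = max(center - radius, min(center + radius, x))  # project onto interval
            acc += x
        x = acc / iters_per_stage   # stage output: averaged iterate
        radius /= 2.0
        eta /= 2.0
    return x

# toy sharp objective f(x) = |x| with a noisy subgradient oracle
def noisy_sign(x):
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return g + random.gauss(0.0, 0.1)

random.seed(0)
x_hat = restarted_stochastic_subgradient(noisy_sign, x0=5.0, radius0=5.0, eta0=1.0)
print(abs(x_hat))
```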
Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions
Error bound conditions (EBC) are properties that characterize the growth of
an objective function when a point is moved away from the optimal set. They
have recently received increasing attention in the field of optimization for
developing optimization algorithms with fast convergence. However, the studies
of EBC in statistical learning have hitherto been limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous and smooth convex random functions.
Second, we establish fast and intermediate rates of an efficient stochastic
approximation (SA) algorithm for risk minimization with Lipschitz continuous
random functions, which requires only one pass of samples and adapts to
EBC. For both approaches, the convergence rates span a full spectrum between $O(1/\sqrt{n})$ and $O(1/n)$ depending on the power constant in EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without using
any knowledge of EBC. Overall, this work not only strengthens the understanding
of ERM for statistical learning but also brings new fast stochastic algorithms
for solving a broad range of statistical learning problems.
Faster Subgradient Methods for Functions with H\"olderian Growth
The purpose of this manuscript is to derive new convergence results for
several subgradient methods applied to minimizing nonsmooth convex functions
with H\"olderian growth. The growth condition is satisfied in many applications
and includes functions with quadratic growth and weakly sharp minima as special
cases. To this end there are three main contributions. First, for a constant
and sufficiently small stepsize, we show that the subgradient method achieves
linear convergence up to a certain region including the optimal set, with error
of the order of the stepsize. Second, if appropriate problem parameters are
known, we derive a decaying stepsize which obtains a much faster convergence
rate than is suggested by the classical $O(1/\sqrt{k})$ result for the subgradient method. Third, we develop a novel "descending stairs" stepsize
which obtains this faster convergence rate and also obtains linear convergence
for the special case of weakly sharp functions. We also develop an adaptive
variant of the "descending stairs" stepsize which achieves the same convergence
rate without requiring an error bound constant which is difficult to estimate
in practice.
Comment: 50 pages. First revised version (under submission to Math Programming).
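The "descending stairs" idea, holding the stepsize constant for a block of iterations and then halving it, can be sketched as follows. The stair length, the halving factor, and the toy weakly sharp objective f(x) = |x| are illustrative assumptions, not the schedule derived in the manuscript.

```python
def descending_stairs_subgradient(subgrad, x0, eta0, stair_len=50, num_stairs=8):
    """Sketch of a "descending stairs" stepsize: keep the stepsize constant for
    a fixed block ("stair") of iterations, then halve it. The stair length and
    halving factor are illustrative assumptions, not the derived schedule."""
    x, eta = x0, eta0
    for _ in range(num_stairs):
        for _ in range(stair_len):
            x = x - eta * subgrad(x)
        eta /= 2.0
    return x

# weakly sharp toy objective f(x) = |x| with subgradient sign(x)
def sign(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x_final = descending_stairs_subgradient(sign, x0=10.25, eta0=1.0)
print(abs(x_final))
```

On this sharp toy problem the error tracks the current stepsize, so halving the stepsize per stair gives the linear convergence the abstract describes for weakly sharp functions.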
Learn-and-Adapt Stochastic Dual Gradients for Network Resource Allocation
Network resource allocation shows revived popularity in the era of data
deluge and information explosion. Existing stochastic optimization approaches
fall short in attaining a desirable cost-delay tradeoff. Recognizing the
central role of Lagrange multipliers in network resource allocation, a novel
learn-and-adapt stochastic dual gradient (LA-SDG) method is developed in this
paper to learn the sample-optimal Lagrange multiplier from historical data, and
accordingly adapt the upcoming resource allocation strategy. Remarkably, LA-SDG
only requires just an extra sample (gradient) evaluation relative to the
celebrated stochastic dual gradient (SDG) method. LA-SDG can be interpreted as
a foresighted learning scheme with an eye on the future, or, a modified
heavy-ball iteration from an optimization viewpoint. It is established - both
theoretically and empirically - that LA-SDG markedly improves the cost-delay
tradeoff over state-of-the-art allocation schemes.
Stochastic algorithms with geometric step decay converge linearly on sharp functions
Stochastic (sub)gradient methods require step size schedule tuning to perform
well in practice. Classical tuning strategies decay the step size polynomially
and lead to optimal sublinear rates on (strongly) convex problems. An
alternative schedule, popular in nonconvex optimization, is called
\emph{geometric step decay} and proceeds by halving the step size after every
few epochs. In recent work, geometric step decay was shown to improve
exponentially upon classical sublinear rates for the class of \emph{sharp}
convex functions. In this work, we ask whether geometric step decay similarly
improves stochastic algorithms for the class of sharp nonconvex problems. Such
losses feature in modern statistical recovery problems and lead to a new
challenge not present in the convex setting: the region of convergence is
local, so one must bound the probability of escape. Our main result shows that
for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a
geometric step decay schedule endows well-known algorithms with a local linear
rate of convergence to global minimizers. This guarantee applies to the
stochastic projected subgradient, proximal point, and prox-linear algorithms.
As an application of our main result, we analyze two statistical recovery
tasks---phase retrieval and blind deconvolution---and match the best known
guarantees under Gaussian measurement models and establish new guarantees under
heavy-tailed distributions.
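A minimal sketch of geometric step decay for a stochastic projected subgradient method, on a toy sharp nonconvex objective f(x) = |x^2 - 1| rather than the paper's recovery problems; the epoch length, decay factor, noise level, and projection region are made-up illustration parameters.

```python
import random

def geometric_step_decay_sgd(subgrad, project, x0, eta0, epochs=10, epoch_len=200):
    """Sketch of geometric step decay: run the stochastic projected subgradient
    method with a constant stepsize within each epoch and halve the stepsize
    between epochs. All parameters here are illustrative assumptions."""
    x, eta = x0, eta0
    for _ in range(epochs):
        for _ in range(epoch_len):
            x = project(x - eta * subgrad(x))
        eta /= 2.0
    return x

# sharp nonconvex toy: f(x) = |x**2 - 1|, global minimizers at x = +1 and x = -1
def noisy_subgrad(x):
    s = 1.0 if x * x > 1.0 else -1.0
    return s * 2.0 * x + random.gauss(0.0, 0.2)

project = lambda x: min(max(x, 0.1), 10.0)  # stay in a local region around x = 1

random.seed(1)
x_hat = geometric_step_decay_sgd(noisy_subgrad, project, x0=3.0, eta0=0.5)
print(abs(x_hat - 1.0))
```

The projection plays the role of the local region in the abstract: convergence is only local, so the iterates are confined near one of the global minimizers.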
SUCAG: Stochastic Unbiased Curvature-aided Gradient Method for Distributed Optimization
We propose and analyze a new stochastic gradient method, which we call
Stochastic Unbiased Curvature-aided Gradient (SUCAG), for finite sum
optimization problems. SUCAG constitutes an unbiased total gradient tracking
technique that uses Hessian information to accelerate convergence. We analyze
our method under the general asynchronous model of computation, in which each
function is selected infinitely often with possibly unbounded (but sublinear)
delay. For strongly convex problems, we establish linear convergence for the
SUCAG method. When the initialization point is sufficiently close to the
optimal solution, the established convergence rate is only dependent on the
condition number of the problem, making it strictly faster than the known rate
for the SAGA method. Furthermore, we describe a Markov-driven approach of
implementing the SUCAG method in a distributed asynchronous multi-agent
setting, via gossiping along a random walk on an undirected communication
graph. We show that our analysis applies as long as the graph is connected and,
notably, establishes an asymptotic linear convergence rate that is robust to
the graph topology. Numerical results demonstrate the merits of our algorithm
over existing methods.
Comment: to appear in CDC 2018; 17 pages, 2 figures.
Convergence Rate of Distributed Optimization Algorithms Based on Gradient Tracking
We study distributed, strongly convex and nonconvex, multiagent optimization
over (directed, time-varying) graphs. We consider the minimization of the sum of a smooth (possibly nonconvex) function--the agents' sum-utility--plus a nonsmooth convex one, subject to convex constraints. In a companion paper, we
introduced SONATA, the first algorithmic framework applicable to such a general
class of composite minimization, and we studied its convergence when the smooth
part of the objective function is nonconvex. The algorithm combines successive
convex approximation techniques with a perturbed push-sum consensus mechanism
that aims to track locally the gradient of the (smooth part of the)
sum-utility. This paper studies the convergence rate of SONATA. When the smooth
part of the objective function is strongly convex, SONATA is proved to converge
at a linear rate, whereas a sublinear rate is proved when the objective function is
nonconvex. To our knowledge, this is the first work proving a convergence rate
(in particular, linear rate) for distributed algorithms applicable to such a
general class of composite, constrained optimization problems over graphs.
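The gradient-tracking mechanism this abstract refers to can be illustrated with the standard scheme for undirected graphs and doubly stochastic mixing. This is a generic sketch, not SONATA itself, which additionally uses successive convex approximation and a perturbed push-sum step for directed, time-varying graphs; the toy quadratics and Metropolis weights are illustrative assumptions.

```python
import numpy as np

def gradient_tracking(grads, W, x0, alpha, iters=600):
    """Generic gradient tracking: each agent mixes neighbors' iterates with a
    doubly stochastic matrix W, steps along a local tracker y_i of the average
    gradient, and updates the tracker with its new local gradient."""
    n = len(grads)
    x = np.full(n, x0, dtype=float)
    g = np.array([grads[i](x[i]) for i in range(n)])
    y = g.copy()                       # trackers initialized with local gradients
    for _ in range(iters):
        x_new = W @ x - alpha * y      # consensus step plus tracked-gradient step
        g_new = np.array([grads[i](x_new[i]) for i in range(n)])
        y = W @ y + g_new - g          # track the network-average gradient
        x, g = x_new, g_new
    return x

# three agents on a path graph minimizing f_i(x) = (x - b_i)^2 / 2; optimum is mean(b)
b = [1.0, 4.0, 7.0]
grads = [lambda x, bi=bi: x - bi for bi in b]
W = np.array([[2/3, 1/3, 0.0],         # Metropolis weights for the path 1-2-3
              [1/3, 1/3, 1/3],         # (symmetric, doubly stochastic)
              [0.0, 1/3, 2/3]])
x = gradient_tracking(grads, W, x0=0.0, alpha=0.05)
print(x)
```

Because the trackers preserve the average of the local gradients, every agent converges to the minimizer of the sum-utility (here 4.0), not of its own cost.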
Distributed Stochastic Multi-Task Learning with Graph Regularization
We propose methods for distributed graph-based multi-task learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or diminishing stepsize in these methods would yield consensus (single task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.
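A minimal sketch of the skewed-averaging idea, under illustrative assumptions (scalar models, quadratic per-machine losses, a single self-weight parameter): each machine takes a local gradient step and then averages with the others, keeping more weight on its own model than uniform averaging would.

```python
import numpy as np

def multitask_skewed_averaging(grads, x0, self_weight=0.8, alpha=0.05, iters=400):
    """Sketch of multi-task learning by skewed weighted averaging: each machine
    takes a local gradient step, then averages with the others using a skewed
    weight. self_weight = 1/n recovers uniform averaging (consensus, i.e.
    single-task learning); larger self_weight lets machines keep related but
    distinct models. The weighting scheme here is illustrative."""
    n = len(grads)
    x = np.full(n, x0, dtype=float)
    other_weight = (1.0 - self_weight) / (n - 1)
    for _ in range(iters):
        x = x - alpha * np.array([grads[i](x[i]) for i in range(n)])  # local step
        x = self_weight * x + other_weight * (x.sum() - x)            # skewed average
    return x

# three related tasks f_i(x) = (x - b_i)^2 / 2 with distinct targets
b = [1.0, 2.0, 3.0]
grads = [lambda x, bi=bi: x - bi for bi in b]
x = multitask_skewed_averaging(grads, x0=0.0)
print(x)
```

The machines end up with different but mutually shrunk models: each x_i sits between its own target b_i and the group mean, with the gap controlled by the skew.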
Understanding the Learned Iterative Soft Thresholding Algorithm with matrix factorization
Sparse coding is a core building block in many data analysis and machine
learning pipelines. Typically it is solved by relying on generic optimization
techniques, such as the Iterative Soft Thresholding Algorithm and its
accelerated version (ISTA, FISTA). These methods are optimal in the class of
first-order methods for non-smooth, convex functions. However, they do not
exploit the particular structure of the problem at hand nor the input data
distribution. An acceleration using neural networks, coined LISTA, was proposed
in Gregor and LeCun (2010), which showed empirically that one could achieve
high quality estimates with few iterations by modifying the parameters of the
proximal splitting appropriately.
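For reference, plain ISTA, the baseline that LISTA learns to accelerate, alternates a gradient step on the quadratic data-fit term with soft thresholding (the proximal operator of the l1 norm). The dictionary and sparse code below are synthetic illustrations.

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """ISTA for the sparse coding problem
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        z = x - grad / L                   # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10)) / np.sqrt(20)   # synthetic dictionary
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]                      # sparse ground-truth code
b = A @ x_true
x_hat = ista(A, b, lam=0.01)
print(x_hat)
```

LISTA keeps this two-step structure but replaces the fixed matrices A.T @ A and A.T with learned parameters, which is what the matrix-factorization analysis in this paper examines.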
In this paper we study the reasons for such acceleration. Our mathematical
analysis reveals that it is related to a specific matrix factorization of the
Gram kernel of the dictionary, which attempts to nearly diagonalise the kernel with a basis that produces a small perturbation of the $\ell_1$ ball. When this
factorization succeeds, we prove that the resulting splitting algorithm enjoys
an improved convergence bound with respect to the non-adaptive version.
Moreover, our analysis also shows that conditions for acceleration occur mostly
at the beginning of the iterative process, consistent with numerical
experiments. We further validate our analysis by showing that on dictionaries
where this factorization does not exist, adaptive acceleration fails.
Comment: Ongoing work. This document is not complete and might contain errors. arXiv admin note: text overlap with arXiv:1609.0028
The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares
Minimax optimal convergence rates for classes of stochastic convex
optimization problems are well characterized, where the majority of results
utilize iterate averaged stochastic gradient descent (SGD) with polynomially
decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this
observation, this work provides a detailed study of the following question:
what rate is achievable using the final iterate of SGD for the streaming least
squares regression problem with and without strong convexity?
First, this work shows that even if the time horizon T (i.e. the number of
iterations SGD is run for) is known in advance, SGD's final iterate behavior
with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In
contrast, this paper shows that Step Decay schedules, which cut the learning
rate by a constant factor every constant number of epochs (i.e., the learning
rate decays geometrically) offers significant improvements over any
polynomially decaying step sizes. In particular, the final iterate behavior with a step decay schedule is off the minimax rate by only log factors (in the condition number for the strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper
shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is
poor (in that it queries iterates with highly sub-optimal function value
infinitely often, i.e. in a limsup sense) irrespective of the stepsizes
employed. These results demonstrate the subtlety in establishing optimal
learning rate schemes (for the final iterate) for stochastic gradient
procedures in fixed time horizon settings.
Comment: Appears in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 Algorithm, 7 figures.
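The step decay schedule studied here, cutting the learning rate by a constant factor every epoch and returning the final iterate (no averaging), can be sketched on a scalar streaming least squares toy. The cut factor, epoch lengths, and noise model are illustrative choices, not the paper's tuned constants.

```python
import random

def step_decay_sgd(stream, w0, eta0, epochs=8, epoch_len=500):
    """Sketch of the Step Decay schedule for streaming least squares: run SGD
    with a constant learning rate within each epoch and halve it between
    epochs, returning the final iterate rather than an average."""
    w, eta = w0, eta0
    for _ in range(epochs):
        for _ in range(epoch_len):
            a, y = stream()                  # one fresh streaming sample (a, y)
            w = w - eta * (a * w - y) * a    # SGD step on 0.5 * (a*w - y)**2
        eta /= 2.0
    return w

w_star = 3.0

def stream():
    # streaming model: y = a * w_star + noise, with Gaussian covariate a
    a = random.gauss(0.0, 1.0)
    return a, a * w_star + random.gauss(0.0, 0.5)

random.seed(2)
w_hat = step_decay_sgd(stream, w0=0.0, eta0=0.1)
print(abs(w_hat - w_star))
```

Within each epoch the constant stepsize drives the iterate to a noise floor proportional to the stepsize; halving between epochs then shrinks that floor geometrically, which is the mechanism behind the near-minimax final-iterate guarantee.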