Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence
In this paper, we study stochastic non-convex optimization with non-convex
random functions. Recent studies on non-convex optimization revolve around
establishing second-order convergence, i.e., converging to a nearly
second-order optimal stationary point. However, existing results on stochastic
non-convex optimization are limited, especially with a high probability
second-order convergence. We propose a novel updating step (named NCG-S) by
leveraging a stochastic gradient and a noisy negative curvature of a stochastic
Hessian, where the stochastic gradient and Hessian are based on a proper
mini-batch of random functions. Building on this step, we develop two
algorithms and establish their high probability second-order convergence. To
the best of our knowledge, the proposed stochastic algorithms are the first
with a second-order convergence in {\it high probability} and a time complexity
that is {\it almost linear} in the problem's dimensionality.
Comment: This short paper will appear at the NIPS 2017 Optimization of Machine Learning Workshop. Partial results are presented in arXiv:1709.08571. The second version corrects a statement regarding previous work.
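As a rough illustration of the kind of update described in this abstract, the sketch below combines a mini-batch stochastic gradient with a noisy negative-curvature direction of a mini-batch Hessian. The function name, step sizes eta and eta_c, threshold eps2, and the dense eigendecomposition are assumptions made for readability; they are not the authors' exact NCG-S rule.

import numpy as np

def ncg_s_step(x, grad_batch, hess_batch, eta=0.1, eta_c=0.1, eps2=0.01):
    """One NCG-S-style update (illustrative sketch, not the paper's exact rule):
    combine a mini-batch stochastic gradient with a noisy negative-curvature
    direction extracted from a mini-batch stochastic Hessian."""
    eigvals, eigvecs = np.linalg.eigh(hess_batch)  # a Lanczos solver would replace this at scale
    lam_min, v = eigvals[0], eigvecs[:, 0]
    if lam_min < -eps2:
        # Significant negative curvature: move along it, oriented for descent.
        if v @ grad_batch > 0:
            v = -v
        return x + eta_c * abs(lam_min) * v
    # Otherwise fall back to an ordinary stochastic gradient step.
    return x - eta * grad_batch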
On Noisy Negative Curvature Descent: Competing with Gradient Descent for Faster Non-convex Optimization
The Hessian-vector product has been utilized to find a second-order
stationary solution with a strong complexity guarantee (e.g., almost linear time
complexity in the problem's dimensionality). In this paper, we propose to
further reduce the number of Hessian-vector products for faster non-convex
optimization. Previous algorithms need to approximate the smallest eigenvalue
with sufficient precision (e.g., $\epsilon_2$) in order to achieve a
sufficiently accurate second-order stationary solution (i.e.,
$\lambda_{\min}(\nabla^2 f(x))\geq -\epsilon_2$). In contrast, the proposed
algorithms only need to compute the smallest eigenvector, approximating the
corresponding eigenvalue up to a small power of the current gradient's norm. As a
result, it can dramatically reduce the number of Hessian-vector products during
the course of optimization before reaching first-order stationary points (e.g.,
saddle points). The key building block of the proposed algorithms is a novel
updating step named the NCG step, which lets a noisy negative curvature descent
compete with the gradient descent. We show that the worst-case time complexity
of the proposed algorithms with their favorable prescribed accuracy
requirements can match the best in the literature for achieving a second-order
stationary point but with an arguably smaller per-iteration cost. We also show
that the proposed algorithms can benefit from an inexact Hessian by developing
variants that accept an inexact Hessian under a mild condition for achieving
the same goal. Moreover, we develop a stochastic algorithm for a finite or
infinite sum non-convex optimization problem. To the best of our knowledge, the
proposed stochastic algorithm is the first one that converges to a second-order
stationary point in {\it high probability} with a time complexity independent
of the sample size and almost linear in dimensionality.
Comment: added a stochastic algorithm with high probability second-order convergence and corrected some typos.
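A hedged sketch of the "compete with gradient descent" idea: at each iteration take whichever of the gradient step or the negative-curvature step a local model predicts will decrease the objective more. The decrease estimates below are the standard ones for a function with Lipschitz gradient constant L1 and Lipschitz Hessian constant L2; the paper's actual NCG criterion and step sizes may differ.

import numpy as np

def ncg_compete_step(x, grad, hess, L1=1.0, L2=1.0):
    """Take whichever of the gradient step or the negative-curvature step a
    local model predicts decreases the objective more (illustrative sketch)."""
    eigvals, eigvecs = np.linalg.eigh(hess)            # Lanczos at scale; dense here for clarity
    lam_min, v = eigvals[0], eigvecs[:, 0]
    lam_neg = max(-lam_min, 0.0)
    dec_grad = np.dot(grad, grad) / (2.0 * L1)         # guaranteed decrease of a 1/L1 gradient step
    dec_curv = 2.0 * lam_neg ** 3 / (3.0 * L2 ** 2)    # guaranteed decrease of a curvature step
    if dec_curv > dec_grad:
        if np.dot(v, grad) > 0:                        # orient the direction for descent
            v = -v
        return x + (2.0 * lam_neg / L2) * v            # standard negative-curvature step size
    return x - grad / L1                               # standard gradient descent step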
Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions
Error bound conditions (EBC) are properties that characterize the growth of
an objective function when a point is moved away from the optimal set. They
have recently received increasing attention in the field of optimization for
developing optimization algorithms with fast convergence. However, the studies
of EBC in statistical learning are still limited. The main
contributions of this paper are two-fold. First, we develop fast and
intermediate rates of empirical risk minimization (ERM) under EBC for risk
minimization with Lipschitz continuous convex random functions and with smooth convex random functions.
Second, we establish fast and intermediate rates of an efficient stochastic
approximation (SA) algorithm for risk minimization with Lipschitz continuous
random functions, which requires only one pass over the samples and adapts to
EBC. For both approaches, the convergence rates span a full spectrum between
$O(1/\sqrt{n})$ and $O(1/n)$ depending on the power constant in EBC, and could be even
faster than $O(1/n)$ in special cases for
ERM. Moreover, these convergence rates are automatically adaptive without using
any knowledge of EBC. Overall, this work not only strengthens the understanding
of ERM for statistical learning but also brings new fast stochastic algorithms
for solving a broad range of statistical learning problems.
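For concreteness, error bound conditions of the kind referred to above are commonly written as follows; the exact form, constants, and range of the power constant used in the paper may differ.

% Error bound condition (EBC) with power constant \theta:
% the distance to the optimal set \Omega_* is controlled by the excess risk.
\[
  \mathrm{dist}(\mathbf{w}, \Omega_*) \;\le\; c\,\bigl(F(\mathbf{w}) - F_*\bigr)^{\theta},
  \qquad \forall\, \mathbf{w}: F(\mathbf{w}) \le F_* + \epsilon,
\]
% where F_* is the minimal risk and c > 0 is a constant; a larger \theta yields faster rates.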
Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping
In this paper, we propose a new accelerated stochastic first-order method
called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed
distributed noise in stochastic gradients and derive the first high-probability
complexity bounds for this method, closing the gap in the theory of stochastic
optimization with heavy-tailed noise. Our method is based on a special variant
of accelerated Stochastic Gradient Descent (SGD) and clipping of stochastic
gradients. We extend our method to the strongly convex case and prove new
complexity bounds that outperform state-of-the-art results in this case.
Finally, we extend our proof technique and derive the first non-trivial
high-probability complexity bounds for SGD with clipping without light-tails
assumption on the noise.
Comment: 71 pages, 14 figures.
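The clipping operator at the heart of this approach is simple to state. Below is a minimal sketch of norm clipping inside a plain SGD loop; clipped-SSTM additionally wraps clipping in an accelerated (similar-triangles) scheme with particular batch sizes and step sizes that are not reproduced here, and all names and defaults are illustrative.

import numpy as np

def clip(g, lam):
    """Scale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(x0, stoch_grad, n_steps=1000, gamma=0.01, lam=1.0):
    """Plain SGD with clipped stochastic gradients. Illustrative only:
    clipped-SSTM uses the same clipping inside an accelerated update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * clip(stoch_grad(x), lam)
    return x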
Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions
We develop and analyze stochastic optimization algorithms for problems in
which the expected loss is strongly convex, and the optimum is (approximately)
sparse. Previous approaches are able to exploit only one of these two
structures, yielding an $\mathcal{O}(d/T)$ convergence rate for strongly convex
objectives in $d$ dimensions, and an $\mathcal{O}(\sqrt{(s \log d)/T})$
convergence rate when the optimum is $s$-sparse. Our
algorithm is based on successively solving a series of $\ell_1$-regularized
optimization problems using Nesterov's dual averaging algorithm. We establish
that the error of our solution after $T$ iterations is at most
$\mathcal{O}((s \log d)/T)$, with natural extensions to approximate
sparsity. Our results apply to locally Lipschitz losses including the logistic,
exponential, hinge and least-squares losses. By recourse to statistical minimax
results, we show that our convergence rates are optimal up to multiplicative
constant factors. The effectiveness of our approach is also confirmed in
numerical simulations, in which we compare to several baselines on a
least-squares regression problem.
Comment: 2 figures.
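A hedged sketch of the building block mentioned above: dual averaging with an $\ell_1$ regularizer, whose update is a soft-thresholding step. The multi-stage restarting that the paper layers on top (stages with shrinking parameters) is omitted, and the parameter names and defaults are assumptions.

import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (component-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_dual_averaging(x0, stoch_grad, n_steps=1000, lam=0.1, beta=1.0):
    """One stage of l1-regularized dual averaging; the paper restarts such a
    stage over a sequence of epochs with shrinking parameters (omitted here)."""
    x = np.asarray(x0, dtype=float)
    g_sum = np.zeros_like(x)                        # running sum of stochastic gradients
    for t in range(1, n_steps + 1):
        g_sum += stoch_grad(x)
        # Closed-form minimizer of <g_sum, x> + t*lam*||x||_1 + (beta*sqrt(t)/2)*||x||^2.
        scale = beta * np.sqrt(t)
        x = soft_threshold(-g_sum / scale, t * lam / scale)
    return x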
Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition
In this paper, a new theory is developed for first-order stochastic convex
optimization, showing that the global convergence rate is sufficiently
quantified by a local growth rate of the objective function in a neighborhood
of the optimal solutions. In particular, if the objective function $F(\mathbf{w})$ in the
$\epsilon$-sublevel set grows as fast as $\|\mathbf{w} - \mathbf{w}_*\|_2^{1/\theta}$,
where $\mathbf{w}_*$ represents the closest optimal solution to $\mathbf{w}$ and
$\theta \in (0,1]$ quantifies the local growth rate,
the iteration complexity of first-order stochastic optimization for achieving
an $\epsilon$-optimal solution can be $\widetilde{O}(1/\epsilon^{2(1-\theta)})$,
which is optimal at most up to a logarithmic factor. To achieve the faster
global convergence, we develop two different accelerated stochastic subgradient
methods by iteratively solving the original problem approximately in a local
region around a historical solution with the size of the local region gradually
decreasing as the solution approaches the optimal set. Besides the theoretical
improvements, this work also includes new contributions towards making the
proposed algorithms practical: (i) we present practical variants of accelerated
stochastic subgradient methods that can run without knowledge of the
multiplicative growth constant and even the growth rate $\theta$; (ii) we
consider a broad family of problems in machine learning to demonstrate that the
proposed algorithms enjoy faster convergence than the traditional stochastic
subgradient method. We also characterize the complexity of the proposed
algorithms for ensuring the gradient is small without the smoothness
assumption.
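A hedged sketch of the restarting scheme described above: each stage runs projected stochastic subgradient steps inside a ball around the previous stage's averaged solution, then shrinks the ball and the step size. The stage counts, radii, and halving schedule below are placeholders, not the constants prescribed by the paper's analysis.

import numpy as np

def project_ball(x, center, radius):
    """Euclidean projection onto the ball {z : ||z - center|| <= radius}."""
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + (radius / n) * d

def assg_sketch(x0, stoch_subgrad, n_stages=5, iters_per_stage=1000,
                radius0=1.0, eta0=0.1):
    """Each stage solves the problem approximately in a shrinking region around
    the last stage's output (illustrative constants, not the paper's schedule)."""
    x_bar, radius, eta = np.asarray(x0, dtype=float), radius0, eta0
    for _ in range(n_stages):
        x, x_avg = x_bar.copy(), np.zeros_like(x_bar)
        for t in range(iters_per_stage):
            x = project_ball(x - eta * stoch_subgrad(x), x_bar, radius)
            x_avg += (x - x_avg) / (t + 1)          # running average of the iterates
        x_bar, radius, eta = x_avg, radius / 2.0, eta / 2.0
    return x_bar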
Stochastic Primal-Dual Algorithms with Faster Convergence than O(1/√T) for Problems without Bilinear Structure
Previous studies on stochastic primal-dual algorithms for solving min-max
problems with faster convergence heavily rely on the bilinear structure of the
problem, which restricts their applicability to a narrow range of problems.
The main contribution of this paper is the design and analysis of new
stochastic primal-dual algorithms that use a mixture of stochastic gradient
updates and a logarithmic number of deterministic dual updates for solving a
family of convex-concave problems with no bilinear structure assumed. Faster
convergence rates than $O(1/\sqrt{T})$, with $T$ being the number of stochastic
gradient updates, are established under some mild conditions on the involved
functions of the primal and the dual variables. For example, for a family of
problems that enjoy weak strong convexity in terms of the primal variable and
have a strongly concave function of the dual variable, the convergence rate of
the proposed algorithm is $O(1/T)$. We also investigate the effectiveness of
the proposed algorithms for learning robust models and empirical AUC
maximization.
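One way to picture the update pattern described above, a stream of stochastic primal-dual gradient steps punctuated by a logarithmic number of deterministic dual updates, is the skeleton below; the oracle interface, checkpoint schedule, and step sizes are assumptions for illustration only.

import numpy as np

def primal_dual_sketch(x0, y0, stoch_grad_x, stoch_grad_y, full_grad_y,
                       n_steps=10000, eta_x=0.01, eta_y=0.01):
    """Stochastic gradient descent-ascent on a convex-concave objective, with a
    deterministic dual update at a logarithmic number of checkpoints (sketch;
    not the paper's exact algorithm)."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    checkpoints = {2 ** k for k in range(1, int(np.log2(n_steps)) + 1)}
    for t in range(1, n_steps + 1):
        x = x - eta_x * stoch_grad_x(x, y)          # stochastic primal descent step
        if t in checkpoints:
            y = y + eta_y * full_grad_y(x, y)       # occasional deterministic dual step
        else:
            y = y + eta_y * stoch_grad_y(x, y)      # stochastic dual ascent step
    return x, y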
MixedGrad: An O(1/T) Convergence Rate Algorithm for Stochastic Smooth Optimization
It is well known that the optimal convergence rate for stochastic
optimization of smooth functions is $O(1/\sqrt{T})$, which is the same as
stochastic optimization of Lipschitz continuous convex functions. This is in
contrast to optimizing smooth functions using full gradients, which yields a
convergence rate of $O(1/T^2)$. In this work, we consider a new setup for
optimizing smooth functions, termed {\bf Mixed Optimization}, which allows
access to both a stochastic oracle and a full gradient oracle. Our goal is to
significantly improve the convergence rate of stochastic optimization of smooth
functions by having an additional small number of accesses to the full gradient
oracle. We show that, with $O(\ln T)$ calls to the full gradient oracle and
$O(T)$ calls to the stochastic oracle, the proposed mixed optimization
algorithm is able to achieve an optimization error of $O(1/T)$.
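A hedged sketch of the mixed-oracle idea: an outer loop of full-gradient epochs wrapped around cheap stochastic inner steps that reuse the stored full gradient. The variance-reduced inner update shown here is one natural way to exploit the oracle mixture; it is not necessarily the exact MixedGrad recursion, and the stoch_grad(x, i) interface is an assumption.

import numpy as np

def mixed_oracle_sketch(x0, full_grad, stoch_grad, n_epochs=10, n_inner=1000, eta=0.1):
    """Uses O(n_epochs) full-gradient calls and O(n_epochs * n_inner) stochastic
    calls (illustration of the oracle mixture, not MixedGrad itself)."""
    rng = np.random.default_rng(0)
    x_ref = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):                       # one full-gradient call per epoch
        mu = full_grad(x_ref)
        x = x_ref.copy()
        for _ in range(n_inner):                    # cheap stochastic steps in between
            i = int(rng.integers(1_000_000))        # index/seed of the sampled component
            g = stoch_grad(x, i) - stoch_grad(x_ref, i) + mu
            x = x - eta * g
        x_ref = x
    return x_ref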
Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the O(1/T) Convergence Rate
Stochastic approximation (SA) is a classical approach for stochastic convex
optimization. Previous studies have demonstrated that the convergence rate of
SA can be improved by introducing either smoothness or strong convexity
condition. In this paper, we make use of smoothness and strong convexity
simultaneously to boost the convergence rate. Let $\lambda$ be the modulus of
strong convexity, $\kappa$ be the condition number, $F_*$ be the minimal risk,
and $\alpha > 1$ be some small constant. First, we demonstrate that, in
expectation, an $O\bigl(1/(\lambda T^{\alpha}) + \kappa F_*/T\bigr)$ risk bound is
attainable when $T = \Omega(\kappa^{\alpha})$. Thus, when $F_*$ is small, the
convergence rate could be faster than $O(1/(\lambda T))$ and approaches
$O(1/(\lambda T^{\alpha}))$ in the ideal case. Second, to further benefit from
small risk, we show that, in expectation, an $O(1/2^{T/\kappa} + F_*)$ risk bound
is achievable. Thus, the excess risk reduces exponentially until reaching
$O(F_*)$, and if $F_* = 0$, we obtain a global linear convergence. Finally, we
emphasize that our proof is constructive and each risk bound is equipped with
an efficient stochastic algorithm attaining that bound.
Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity
We develop model-based methods for solving stochastic convex optimization
problems, introducing the approximate-proximal point, or aProx, family, which
includes stochastic subgradient, proximal point, and bundle methods. When the
modeling approaches we propose are appropriately accurate, the methods enjoy
stronger convergence and robustness guarantees than classical approaches, even
though the model-based methods typically add little to no computational
overhead over stochastic subgradient methods. For example, we show that
improved models converge with probability 1 and enjoy optimal asymptotic
normality results under weak assumptions; these methods are also adaptive to a
natural class of what we term easy optimization problems, achieving linear
convergence under appropriate strong growth conditions on the objective. Our
substantial experimental investigation shows the advantages of more accurate
modeling over standard subgradient methods across many smooth and non-smooth
optimization problems.
Comment: To appear in SIAM Journal on Optimization.
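A hedged sketch of one member of the aProx family discussed above: the truncated model, which keeps the linearized model of a nonnegative loss from dropping below zero (or another known lower bound) and therefore never steps past the point where that model is exhausted. Names and the step-size parameter alpha are illustrative.

import numpy as np

def truncated_model_step(x, loss_val, subgrad, alpha, lower_bound=0.0):
    """One truncated-model step in the spirit of the aProx family (sketch):
    minimize the linearization truncated at a known lower bound on the loss,
    plus a proximal term with step size alpha."""
    g = np.asarray(subgrad, dtype=float)
    gnorm2 = float(g @ g)
    if gnorm2 == 0.0:
        return x                                    # the sampled loss is already minimal
    # Never step past the point where the linear model reaches the lower bound.
    step = min(alpha, (loss_val - lower_bound) / gnorm2)
    return x - step * g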