657 research outputs found
Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions
We provide the first non-asymptotic analysis for finding stationary points of
nonsmooth, nonconvex functions. In particular, we study the class of Hadamard
semi-differentiable functions, perhaps the largest class of nonsmooth functions
for which the chain rule of calculus holds. This class contains examples such
as ReLU neural networks and others with non-differentiable activation
functions. We first show that finding an $\epsilon$-stationary point with
first-order methods is impossible in finite time. We then introduce the notion
of $(\delta, \epsilon)$-stationarity, which allows for an
$\epsilon$-approximate gradient to be the convex combination of generalized
gradients evaluated at points within distance $\delta$ of the solution. We
propose a series of randomized first-order methods and analyze their complexity
of finding a $(\delta, \epsilon)$-stationary point. Furthermore, we provide a
lower bound and show that our stochastic algorithm has min-max optimal
dependence on $\delta$. Empirically, our methods perform well for training ReLU
neural networks.
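To make the $(\delta, \epsilon)$-stationarity idea concrete, here is a minimal sketch (not the authors' algorithm) of one randomized first-order step that averages generalized gradients sampled within a $\delta$-ball; the toy capped-$\ell_1$ objective and the names `subgrad`, `randomized_step`, `delta`, and `num_samples` are illustrative assumptions.

```python
import numpy as np

# Toy nonsmooth, nonconvex objective (capped-ell_1); stands in for a ReLU-type loss.
def f(x):
    return np.sum(np.minimum(np.abs(x), 1.0))

def subgrad(x):
    # One valid generalized (Clarke) subgradient at x.
    return np.where(np.abs(x) < 1.0, np.sign(x), 0.0)

def randomized_step(x, delta=0.1, step=0.05, num_samples=20, rng=None):
    # Average generalized gradients sampled within a delta-ball around x (a convex
    # combination with uniform weights) and move along the negative of that average.
    rng = np.random.default_rng() if rng is None else rng
    dirs = rng.normal(size=(num_samples, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = delta * rng.uniform(size=(num_samples, 1))
    g = np.mean([subgrad(x + r * d) for r, d in zip(radii, dirs)], axis=0)
    return x - step * g, np.linalg.norm(g)

x = np.array([0.4, -0.3, 2.0])
for _ in range(100):
    x, gnorm = randomized_step(x)
print(x, gnorm)  # a small gnorm certifies an approximate (delta, epsilon)-stationary point
```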
Subgradient Descent Learns Orthogonal Dictionaries
This paper concerns dictionary learning, i.e., sparse coding, a fundamental
representation learning problem. We show that a subgradient descent algorithm,
with random initialization, can provably recover orthogonal dictionaries on a
natural nonsmooth, nonconvex minimization formulation of the problem,
under mild statistical assumptions on the data. This is in contrast to previous
provable methods that require either expensive computation or delicate
initialization schemes. Our analysis develops several tools for characterizing
landscapes of nonsmooth functions, which might be of independent interest for
provable training of deep networks with nonsmooth activations (e.g., ReLU),
among numerous other applications. Preliminary experiments corroborate our
analysis and show that our algorithm works well empirically in recovering
orthogonal dictionaries.
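The following is a minimal sketch of the kind of nonsmooth, nonconvex formulation and subgradient method the abstract refers to: recovering one column of an orthogonal dictionary by Riemannian subgradient descent on the sphere for an $\ell_1$ objective. The problem sizes, sparsity level, and step-size schedule below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Recover one dictionary column by minimizing f(q) = (1/m) * ||Y^T q||_1 over the
# unit sphere with subgradient descent from a random start.
rng = np.random.default_rng(0)
n, m, theta = 30, 5000, 0.1           # dimension, samples, sparsity level
A = np.linalg.qr(rng.normal(size=(n, n)))[0]                      # orthogonal dictionary
X = rng.normal(size=(n, m)) * (rng.uniform(size=(n, m)) < theta)  # sparse codes
Y = A @ X                              # observed data

q = rng.normal(size=n)
q /= np.linalg.norm(q)
for k in range(500):
    g = Y @ np.sign(Y.T @ q) / m       # Euclidean subgradient of f at q
    g_riem = g - (g @ q) * q           # project onto the tangent space of the sphere
    q -= (0.1 / np.sqrt(k + 1)) * g_riem
    q /= np.linalg.norm(q)             # retract back to the sphere

# If the method succeeds, q aligns with one column of A (up to sign).
print(np.max(np.abs(A.T @ q)))
```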
Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems
In this paper, we introduce a stochastic projected subgradient method for
weakly convex (i.e., uniformly prox-regular) nonsmooth, nonconvex functions---a
wide class of functions which includes the additive and convex composite
classes. At a high level, the method is an inexact proximal point iteration in
which the strongly convex proximal subproblems are quickly solved with a
specialized stochastic projected subgradient method. The primary contribution
of this paper is a simple proof that the proposed algorithm converges at the
same rate as the stochastic gradient method for smooth nonconvex problems. This
result appears to be the first convergence rate analysis of a stochastic (or
even deterministic) subgradient method for the class of weakly convex
functions.
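As a rough illustration of the proximally guided structure described above (an inexact proximal point loop whose strongly convex subproblems are handled by a stochastic subgradient solver), here is a hedged sketch on a toy $\ell_1$ regression objective; `rho`, the step sizes, and the iteration counts are illustrative, and the paper's method includes further ingredients (e.g., projections and iterate averaging) that are omitted here.

```python
import numpy as np

# Each outer step approximately minimizes the strongly convex proximal subproblem
# f(x) + (rho/2)*||x - center||^2 with a few stochastic subgradient steps.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
b = rng.normal(size=200)

def stoch_subgrad(x, batch=10):
    # Stochastic subgradient of f(x) = (1/m) * sum_i |a_i^T x - b_i|.
    idx = rng.integers(0, A.shape[0], size=batch)
    return A[idx].T @ np.sign(A[idx] @ x - b[idx]) / batch

x = np.zeros(10)
rho = 1.0
for t in range(50):                      # outer proximal-point iterations
    center, y = x.copy(), x.copy()
    for j in range(100):                 # inner stochastic subgradient solver
        g = stoch_subgrad(y) + rho * (y - center)   # subgradient of the subproblem
        y -= (1.0 / (rho * (j + 2))) * g            # step suited to strong convexity
    x = y
print(np.linalg.norm(A @ x - b, 1) / 200)
```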
Proximal Gradient Method for Nonsmooth Optimization over the Stiefel Manifold
We consider optimization problems over the Stiefel manifold whose objective
function is the summation of a smooth function and a nonsmooth function.
Existing methods for solving this kind of problem can be classified into three
classes. Algorithms in the first class rely on information of the subgradients
of the objective function and thus tend to converge slowly in practice.
Algorithms in the second class are proximal point algorithms, which involve
subproblems that can be as difficult as the original problem. Algorithms in the
third class are based on operator-splitting techniques, but they usually lack
rigorous convergence guarantees. In this paper, we propose a retraction-based
proximal gradient method for solving this class of problems. We prove that the
proposed method globally converges to a stationary point. Iteration complexity
for obtaining an $\epsilon$-stationary solution is also analyzed. Numerical
results on solving sparse PCA and compressed modes problems are reported to
demonstrate the advantages of the proposed method.
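Below is a simplified sketch of a retraction-based proximal gradient iteration for the sparse PCA instance mentioned above: a gradient step on the smooth term, a soft-thresholding (proximal) step, and a polar retraction back to the Stiefel manifold. This is a simplification; the paper's method solves the proximal subproblem restricted to the tangent space, and the sizes, `mu`, and step size here are illustrative assumptions.

```python
import numpy as np

# Sparse PCA on the Stiefel manifold: min -tr(X^T S X) + mu*||X||_1  s.t.  X^T X = I.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
S = A.T @ A                      # sample covariance (up to scaling)
n, r, mu, step = 20, 3, 0.5, 1e-2

def polar_retraction(Y):
    # Nearest point on the Stiefel manifold via the polar decomposition.
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt

X = polar_retraction(rng.normal(size=(n, r)))
for _ in range(500):
    G = -2.0 * S @ X                                          # gradient of the smooth term
    Y = X - step * G
    Y = np.sign(Y) * np.maximum(np.abs(Y) - step * mu, 0.0)   # prox of mu*||.||_1
    X = polar_retraction(Y)

print(np.count_nonzero(np.abs(X) > 1e-6), "nonzeros;",
      "feasibility error:", np.linalg.norm(X.T @ X - np.eye(r)))
```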
Complexity of finding near-stationary points of convex functions stochastically
In a recent paper, we showed that the stochastic subgradient method applied
to a weakly convex problem drives the gradient of the Moreau envelope to zero
at the rate $O(k^{-1/4})$. In this supplementary note, we present a stochastic
subgradient method for minimizing a convex function with an improved rate.
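For context, the near-stationarity measure referenced here is the gradient of the Moreau envelope; the short sketch below computes it for the illustrative choice f(x) = ||x||_1, whose proximal map is soft-thresholding (`lam` is an arbitrary smoothing parameter, not a value from the note).

```python
import numpy as np

# The Moreau envelope f_lam(x) = min_y f(y) + (1/(2*lam))*||y - x||^2 has gradient
# grad f_lam(x) = (x - prox_{lam f}(x)) / lam; a small gradient certifies that x is
# near a point with a small subgradient.
def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_grad(x, lam=0.5):
    return (x - prox_l1(x, lam)) / lam

x = np.array([0.3, -2.0, 0.05])
print(moreau_grad(x))   # entries with |x_i| <= lam give x_i/lam; others give sign(x_i)
```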
SpiderBoost and Momentum: Faster Stochastic Variance Reduction Algorithms
SARAH and SPIDER are two recently developed stochastic variance-reduced
algorithms, and SPIDER has been shown to achieve a near-optimal first-order
oracle complexity in smooth nonconvex optimization. However, SPIDER uses an
accuracy-dependent stepsize that slows down the convergence in practice, and
cannot handle objective functions that involve nonsmooth regularizers. In this
paper, we propose SpiderBoost as an improved scheme, which allows the use of a much
larger constant-level stepsize while maintaining the same near-optimal oracle
complexity, and can be extended with proximal mapping to handle composite
optimization (which is nonsmooth and nonconvex) with provable convergence
guarantee. In particular, we show that proximal SpiderBoost achieves this
near-optimal oracle complexity in composite nonconvex optimization, improving
on the best previously known result. We further develop a novel momentum scheme
to accelerate SpiderBoost for composite optimization, which achieves the
near-optimal oracle complexity in theory and substantial improvement in
experiments.
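A schematic sketch of a proximal SPIDER/SpiderBoost-style loop is given below: a full gradient every `q` iterations, a recursive variance-reduced estimator in between, and a proximal step with a constant step size. The quadratic losses, the $\ell_1$ regularizer, and the values of `q`, `eta`, and `mu` are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# min F(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + mu*||x||_1
rng = np.random.default_rng(0)
n, d, mu, eta, q = 500, 20, 0.05, 0.1, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_batch(x, idx):
    # Gradient of the smooth part averaged over the indices in idx.
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.zeros(d)
x_prev, v = x.copy(), grad_batch(x, np.arange(n))
for k in range(200):
    if k % q == 0:
        v = grad_batch(x, np.arange(n))                        # periodic full gradient
    else:
        idx = rng.integers(0, n, size=10)                      # small minibatch
        v = grad_batch(x, idx) - grad_batch(x_prev, idx) + v   # recursive estimator
    x_prev, x = x, prox_l1(x - eta * v, eta * mu)              # proximal step, constant eta
print(np.count_nonzero(np.abs(x) > 1e-8), "nonzero coordinates")
```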
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
We introduce a generic scheme to solve nonconvex optimization problems using
gradient-based algorithms originally designed for minimizing convex functions.
Even though these methods may originally require convexity to operate, the
proposed approach allows one to use them on weakly convex objectives, which
covers a large class of non-convex functions typically appearing in machine
learning and signal processing. In general, the scheme is guaranteed to produce
a stationary point with a worst-case efficiency typical of first-order methods,
and when the objective turns out to be convex, it automatically accelerates in
the sense of Nesterov and achieves a near-optimal convergence rate in function
values. These properties are achieved without assuming any knowledge about the
convexity of the objective, by automatically adapting to the unknown weak
convexity constant. We conclude the paper by showing promising experimental
results obtained by applying our approach to incremental algorithms such as
SVRG and SAGA for sparse matrix factorization and for learning neural networks.
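The sketch below shows the bare-bones outer loop of such a Catalyst-style scheme: approximately minimize the regularized surrogate F(x) + (kappa/2)*||x - y||^2 with an inner first-order solver, then extrapolate. The toy objective, the fixed extrapolation parameter, and the plain gradient-descent inner solver are illustrative assumptions; the actual scheme adapts these quantities and works with generic convex solvers.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
b = rng.normal(size=100)
kappa, alpha = 1.0, 0.9

def grad_F(x):
    # Gradient of the smooth toy objective F(x) = 0.5*||Ax - b||^2 / m.
    return A.T @ (A @ x - b) / A.shape[0]

def solve_subproblem(y, iters=50, lr=0.05):
    # Inner solver: gradient descent on the strongly convex surrogate
    # F(x) + (kappa/2)*||x - y||^2, warm-started at y.
    x = y.copy()
    for _ in range(iters):
        x -= lr * (grad_F(x) + kappa * (x - y))
    return x

x = np.zeros(20)
y = x.copy()
for t in range(30):
    x_new = solve_subproblem(y)
    y = x_new + alpha * (x_new - x)   # Nesterov-type extrapolation
    x = x_new
print(0.5 * np.linalg.norm(A @ x - b) ** 2 / 100)
```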
A Smoothing SQP Framework for a Class of Composite $L_q$ Minimization over Polyhedron
The composite $L_q$ ($0<q<1$) minimization problem over a general polyhedron
has received various applications in machine learning, wireless communications,
image restoration, signal reconstruction, etc. This paper aims to provide a
theoretical study on this problem. Firstly, we show that for any fixed $q \in (0,1)$,
finding the global minimizer of the problem, even its unconstrained
counterpart, is strongly NP-hard. Secondly, we derive Karush-Kuhn-Tucker (KKT)
optimality conditions for local minimizers of the problem. Thirdly, we propose
a smoothing sequential quadratic programming framework for solving this
problem. The framework requires an (approximate) solution of a convex quadratic
program at each iteration. Finally, we analyze the worst-case iteration
complexity of the framework for returning an $\epsilon$-KKT point; i.e., a
feasible point that satisfies a perturbed version of the derived KKT optimality
conditions. To the best of our knowledge, the proposed framework is the first
one with a worst-case iteration complexity guarantee for solving composite
$L_q$ minimization over a general polyhedron.
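As a simplified illustration of the smoothing idea (not the full SQP framework), the sketch below replaces $|x_i|^q$ with the smooth surrogate $(x_i^2 + \mu^2)^{q/2}$, solves the smoothed problem over a box (a simple polyhedron, so projection is trivial) by projected gradient descent, and drives $\mu$ to zero; the value of $q$, the $\mu$ schedule, and the step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 15))
b = rng.normal(size=40)
q, lam = 0.5, 0.1

def smoothed_obj_grad(x, mu):
    # Objective 0.5*||Ax - b||^2/m + lam * sum_i (x_i^2 + mu^2)^(q/2) and its gradient.
    r = A @ x - b
    s = (x ** 2 + mu ** 2) ** (q / 2)
    grad = A.T @ r / len(b) + lam * q * x * (x ** 2 + mu ** 2) ** (q / 2 - 1)
    return 0.5 * r @ r / len(b) + lam * np.sum(s), grad

x = np.zeros(15)
mu = 1.0
for outer in range(8):
    for _ in range(200):
        _, g = smoothed_obj_grad(x, mu)
        x = np.clip(x - 0.05 * g, -1.0, 1.0)   # projection onto the box [-1, 1]^n
    mu *= 0.5                                   # tighten the smoothing
print(np.round(x, 3))
```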
Graphical Convergence of Subgradients in Nonconvex Optimization and Learning
We investigate the stochastic optimization problem of minimizing population
risk, where the loss defining the risk is assumed to be weakly convex.
Compositions of Lipschitz convex functions with smooth maps are the primary
examples of such losses. We analyze the estimation quality of such nonsmooth
and nonconvex problems by their sample average approximations. Our main results
establish dimension-dependent rates on subgradient estimation in full
generality and dimension-independent rates when the loss is a generalized
linear model. As an application of the developed techniques, we analyze the
nonsmooth landscape of a robust nonlinear regression problem.
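A small numerical illustration of the sample average approximation viewpoint is sketched below: the subgradient of an empirical risk stabilizes as the sample size grows, here for the toy population risk E|a^T x - b| with Gaussian data, an illustrative stand-in for the weakly convex losses studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, x = 5, np.ones(5)

def empirical_subgrad(m):
    # Subgradient of the sample average (1/m) * sum_i |a_i^T x - b_i| at the fixed x.
    a = rng.normal(size=(m, d))
    b = rng.normal(size=m)
    return a.T @ np.sign(a @ x - b) / m

for m in [100, 10_000, 1_000_000]:
    print(m, np.round(empirical_subgrad(m), 3))
# As m grows, the estimates stabilize, illustrating convergence of SAA subgradients.
```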
Asynchronous Parallel Algorithms for Nonconvex Optimization
We propose a new asynchronous parallel block-descent algorithmic framework
for the minimization of the sum of a smooth nonconvex function and a nonsmooth
convex one, subject to both convex and nonconvex constraints. The proposed
framework hinges on successive convex approximation techniques and a novel
probabilistic model that captures key elements of modern computational
architectures and asynchronous implementations in a more faithful way than
current state-of-the-art models. Other key features of the framework are: i) it
covers in a unified way several specific solution methods; ii) it accommodates
a variety of possible parallel computing architectures; and iii) it can deal
with nonconvex constraints. Almost sure convergence to stationary solutions is
proved, and theoretical complexity results are provided, showing nearly ideal
linear speedup when the number of workers is not too large. This is the first
part of a two-paper work; the second part can be found at arXiv:1701.0490.
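The sketch below is a toy simulation of the asynchronous block-update pattern described above: several workers repeatedly pick a random block, compute an update from a possibly stale copy of the shared iterate, and write it back without synchronization. The quadratic objective, block size, step size, and worker count are illustrative assumptions; the paper's framework is considerably more general (nonsmooth terms, nonconvex constraints, and a refined probabilistic model of delays).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
d, block = 40, 5
A = rng.normal(size=(200, d))
b = rng.normal(size=200)
x = np.zeros(d)                      # shared iterate, updated block-wise by workers

def worker(num_updates, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(num_updates):
        idx = local_rng.choice(d, size=block, replace=False)
        x_stale = x.copy()           # read a (possibly stale) copy of the iterate
        g = A[:, idx].T @ (A @ x_stale - b) / len(b)   # partial gradient for the block
        x[idx] = x_stale[idx] - 0.05 * g               # unsynchronized block write

with ThreadPoolExecutor(max_workers=4) as pool:
    for s in range(4):
        pool.submit(worker, 500, s)
print(0.5 * np.linalg.norm(A @ x - b) ** 2 / 200)
```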