491 research outputs found
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
We introduce a generic scheme to solve nonconvex optimization problems using
gradient-based algorithms originally designed for minimizing convex functions.
Even though these methods may originally require convexity to operate, the
proposed approach allows one to use them on weakly convex objectives, which
covers a large class of non-convex functions typically appearing in machine
learning and signal processing. In general, the scheme is guaranteed to produce
a stationary point with a worst-case efficiency typical of first-order methods,
and when the objective turns out to be convex, it automatically accelerates in
the sense of Nesterov and achieves a near-optimal convergence rate in function
values. These properties are achieved without assuming any knowledge about the
convexity of the objective, by automatically adapting to the unknown weak
convexity constant. We conclude the paper by showing promising experimental
results obtained by applying our approach to incremental algorithms such as
SVRG and SAGA for sparse matrix factorization and for learning neural networks.
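To make the scheme concrete, here is a minimal sketch of a Catalyst-style outer loop, assuming a hypothetical inner solver `solve_subproblem` (playing the role of SVRG or SAGA) and a fixed smoothing parameter `kappa`; the paper's automatic adaptation to the unknown weak convexity constant is not reproduced here.

```python
import numpy as np

def catalyst_outer_loop(grad_f, solve_subproblem, x0, kappa=1.0, n_outer=50):
    """Sketch of a Catalyst-style outer loop for a (weakly convex) objective f.

    At each outer iteration the inner solver minimizes the proximal-point
    subproblem  f(x) + (kappa / 2) * ||x - y||^2,  which is convex whenever
    kappa exceeds the (unknown) weak convexity constant of f.
    `solve_subproblem(grad_h, x_init)` is a hypothetical inner solver
    (e.g. SVRG or SAGA) returning an approximate minimizer of h.
    """
    x, y = x0.copy(), x0.copy()
    alpha = 1.0
    for _ in range(n_outer):
        # gradient of the regularized subproblem h(x) = f(x) + kappa/2 ||x - y||^2
        grad_h = lambda x_, y_=y: grad_f(x_) + kappa * (x_ - y_)
        x_new = solve_subproblem(grad_h, x)
        # Nesterov-style extrapolation of the prox center (simplified schedule)
        alpha_new = 0.5 * (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2)
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
        y = x_new + beta * (x_new - x)
        x, alpha = x_new, alpha_new
    return x
```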
Generalization Error Bounds with Probabilistic Guarantee for SGD in Nonconvex Optimization
The success of deep learning has led to a rising interest in the
generalization property of the stochastic gradient descent (SGD) method, and
stability is one popular approach to study it. Existing works based on
stability have studied nonconvex loss functions, but have only considered the generalization error of SGD in expectation. In this paper, we establish various generalization error bounds with probabilistic guarantees for SGD.
Specifically, for both general nonconvex loss functions and gradient dominant
loss functions, we characterize the on-average stability of the iterates
generated by SGD in terms of the on-average variance of the stochastic
gradients. Such characterization leads to improved bounds for the
generalization error for SGD. We then study the regularized risk minimization
problem with strongly convex regularizers, and obtain improved generalization
error bounds for proximal SGD. With strongly convex regularizers, we further
establish the generalization error bounds for nonconvex loss functions under
proximal SGD with a high-probability guarantee, i.e., exponential concentration in probability.
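For context, below is a minimal sketch of the proximal SGD iteration whose stability and generalization are studied above, assuming an l2 (strongly convex) regularizer and an illustrative decaying step size; `grad_loss` is a hypothetical per-example gradient oracle, not the paper's notation.

```python
import numpy as np

def proximal_sgd(grad_loss, x0, n_samples, lam=0.1, n_iters=1000, step0=0.1, rng=None):
    """Sketch of proximal SGD for  F(x) = (1/n) sum_i f_i(x) + (lam/2)||x||^2.

    Each step takes a stochastic gradient of the (possibly nonconvex) loss on a
    single example and then applies the proximal map of the strongly convex
    l2 regularizer, which here reduces to a simple shrinkage.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for t in range(1, n_iters + 1):
        i = rng.integers(n_samples)          # sample one data point
        eta = step0 / np.sqrt(t)             # illustrative decaying step size
        g = grad_loss(x, i)                  # stochastic gradient of the loss
        # prox of eta*(lam/2)||.||^2 applied to the gradient step
        x = (x - eta * g) / (1.0 + eta * lam)
    return x
```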
Conditional gradient type methods for composite nonlinear and stochastic optimization
In this paper, we present a conditional gradient type (CGT) method for
solving a class of composite optimization problems where the objective function
consists of a (weakly) smooth term and a (strongly) convex regularization term.
While including a strongly convex term in the subproblems of the classical conditional gradient (CG) method improves its rate of convergence, its per-iteration cost does not grow as large as that of general proximal-type algorithms. More specifically, we present a unified analysis for the CGT method in the sense
that it achieves the best-known rate of convergence when the weakly smooth term
is nonconvex and possesses (nearly) optimal complexity if it turns out to be
convex. While implementation of the CGT method requires explicitly estimating
problem parameters like the level of smoothness of the first term in the
objective function, we also present a few variants of this method that relax such estimation. Unlike general parameter-free proximal-type methods, these variants of the CGT method do not require any additional effort for computing
(sub)gradients of the objective function and/or solving extra subproblems at
each iteration. We then generalize these methods to the stochastic setting and present a few new complexity results. To the best of our knowledge, this is the first time that such complexity results have been presented for solving stochastic weakly smooth nonconvex and (strongly) convex optimization problems.
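A minimal sketch of a conditional-gradient-type step of the kind described above, assuming a hypothetical oracle `solve_linear_plus_reg` for the linear-minimization subproblem augmented with the convex regularizer, and the standard open-loop step size; the paper's parameter-free and stochastic variants are omitted.

```python
def cgt_method(grad_f, solve_linear_plus_reg, x0, n_iters=100):
    """Sketch of a conditional-gradient-type (CGT) iteration.

    `solve_linear_plus_reg(g)` is assumed to return
        argmin_{u in X}  <g, u> + h(u),
    i.e. the classical linear-minimization subproblem augmented with the
    (strongly) convex regularizer h.
    """
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)                          # gradient of the (weakly) smooth term
        u = solve_linear_plus_reg(g)           # regularized linear-minimization oracle
        gamma = 2.0 / (k + 2)                  # classical conditional-gradient step size
        x = (1 - gamma) * x + gamma * u        # convex combination keeps x feasible
    return x
```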
Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms
L1-regularized models are widely used for sparse regression or classification tasks. In this paper, we propose the orthant-wise passive descent algorithm (OPDA) for optimizing L1-regularized models, as an improved substitute for proximal algorithms, which are currently the standard tools for optimizing such models. OPDA uses a stochastic variance-reduced gradient (SVRG) to initialize the descent direction, then applies a novel alignment operator that encourages each element to keep the same sign after one update, so that the parameter vector remains in the same orthant as before. It also explicitly suppresses the magnitude of each element to impose sparsity.
A quasi-Newton update can be utilized to incorporate curvature information and accelerate convergence. We prove a linear convergence rate for OPDA on
general smooth and strongly convex loss functions. By conducting experiments on L1-regularized logistic regression and convolutional neural networks, we show that OPDA outperforms state-of-the-art stochastic proximal algorithms, implying a wide range of applications in training sparse models.
Comment: Accepted to The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), February 2018, New Orleans.
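The following is a simplified sketch of the orthant-wise idea described above, assuming soft-thresholding as the magnitude-suppression step and a sign-based alignment operator; function names are illustrative, and the paper's quasi-Newton scaling and pseudo-gradient handling at zero coordinates are omitted.

```python
import numpy as np

def align_to_orthant(x_new, x_old):
    """Illustrative orthant-wise alignment: coordinates whose sign flips during
    the update are set to zero so the iterate stays in the previous orthant."""
    keep = np.sign(x_new) == np.sign(x_old)
    return np.where(keep | (x_old == 0), x_new, 0.0)

def opda_like_step(x, svrg_direction, lam, eta=0.1):
    """Sketch of a single OPDA-style step on an L1-regularized objective:
    take an SVRG-based step, shrink magnitudes by the l1 penalty, then align
    the result with the previous orthant."""
    x_half = x - eta * svrg_direction                                          # SVRG-based descent step
    x_shrunk = np.sign(x_half) * np.maximum(np.abs(x_half) - eta * lam, 0.0)   # soft-threshold
    return align_to_orthant(x_shrunk, x)                                       # stay in the same orthant
```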
Stochastic Variance Reduction Gradient for a Non-convex Problem Using Graduated Optimization
In machine learning, nonconvex optimization problems with multiple local optima are often encountered. The Graduated Optimization Algorithm (GOA) is a popular heuristic for obtaining global optima of nonconvex problems by progressively minimizing a series of increasingly accurate convex approximations to the nonconvex problem. Recently, GradOpt, an algorithm based on GOA, was proposed with impressive theoretical and experimental results, but it mainly addresses problems consisting of a single nonconvex part. This paper aims to find the global solution of a nonconvex objective composed of a convex part plus a nonconvex part, based on GOA. By gradually approximating the nonconvex part of the problem and minimizing the approximations with the Stochastic Variance Reduced Gradient (SVRG) method or proximal SVRG, two new algorithms, SVRG-GOA and PSVRG-GOA, are proposed. We prove that the new algorithms have lower iteration complexity than GradOpt. Some techniques, such as enlarging the shrink factor and using a projection step, stochastic gradients, and mini-batches, are also given to accelerate the convergence of the proposed algorithms. Experimental results illustrate that the new algorithms achieve similar performance, converging to 'global' optima of the nonconvex problems, and that they converge faster than GradOpt and nonconvex proximal SVRG.
Comment: 15 pages, 5 figures.
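A minimal sketch of the graduated-optimization outer loop described above, assuming a hypothetical oracle `smoothed_grad(x, delta)` for the gradient of a delta-smoothed approximation of the nonconvex part and a generic `svrg_solver`; the shrink schedule and names are illustrative, not the paper's.

```python
def svrg_goa(smoothed_grad, svrg_solver, x0, delta0=1.0, shrink=0.5, n_stages=10):
    """Sketch of a graduated-optimization outer loop in the spirit of SVRG-GOA.

    `smoothed_grad(x, delta)` approximates the gradient of a delta-smoothed
    (convexified) version of the nonconvex part; `svrg_solver(grad, x_init)`
    approximately minimizes the current approximation with SVRG, warm-started
    at the previous solution. The smoothing level delta is shrunk each stage
    so the approximations approach the original nonconvex objective.
    """
    x, delta = x0.copy(), delta0
    for _ in range(n_stages):
        grad = lambda x_, d=delta: smoothed_grad(x_, d)
        x = svrg_solver(grad, x)        # minimize the current convex approximation
        delta *= shrink                 # make the approximation more accurate
    return x
```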
On Quadratic Convergence of DC Proximal Newton Algorithm for Nonconvex Sparse Learning in High Dimensions
We propose a DC proximal Newton algorithm for solving nonconvex regularized
sparse learning problems in high dimensions. Our proposed algorithm integrates
the proximal Newton algorithm with multi-stage convex relaxation based on the
difference of convex (DC) programming, and enjoys both strong computational and
statistical guarantees. Specifically, by leveraging a sophisticated
characterization of sparse modeling structures/assumptions (i.e., local
restricted strong convexity and Hessian smoothness), we prove that within each
stage of convex relaxation, our proposed algorithm achieves (local) quadratic
convergence, and eventually obtains a sparse approximate local optimum with
optimal statistical properties after only a few convex relaxations. Numerical
experiments are provided to support our theory.
Comment: 36 pages, 5 figures, 1 table. Accepted at NIPS 2017.
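A minimal sketch of the multi-stage convex relaxation loop, assuming the nonconvex regularizer is written as a difference of convex functions lam*||x||_1 - Q(x) and that a hypothetical `prox_newton_solver` handles each weighted-l1 convex subproblem; names are illustrative.

```python
import numpy as np

def dc_multistage(prox_newton_solver, reg_grad_concave, x0, lam=0.1, n_stages=5):
    """Sketch of multi-stage convex relaxation for a nonconvex regularizer
    written as a difference of convex functions, R(x) = lam*||x||_1 - Q(x).

    At each stage the concave part -Q is linearized at the current iterate,
    which yields a weighted-l1 convex subproblem; `prox_newton_solver(weights,
    x_init)` is a hypothetical proximal Newton solver for that subproblem, and
    `reg_grad_concave(x)` returns dQ/dx elementwise.
    """
    x = x0.copy()
    for _ in range(n_stages):
        # linearizing -Q gives per-coordinate l1 weights  lam - Q'(x_j) >= 0
        weights = np.maximum(lam - reg_grad_concave(x), 0.0)
        x = prox_newton_solver(weights, x)   # solve the convex relaxation to high accuracy
    return x
```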
Local Convergence of the Heavy-ball Method and iPiano for Non-convex Optimization
A local convergence result for abstract descent methods is proved. The
sequence of iterates is attracted by a local (or global) minimum, stays in its
neighborhood and converges within this neighborhood. This result allows
algorithms to exploit local properties of the objective function. In
particular, the abstract theory in this paper applies to the inertial
forward--backward splitting method: iPiano---a generalization of the Heavy-ball
method. Moreover, it reveals an equivalence between iPiano and inertial
averaged/alternating proximal minimization and projection methods. Key for this
equivalence is the attraction to a local minimum within a neighborhood and the
fact that, for a prox-regular function, the gradient of the Moreau envelope is
locally Lipschitz continuous and expressible in terms of the proximal mapping.
In a numerical feasibility problem, the inertial alternating projection method
significantly outperforms its non-inertial variants.
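For reference, a minimal sketch of the iPiano / inertial forward-backward iteration discussed above, here instantiated with an l1 term as an illustrative nonsmooth part g and constant step-size and inertia parameters.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau*||.||_1 (used here as an example nonsmooth term g)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ipiano(grad_f, x0, alpha=0.01, beta=0.7, lam=0.1, n_iters=500):
    """Sketch of the iPiano / inertial forward-backward iteration
        x_{k+1} = prox_{alpha*g}( x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1}) ),
    with g = lam*||.||_1 as an illustrative nonsmooth term. Setting beta = 0
    recovers plain forward-backward splitting, and a smooth g recovers the
    Heavy-ball method.
    """
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iters):
        inertial = x - alpha * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, soft_threshold(inertial, alpha * lam)
    return x
```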
Distributed Big-Data Optimization via Block-Iterative Convexification and Averaging
In this paper, we study distributed big-data nonconvex optimization in
multi-agent networks. We consider the (constrained) minimization of the sum of
a smooth (possibly) nonconvex function, i.e., the agents' sum-utility, plus a
convex (possibly) nonsmooth regularizer. Our interest is in big-data problems
wherein there is a large number of variables to optimize. If treated by means
of standard distributed optimization algorithms, these large-scale problems may
be intractable, due to the prohibitive local computation and communication
burden at each node. We propose a novel distributed solution method whereby at
each iteration agents optimize and then communicate (in an uncoordinated
fashion) only a subset of their decision variables. To deal with non-convexity
of the cost function, the novel scheme hinges on Successive Convex
Approximation (SCA) techniques coupled with i) a tracking mechanism
instrumental to locally estimate gradient averages; and ii) a novel block-wise
consensus-based protocol to perform local block-averaging operations and
gradient tracking. Asymptotic convergence to stationary solutions of the
nonconvex problem is established. Finally, numerical results show the
effectiveness of the proposed algorithm and highlight how the block dimension impacts the communication overhead and practical convergence speed.
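A highly simplified, single-iteration sketch of the block-wise scheme described above, assuming a proximal-linear surrogate, a doubly stochastic mixing matrix W, and placeholder oracles `local_grad` and `prox_reg`; uncoordinated block selection, constraints, and the paper's step-size rules are not modeled.

```python
import numpy as np

def block_sca_step(x, y, grads_prev, local_grad, prox_reg, W, block, tau=1.0, gamma=0.5):
    """One simplified iteration of a block-wise SCA / consensus scheme.

    Each agent updates only the selected block of variables using its tracked
    gradient estimate, then averages that block with its neighbors and
    refreshes the gradient tracker. Shapes: x, y, grads_prev are (n_agents, dim);
    W is a doubly stochastic mixing matrix; block indexes the chosen variables.
    """
    n_agents = x.shape[0]
    x_new, y_new = x.copy(), y.copy()
    for i in range(n_agents):
        # prox-linear best response on the block, using the tracked gradient y_i
        v = x[i, block] - (1.0 / tau) * y[i, block]
        x_hat = prox_reg(v, 1.0 / tau)
        x_new[i, block] = x[i, block] + gamma * (x_hat - x[i, block])
    # block-wise consensus averaging of the updated block
    x_new[:, block] = W @ x_new[:, block]
    # gradient tracking: propagate local gradient differences on the block
    grads_new = np.stack([local_grad(i, x_new[i]) for i in range(n_agents)])
    y_new[:, block] = W @ y[:, block] + grads_new[:, block] - grads_prev[:, block]
    return x_new, y_new, grads_new
```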
Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization
With the rapid rise of complex data, nonconvex models such as nonconvex loss functions and nonconvex regularizers are widely used in machine learning and pattern recognition. In this paper, we propose a class of mini-batch stochastic
ADMMs (alternating direction method of multipliers) for solving large-scale
nonconvex nonsmooth problems. We prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM without a variance reduction (VR) technique is convergent and reaches a convergence rate of O(1/T) for obtaining a stationary point of the nonconvex optimization problem, where T denotes the number of iterations. Moreover, we extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM proposed in our initial manuscript \cite{huang2016stochastic}, and prove that these mini-batch stochastic ADMMs also reach a convergence rate of O(1/T) without any condition on the mini-batch size. In particular, we provide a specific parameter selection for the step size of the stochastic gradients and the penalty parameter of the augmented Lagrangian function. Finally, extensive experimental results on both simulated
and real-world data demonstrate the effectiveness of the proposed algorithms.
Comment: We have fixed some errors in the proofs. arXiv admin note: text overlap with arXiv:1610.0275
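A minimal sketch of a mini-batch stochastic ADMM iteration for the consensus special case min f(x) + g(y) s.t. x = y, with the x-update linearized by a mini-batch gradient; `grad_minibatch` and `prox_g` are assumed oracles and the parameter choices are illustrative, not the paper's specific selection.

```python
import numpy as np

def minibatch_stochastic_admm(grad_minibatch, prox_g, x0, rho=1.0, eta=0.1,
                              n_iters=200, batch_size=32):
    """Sketch of mini-batch stochastic ADMM for  min_x f(x) + g(y)  s.t.  x = y,
    where f is a (possibly nonconvex) finite-sum loss and g a nonsmooth
    regularizer. The x-update linearizes f with a mini-batch gradient and adds
    a proximal term with step size eta; rho is the augmented Lagrangian penalty.
    """
    x = x0.copy()
    y = x0.copy()
    u = np.zeros_like(x0)                       # scaled dual variable
    for _ in range(n_iters):
        g = grad_minibatch(x, batch_size)       # mini-batch stochastic gradient of f
        # closed-form linearized x-update: minimizes <g,x> + rho/2||x-y+u||^2 + ||x-x_k||^2/(2 eta)
        x = (x / eta + rho * (y - u) - g) / (1.0 / eta + rho)
        y = prox_g(x + u, 1.0 / rho)            # proximal update for the regularizer
        u = u + x - y                           # dual ascent on the consensus constraint
    return y
```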
Anisotropic Proximal Gradient
This paper studies a novel algorithm for nonconvex composite minimization
which can be interpreted in terms of dual space nonlinear preconditioning for
the classical proximal gradient method. The proposed scheme can be applied to
composite minimization problems whose smooth part exhibits an anisotropic
descent inequality relative to a reference function. In the convex case this is
a dual characterization of relative strong convexity in the Bregman sense. It
is proved that the anisotropic descent property is closed under pointwise
average if the dual Bregman distance is jointly convex and, more specifically,
closed under pointwise conic combinations for the KL-divergence. We analyze the
method's asymptotic convergence and prove its linear convergence under an
anisotropic proximal gradient dominance condition. This is implied by
anisotropic strong convexity, a recent dual characterization of relative
smoothness in the Bregman sense. Applications are discussed including
exponentially regularized LPs and logistic regression with nonsmooth
regularization. In the LP case the method can be specialized to the Sinkhorn
algorithm for regularized optimal transport and a classical parallel update
algorithm for AdaBoost. Complementary to their existing primal interpretations
in terms of entropic subspace projections, this provides a new dual
interpretation in terms of forward-backward splitting with entropic
preconditioning.
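As a concrete instance, here is the classical Sinkhorn iteration for entropically regularized optimal transport, the special case to which the abstract says the method specializes; this is the standard algorithm, not the paper's general anisotropic update.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iters=500):
    """Classical Sinkhorn iteration for entropically regularized optimal transport.

    C is the cost matrix, a and b the source/target marginals, eps the entropic
    regularization strength. Returns the transport plan diag(u) K diag(v).
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # match the row marginals
        v = b / (K.T @ u)                # match the column marginals
    return u[:, None] * K * v[None, :]
```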