RSG: Beating Subgradient Method without Smoothness and Strong Convexity
In this paper, we study the efficiency of a {\bf R}estarted {\bf S}ub{\bf
G}radient (RSG) method that periodically restarts the standard subgradient
method (SG). We show that, when applied to a broad class of convex optimization
problems, the RSG method can find an $\epsilon$-optimal solution with a lower
complexity than the SG method. In particular, we first show that RSG can reduce
the dependence of SG's iteration complexity on the distance between the initial
solution and the optimal set to the distance between the $\epsilon$-level set and the
optimal set, multiplied by a logarithmic factor. Moreover, we show the
advantages of RSG over SG in solving three different families of convex
optimization problems. (a) For problems whose epigraph is a polyhedron, RSG
is shown to converge linearly. (b) For problems with a local quadratic growth
property in the $\epsilon$-sublevel set, RSG has a lower iteration complexity than SG.
(c) For problems that admit a local Kurdyka-\L ojasiewicz property, RSG has an
iteration complexity that depends on the KL power constant.
The novelty of our analysis lies in exploiting the lower bound of the
first-order optimality residual at the $\epsilon$-level set. It is this novelty
that allows us to explore the local properties of functions (e.g., the local
quadratic growth property, the local Kurdyka-\L ojasiewicz property, and more
generally local error bound conditions) to develop the improved convergence of
RSG. We also develop a practical variant of RSG enjoying faster convergence
than the SG method, which can be run without knowing the parameters involved in
the local error bound condition. We demonstrate the effectiveness of the
proposed algorithms on several machine learning tasks including regression,
classification, and matrix completion.
Comment: Final version accepted by JMLR
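For illustration, here is a minimal Python sketch of a restart scheme in the spirit described above: run the subgradient method for a fixed number of iterations, restart from the returned solution with a geometrically decreased step size, and repeat. The function names, the iterate averaging, and the step-size schedule are assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def subgradient_method(subgrad, x0, step, num_iters):
    """Standard subgradient method (SG) with a fixed step size; returns the averaged iterate."""
    x = x0.copy()
    avg = np.zeros_like(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step * subgrad(x)
        avg += x / num_iters
    return avg

def restarted_subgradient(subgrad, x0, step0, inner_iters, num_stages, decay=0.5):
    """RSG-style wrapper: run SG for a fixed number of iterations, restart from
    its output with a geometrically decreased step size, and repeat."""
    x, step = x0.copy(), step0
    for _ in range(num_stages):
        x = subgradient_method(subgrad, x, step, inner_iters)
        step *= decay  # shrink the step size between stages
    return x
```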
Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization
Recently, {\it stochastic momentum} methods have been widely adopted in
training deep neural networks. However, their convergence analysis remains
underexplored, in particular for non-convex optimization. This paper fills the
gap between practice and theory by developing a basic convergence analysis of
two stochastic momentum methods, namely the stochastic heavy-ball method and
the stochastic variant of Nesterov's accelerated gradient method. We hope that
the basic convergence results developed in this paper can serve as a reference
for the convergence of stochastic momentum methods and as baselines for
comparison in their future development. The novelty of the convergence analysis
presented in this paper is a unified framework, revealing more insights about
the similarities and differences between different stochastic momentum methods
and the stochastic gradient method. The unified framework exhibits a continuous
change from the gradient method to Nesterov's accelerated gradient method and
finally to the heavy-ball method, induced by a free parameter, which can help
explain a similar change observed in the testing-error convergence behavior for
deep learning. Furthermore, our empirical results for optimizing deep neural
networks demonstrate that the stochastic variant of Nesterov's accelerated
gradient method achieves a good tradeoff (between speed of convergence in
training error and robustness of convergence in testing error) among the three
stochastic methods.
Comment: Added some references and more empirical results
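As a rough illustration of such a unified update, the sketch below uses a free parameter s to interpolate between variants; the exact parameterization in the paper may differ, so treat this only as an illustrative form.

```python
def unified_momentum_step(x, y_s_prev, stoch_grad, lr=0.01, beta=0.9, s=1.0):
    """One update of a unified stochastic momentum scheme (illustrative parameterization).

    s = 0 gives a heavy-ball-style update, s = 1 a Nesterov-style update, and
    beta = 0 recovers plain stochastic gradient descent."""
    g = stoch_grad(x)                      # stochastic (sub)gradient at the current iterate
    y = x - lr * g                         # standard gradient step
    y_s = x - s * lr * g                   # auxiliary step controlled by the free parameter s
    x_next = y + beta * (y_s - y_s_prev)   # momentum correction
    return x_next, y_s
```

Before the first call, y_s_prev can simply be initialized to the starting point.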
Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition
In this paper, a new theory is developed for first-order stochastic convex
optimization, showing that the global convergence rate is sufficiently
quantified by a local growth rate of the objective function in a neighborhood
of the optimal solutions. In particular, if the objective function in the
$\epsilon$-sublevel set grows as fast as a power of the distance to the closest
optimal solution, where the exponent quantifies the local growth rate, then the
iteration complexity of first-order stochastic optimization for achieving an
$\epsilon$-optimal solution improves accordingly and is optimal up to at most a
logarithmic factor. To achieve the faster
global convergence, we develop two different accelerated stochastic subgradient
methods by iteratively solving the original problem approximately in a local
region around a historical solution with the size of the local region gradually
decreasing as the solution approaches the optimal set. Besides the theoretical
improvements, this work also includes new contributions towards making the
proposed algorithms practical: (i) we present practical variants of the
accelerated stochastic subgradient methods that can run without knowledge of
the multiplicative growth constant and even of the growth rate; (ii) we
consider a broad family of problems in machine learning to demonstrate that the
proposed algorithms enjoy faster convergence than the traditional stochastic
subgradient method. We also characterize the complexity of the proposed
algorithms for ensuring that the gradient is small, without the smoothness
assumption.
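A minimal sketch of the stage-wise idea described above, assuming a Euclidean ball as the local region and a halving schedule for both the region size and the step size (the authors' exact algorithms and schedules may differ):

```python
import numpy as np

def projected_stochastic_subgradient(stoch_subgrad, center, radius, step, num_iters):
    """Stochastic subgradient steps projected onto a ball of given radius around `center`."""
    x = center.copy()
    avg = np.zeros_like(center, dtype=float)
    for _ in range(num_iters):
        x = x - step * stoch_subgrad(x)
        offset = x - center
        norm = np.linalg.norm(offset)
        if norm > radius:                 # project back into the local region
            x = center + offset * (radius / norm)
        avg += x / num_iters
    return avg

def stagewise_local_subgradient(stoch_subgrad, x0, radius0, step0, num_stages, inner_iters):
    """Solve the problem approximately in a local region around the previous
    stage's solution, shrinking the region and the step size between stages."""
    x, radius, step = x0.copy(), radius0, step0
    for _ in range(num_stages):
        x = projected_stochastic_subgradient(stoch_subgrad, x, radius, step, inner_iters)
        radius *= 0.5                     # gradually decrease the size of the local region
        step *= 0.5
    return x
```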
Doubly Stochastic Primal-Dual Coordinate Method for Bilinear Saddle-Point Problem
We propose a doubly stochastic primal-dual coordinate optimization algorithm
for empirical risk minimization, which can be formulated as a bilinear
saddle-point problem. In each iteration, our method randomly samples a block of
coordinates of the primal and dual solutions to update. The linear convergence
of our method can be established in terms of 1) the distance from the current
iterate to the optimal solution and 2) the primal-dual objective gap. We show
that the proposed method has a lower overall complexity than existing
coordinate methods when either the data matrix has a factorized structure or
the proximal mapping on each block is computationally expensive, e.g.,
involving an eigenvalue decomposition. The efficiency of the proposed method is
confirmed by empirical studies on several real applications, such as the
multi-task large margin nearest neighbor problem.
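A hedged sketch of a doubly stochastic primal-dual coordinate update, specialized to a bilinear saddle-point problem with simple quadratic regularizers so that the sampled block updates take a plain gradient form; the problem instance, block sizes, and step sizes are illustrative assumptions:

```python
import numpy as np

def doubly_stochastic_pdc(A, lam=0.1, gamma=0.1, primal_block=5, dual_block=5,
                          step_x=0.01, step_y=0.01, num_iters=1000, seed=0):
    """Illustrative updates for min_x max_y  y^T A x + (lam/2)||x||^2 - (gamma/2)||y||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x, y = np.zeros(d), np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(n, size=dual_block, replace=False)    # sampled dual coordinates
        j = rng.choice(d, size=primal_block, replace=False)  # sampled primal coordinates
        grad_y = A[i, :] @ x - gamma * y[i]   # partial gradient for the sampled dual block
        grad_x = A[:, j].T @ y + lam * x[j]   # partial gradient for the sampled primal block
        y[i] += step_y * grad_y               # ascent step on the dual block
        x[j] -= step_x * grad_x               # descent step on the primal block
    return x, y
```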
On Data Preconditioning for Regularized Loss Minimization
In this work, we study data preconditioning, a well-known and long-existing
technique, for boosting the convergence of first-order methods for regularized
loss minimization. It is well understood that the condition number of the
problem, i.e., the ratio of the Lipschitz constant to the strong convexity
modulus, strongly affects the convergence of first-order optimization
methods. Therefore, minimizing a loss with a small regularizer to achieve good
generalization performance yields an ill-conditioned problem and becomes the
bottleneck for big data problems. We provide a theory on data preconditioning
for regularized loss minimization. In particular, our analysis exhibits an
appropriate data preconditioner and characterizes the conditions on the loss
function and on the data under which data preconditioning can reduce the
condition number and therefore boost the convergence for minimizing the
regularized loss. To make the data preconditioning practically useful, we
endeavor to employ and analyze a random sampling approach to efficiently
compute the preconditioned data. Preliminary experiments validate our theory.
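As a rough illustration of the idea, the sketch below whitens the data with a covariance estimate computed from a random subsample and then solves a ridge problem on the preconditioned data; the specific preconditioner, the ridge objective, and the subsample size are assumptions, not the paper's exact construction.

```python
import numpy as np

def precondition_data(X, sample_size=1000, ridge=1e-3, seed=0):
    """Whiten X using a covariance estimate from a random subsample of its rows."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
    cov = X[idx].T @ X[idx] / len(idx) + ridge * np.eye(X.shape[1])  # subsampled covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    P_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T      # inverse square root acts as the preconditioner
    return X @ P_inv_sqrt, P_inv_sqrt

def solve_preconditioned_ridge(X, y, lam=0.1):
    """Fit a ridge model on the preconditioned data and map the solution back."""
    X_pre, P_inv_sqrt = precondition_data(X)
    d = X.shape[1]
    w_pre = np.linalg.solve(X_pre.T @ X_pre + lam * np.eye(d), X_pre.T @ y)
    return P_inv_sqrt @ w_pre      # solution in the original coordinates
```

In practice a first-order method would be run on the better-conditioned problem; the closed-form ridge solve here only illustrates how the solution maps back to the original coordinates.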
Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/\epsilon)$
In this paper, we develop a novel {\bf ho}moto{\bf p}y {\bf s}moothing (HOPS)
algorithm for solving a family of non-smooth problems whose objective is
composed of a non-smooth term with an explicit max-structure and a smooth term
or a simple non-smooth term whose proximal mapping is easy to compute. The best
known iteration complexity for solving such non-smooth optimization problems
without any assumption of strong convexity is $O(1/\epsilon)$. In this work, we
show that the proposed HOPS achieves a lower iteration complexity, up to a
logarithmic factor, whose improvement is governed by a constant capturing the
local sharpness of the
objective function around the optimal solutions. To the best of our knowledge,
this is the lowest iteration complexity achieved so far for the considered
non-smooth optimization problems without a strong convexity assumption. The
HOPS algorithm employs Nesterov's smoothing technique and Nesterov's
accelerated gradient method and runs in stages, gradually decreasing the
smoothing parameter until it yields a sufficiently good
approximation of the original function. We show that HOPS enjoys a linear
convergence for many well-known non-smooth problems (e.g., empirical risk
minimization with a piece-wise linear loss function and a norm
regularizer, finding a point in a polyhedron, cone programming, etc.).
Experimental results verify the effectiveness of HOPS in comparison with
Nesterov's smoothing algorithm and the primal-dual style of first-order
methods.
Comment: This is a long version of the paper accepted by NIPS 2016
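To make the stage-wise scheme concrete, the sketch below applies it to the non-smooth objective f(x) = ||Ax - b||_1, whose Nesterov-smoothed approximation has gradient A^T clip((Ax - b)/mu, -1, 1); the choice of objective, the stage schedule, and the step sizes are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def smoothed_grad(A, b, x, mu):
    # gradient of the Nesterov-smoothed approximation of ||Ax - b||_1
    return A.T @ np.clip((A @ x - b) / mu, -1.0, 1.0)

def accelerated_gradient(A, b, x0, mu, num_iters):
    """Nesterov's accelerated gradient method on the smoothed objective."""
    L = np.linalg.norm(A, 2) ** 2 / mu          # smoothness constant of the smoothed objective
    x, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        x_new = z - smoothed_grad(A, b, z, mu) / L
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # Nesterov extrapolation
        x, t = x_new, t_new
    return x

def hops_sketch(A, b, x0, mu0=1.0, num_stages=6, inner_iters=200, decay=0.5):
    """Stage-wise homotopy smoothing: warm-start each stage and shrink mu."""
    x, mu = x0.copy(), mu0
    for _ in range(num_stages):
        x = accelerated_gradient(A, b, x, mu, inner_iters)
        mu *= decay                              # decrease the smoothing parameter stage-wise
    return x
```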
Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees
In this paper, we study a fast approximation method for the {\it large-scale
high-dimensional} sparse least-squares regression problem by exploiting the
Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional
vectors into a low-dimensional space. In particular, we propose to apply the JL
transforms to the data matrix and the target vector and then to solve a sparse
least-squares problem on the compressed data with a {\it slightly larger
regularization parameter}. Theoretically, we establish the optimization error
bound of the learned model for two different sparsity-inducing regularizers,
i.e., the elastic net and the $\ell_1$ norm. Compared with previous relevant
work, our analysis is {\it non-asymptotic and exhibits more insights} on the
bound, the sample complexity, and the regularization. As an illustration, we
also provide an error bound for the {\it Dantzig selector} under JL transforms.
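A minimal sketch of this recipe, assuming a Gaussian JL sketching matrix and scikit-learn's Lasso as the sparse solver; the inflation factor on the regularization parameter is an illustrative placeholder, not the value prescribed by the analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sketched_lasso(X, y, m, lam, inflation=1.2, seed=0):
    """Solve a sparse least-squares problem on JL-compressed data."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian JL sketching matrix
    X_s, y_s = S @ X, S @ y                        # compress the data matrix and the target
    model = Lasso(alpha=inflation * lam, fit_intercept=False)
    model.fit(X_s, y_s)                            # sparse problem with a slightly larger regularizer
    return model.coef_
```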
First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems
In this paper, we consider first-order convergence theory and algorithms for
solving a class of non-convex non-concave min-max saddle-point problems, whose
objective function is weakly convex in the variables of minimization and weakly
concave in the variables of maximization. This class of problems has many
important applications in machine learning, including training Generative
Adversarial Nets (GANs). We
propose an algorithmic framework motivated by the inexact proximal point
method, where the weakly monotone variational inequality (VI) corresponding to
the original min-max problem is solved through approximately solving a sequence
of strongly monotone VIs constructed by adding a strongly monotone mapping to
the original gradient mapping. For the generic algorithmic framework, we prove
first-order convergence to a nearly stationary solution of the original min-max
problem and establish different rates by employing different algorithms for
solving each strongly monotone VI. Experiments verify the convergence theory
and also demonstrate the effectiveness of the proposed methods on training
GANs.
Comment: In this revised version, we changed the title to "First-order Convergence
Theory for Weakly-Convex-Weakly-Concave Min-max Problems" and added more
experimental results
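A hedged sketch of the inexact proximal point idea for a smooth min-max problem min_x max_y f(x, y): each outer step adds proximal terms centered at the current point, which makes the associated VI strongly monotone, and the regularized subproblem is solved approximately, here with gradient descent-ascent as an illustrative inner solver.

```python
def prox_point_minmax(grad_x, grad_y, x0, y0, rho=1.0, outer_steps=50,
                      inner_steps=100, lr=0.05):
    """Inexact proximal point scheme for min_x max_y f(x, y); grad_x/grad_y are
    the partial gradients of f."""
    x, y = x0.copy(), y0.copy()
    for _ in range(outer_steps):
        cx, cy = x.copy(), y.copy()            # proximal centers for this subproblem
        for _ in range(inner_steps):
            # gradients of f(x, y) + (rho/2)||x - cx||^2 - (rho/2)||y - cy||^2
            gx = grad_x(x, y) + rho * (x - cx)
            gy = grad_y(x, y) - rho * (y - cy)
            x = x - lr * gx                    # descent in the minimization variable
            y = y + lr * gy                    # ascent in the maximization variable
    return x, y
```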
Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning
Min-max saddle-point problems have broad applications in many tasks in
machine learning, e.g., distributionally robust learning, learning with
non-decomposable loss, or learning with uncertain data. Although convex-concave
saddle-point problems have been broadly studied with efficient algorithms and
solid theories available, it remains a challenge to design provably efficient
algorithms for non-convex saddle-point problems, especially when the objective
function involves an expectation or a large-scale finite sum. Motivated by
recent literature on non-convex non-smooth minimization, this paper studies a
family of non-convex min-max problems where the minimization component is
non-convex (weakly convex) and the maximization component is concave. We
propose a proximally guided stochastic subgradient method and a proximally
guided stochastic variance-reduced method for expected and finite-sum
saddle-point problems, respectively. We establish the computational complexities
of both methods for finding a nearly stationary point of the corresponding
minimization problem.
Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity
We study distributed optimization algorithms for minimizing the average of
convex functions. The applications include empirical risk minimization problems
in statistical machine learning where the datasets are large and have to be
stored on different machines. We design a distributed stochastic variance
reduced gradient algorithm that, under certain conditions on the condition
number, simultaneously achieves the optimal parallel runtime, amount of
communication and rounds of communication among all distributed first-order
methods up to constant factors. Our method and its accelerated extension also
outperform existing distributed algorithms in terms of the rounds of
communication as long as the condition number is not too large compared to the
size of data in each machine. We also prove a lower bound for the number of
rounds of communication for a broad class of distributed first-order methods
including the proposed algorithms in this paper. We show that our accelerated
distributed stochastic variance reduced gradient algorithm achieves this lower
bound, so that it uses the fewest rounds of communication among all distributed
first-order algorithms.
Comment: significant additions to both theory and experimental results
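The sketch below shows the SVRG building block that underlies such distributed variance-reduced methods: each round computes a full gradient at a reference point (the step at which machines would communicate in the distributed setting) followed by variance-reduced stochastic updates. Names, step sizes, and the round structure are illustrative assumptions, not the paper's distributed algorithm.

```python
import numpy as np

def svrg(grad_i, n, w0, lr=0.1, num_rounds=20, inner_steps=100, seed=0):
    """grad_i(w, i) returns the gradient of the i-th component function at w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_rounds):
        w_ref = w.copy()
        # full gradient at the reference point; in a distributed setting this is
        # where machines would aggregate their local gradients (a communication round)
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_ref, i) + full_grad   # variance-reduced stochastic gradient
            w = w - lr * g
    return w
```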