A Simple Practical Accelerated Method for Finite Sums
We describe a novel optimization method for finite sums (such as empirical
risk minimization problems) building on the recently introduced SAGA method.
Our method achieves an accelerated convergence rate on strongly convex smooth
problems. It has only one parameter (a step size) and is radically simpler
than other accelerated methods for finite sums. Additionally, it can be
applied when the terms are non-smooth, yielding a method applicable in many
areas where operator splitting methods would traditionally be used.
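The method builds on SAGA, which maintains a table of the last gradient seen for each term and uses it to cancel variance. As context, here is a minimal sketch of the plain SAGA baseline (not the paper's accelerated variant); `grad_i` is a hypothetical callback returning the gradient of the i-th term.

```python
import numpy as np

def saga(grad_i, n, x0, step, iters, seed=0):
    """Plain SAGA sketch: store the last gradient seen for each term
    and use it to build a variance-reduced stochastic gradient."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    table = np.array([grad_i(x0, i) for i in range(n)])  # per-term stored gradients
    avg = table.mean(axis=0)                             # running table average
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(x, i)
        x = x - step * (g - table[i] + avg)  # unbiased, variance-reduced step
        avg = avg + (g - table[i]) / n       # keep the average in sync
        table[i] = g
    return x

# Example: least squares with terms f_i(x) = 0.5 * (a_i @ x - b_i)^2
A = np.random.default_rng(1).standard_normal((50, 5))
b = A @ np.ones(5)
x = saga(lambda x, i: (A[i] @ x - b[i]) * A[i],
         n=50, x0=np.zeros(5), step=0.01, iters=5000)
```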
Linear Convergence of Cyclic SAGA
In this work, we present and analyze C-SAGA, a (deterministic) cyclic variant
of SAGA. C-SAGA is an incremental gradient method that minimizes a sum of
differentiable convex functions by cyclically accessing their gradients. Even
though the theory of stochastic algorithms is more mature than that of cyclic
counterparts in general, practitioners often prefer cyclic algorithms. We prove
C-SAGA converges linearly under the standard assumptions. Then, we compare the
rate of convergence with the full gradient method, (stochastic) SAGA, and
incremental aggregated gradient (IAG), both theoretically and experimentally.
Comment: Published in Optimization Letters.
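To illustrate the one structural difference from stochastic SAGA, a sketch of a cyclic variant in the spirit of C-SAGA (the paper's step-size conditions and exact analysis are not reproduced here): the stored-gradient update is unchanged, but indices are visited in a fixed order rather than sampled.

```python
import numpy as np

def cyclic_saga(grad_i, n, x0, step, epochs):
    """Cyclic SAGA-style sketch: same table-based variance reduction,
    but with deterministic cyclic access to the component gradients."""
    x = x0.copy()
    table = np.array([grad_i(x0, i) for i in range(n)])
    avg = table.mean(axis=0)
    for _ in range(epochs):
        for i in range(n):                   # deterministic cyclic order
            g = grad_i(x, i)
            x = x - step * (g - table[i] + avg)
            avg = avg + (g - table[i]) / n
            table[i] = g
    return x
```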
ASVRG: Accelerated Proximal SVRG
This paper proposes an accelerated proximal stochastic variance reduced
gradient (ASVRG) method, in which we design a simple and effective momentum
acceleration trick. Unlike most existing accelerated stochastic variance
reduction methods such as Katyusha, ASVRG has only one additional variable and
one momentum parameter. Thus, ASVRG is much simpler than those methods, and has
much lower per-iteration complexity. We prove that ASVRG achieves the best
known oracle complexities for both strongly convex and non-strongly convex
objectives. In addition, we extend ASVRG to mini-batch and non-smooth settings.
We also empirically verify our theoretical results and show that the
performance of ASVRG is comparable with, and sometimes even better than, that
of state-of-the-art stochastic methods.
Comment: 32 pages, 3 figures.
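As a rough illustration of the ingredient the abstract highlights, here is a generic SVRG inner loop combined with a single extra momentum variable and a single momentum parameter. This is a sketch of the general flavor only, not the paper's exact ASVRG update rule; `grad_i` and `full_grad` are hypothetical callbacks.

```python
import numpy as np

def svrg_momentum(grad_i, full_grad, n, x0, step, beta, epochs, m, seed=0):
    """SVRG with one momentum variable v and one momentum parameter beta
    (illustrative only; ASVRG's exact coupling differs)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros_like(x0)                 # the single additional variable
    for _ in range(epochs):
        snap = x.copy()
        mu = full_grad(snap)              # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(snap, i) + mu  # variance-reduced gradient
            v = beta * v - step * g       # momentum update
            x = x + v
    return x
```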
The proximal point method revisited
In this short survey, I revisit the role of the proximal point method in
large scale optimization. I focus on three recent examples: a proximally guided
subgradient method for weakly convex stochastic approximation, the prox-linear
algorithm for minimizing compositions of convex functions and smooth maps, and
Catalyst generic acceleration for regularized Empirical Risk Minimization.
Comment: 11 pages, submitted to SIAG/OPT Views and News.
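The proximal point method iterates x_{k+1} = argmin_x f(x) + ||x - x_k||^2 / (2*lam). A self-contained sketch for a strongly convex quadratic, where each prox step has a closed form:

```python
import numpy as np

# Proximal point iteration for f(x) = 0.5 * x @ A @ x - b @ x:
# argmin_x f(x) + ||x - x_k||^2 / (2*lam) solves to
# x_{k+1} = (A + I/lam)^{-1} (b + x_k/lam).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.1 * np.eye(5)     # positive definite => f strongly convex
b = rng.standard_normal(5)
lam = 10.0

x = np.zeros(5)
for _ in range(50):
    x = np.linalg.solve(A + np.eye(5) / lam, b + x / lam)  # one prox step

print(np.allclose(x, np.linalg.solve(A, b)))  # converges to the minimizer A^{-1} b
```

Each iteration contracts toward the minimizer by a factor of 1/(1 + lam*mu) for a mu-strongly convex objective, which is why larger lam gives faster (but more expensive) steps.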
On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
The application of stochastic variance reduction to optimization has shown
remarkable recent theoretical and practical success. The applicability of these
techniques to the hard non-convex optimization problems encountered during
training of modern deep neural networks is an open problem. We show that naive
application of the SVRG technique and related approaches fails, and we explore why.
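For reference, the naive SVRG-style estimator in question builds its gradient estimate from a stored snapshot of the weights; a minimal sketch (names are illustrative):

```python
def svrg_gradient(grad_i, i, w, w_snap, mu_snap):
    """Naive SVRG estimator: unbiased, and low-variance when w stays close
    to the snapshot w_snap; mu_snap is the full gradient at w_snap.
    In deep learning, re-evaluating grad_i consistently at the snapshot is
    complicated by, e.g., data augmentation, dropout, and batch norm, and
    fast-moving iterates make the snapshot stale -- difficulties of the
    kind the paper investigates.
    """
    return grad_i(w, i) - grad_i(w_snap, i) + mu_snap
```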
Towards More Efficient Stochastic Decentralized Learning: Faster Convergence and Sparse Communication
Recently, the decentralized optimization problem has attracted growing
attention. Most existing methods are deterministic, with high per-iteration
cost, and have a convergence rate that depends quadratically on the problem's
condition number. Moreover, dense communication is necessary to ensure
convergence even when the dataset is sparse. In this paper, we generalize the
decentralized
optimization problem to a monotone operator root finding problem, and propose a
stochastic algorithm named DSBA that (i) converges geometrically, at a rate
that depends only linearly on the condition number, and (ii) can be implemented
using sparse communication only. Additionally, DSBA handles learning problems
like AUC-maximization which cannot be tackled efficiently in the decentralized
setting. Experiments on convex minimization and AUC-maximization validate the
efficiency of our method.
Comment: Accepted to ICML 2018.
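DSBA itself operates at the level of monotone operator root finding with sparse communication; as generic context only, here is the basic decentralized pattern it refines: alternate neighbor averaging through a gossip matrix W with local stochastic gradient steps. This is not DSBA's actual update; all names are illustrative.

```python
import numpy as np

def decentralized_sgd(grads, W, X0, step, iters, seed=0):
    """Generic decentralized stochastic gradient sketch (not DSBA):
    grads[k](x) returns a stochastic gradient of node k's local objective,
    W is a doubly stochastic gossip matrix matching the network topology,
    and row k of X holds node k's local iterate."""
    rng = np.random.default_rng(seed)
    X = X0.copy()
    for _ in range(iters):
        X = W @ X                               # communication: neighbor averaging
        for k in range(X.shape[0]):
            X[k] = X[k] - step * grads[k](X[k])  # local computation
    return X
```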
First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization
This paper studies empirical risk minimization (ERM) problems for large-scale
datasets and incorporates the idea of adaptive sample size methods to improve
the guaranteed convergence bounds for first-order stochastic and deterministic
methods. In contrast to traditional methods that attempt to solve the ERM
problem corresponding to the full dataset directly, adaptive sample size
schemes start with a small number of samples and solve the corresponding ERM
problem to its statistical accuracy. The sample size is then grown
geometrically -- e.g., scaling by a factor of two -- and the solution of the
previous ERM problem is used as a warm start for the new one. Theoretical analyses show
that the use of adaptive sample size methods reduces the overall computational
cost of achieving the statistical accuracy of the whole dataset for a broad
range of deterministic and stochastic first-order methods. The gains are
specific to the choice of method. When particularized to, e.g., accelerated
gradient descent and the stochastic variance reduced gradient (SVRG) method,
the computational cost advantage is a logarithmic factor in the number of
training samples. Numerical
experiments on various datasets confirm theoretical claims and showcase the
gains of using the proposed adaptive sample size scheme.
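A sketch of the control loop described above, with `solve_erm` a hypothetical subsolver that solves the ERM problem restricted to the first n samples to its statistical accuracy (the stopping criterion lives inside the subsolver):

```python
def adaptive_sample_size(solve_erm, n_total, n0=128):
    """Adaptive sample size scheme: solve a small ERM to its statistical
    accuracy, double the sample size, and warm-start the next ERM from
    the previous solution.

    solve_erm(n, x_init) returns an approximate solution of the ERM on
    the first n samples, initialized at x_init (None = cold start)."""
    n = min(n0, n_total)
    x = solve_erm(n, x_init=None)      # cold start on the smallest problem
    while n < n_total:
        n = min(2 * n, n_total)        # grow geometrically (factor of two)
        x = solve_erm(n, x_init=x)     # warm start from the previous solution
    return x
```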
Stochastic Nonconvex Optimization with Large Minibatches
We study stochastic optimization of nonconvex loss functions, which are
typical objectives for training neural networks. We propose stochastic
approximation algorithms which optimize a series of regularized, nonlinearized
losses on large minibatches of samples, using only first-order gradient
information. Our algorithms provably converge to an approximate critical point
of the expected objective with faster rates than minibatch stochastic gradient
descent, and facilitate better parallelization by allowing larger minibatches.
Comment: Accepted by ALT 2019.
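A sketch of the general template the abstract describes: at each outer step, approximately minimize the loss on a large minibatch plus a proximal regularizer anchored at the current iterate, using only a few first-order inner steps. The names and the inner solver are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def minibatch_prox(loss_grad, sample_batch, x0, lam,
                   inner_steps, inner_lr, outer_iters, seed=0):
    """Regularized large-minibatch template (illustrative sketch):
    loss_grad(x, batch) is the gradient of the minibatch loss,
    sample_batch(rng) draws a large minibatch of samples."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(outer_iters):
        batch = sample_batch(rng)              # large minibatch; inner work parallelizes
        anchor = x.copy()
        for _ in range(inner_steps):           # first-order inner solver
            g = loss_grad(x, batch) + lam * (x - anchor)  # regularized loss gradient
            x = x - inner_lr * g
    return x
```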
Curvature-Exploiting Acceleration of Elastic Net Computations
This paper introduces an efficient second-order method for solving the
elastic net problem. Its key innovation is a computationally efficient
technique for injecting curvature information in the optimization process which
admits a strong theoretical performance guarantee. In particular, we show
improved run time over popular first-order methods and quantify the speed-up in
terms of statistical measures of the data matrix. The improved time complexity
is the result of an extensive exploitation of the problem structure and a
careful combination of second-order information, variance reduction techniques,
and momentum acceleration. Besides the theoretical speed-up, experimental
results demonstrate substantial practical benefits from curvature information,
especially on ill-conditioned data sets.
Comment: 34 pages, 2 figures.
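For concreteness, the elastic net problem and the plain first-order baseline (proximal gradient / ISTA) that curvature-exploiting methods of this kind are designed to beat:

```python
import numpy as np

def elastic_net_ista(A, b, lam1, lam2, step, iters):
    """Proximal gradient (ISTA) baseline for the elastic net
        min_x 0.5*||A x - b||^2 + lam1*||x||_1 + (lam2/2)*||x||^2.
    step should be at most 1 / (lambda_max(A^T A) + lam2)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b) + lam2 * x   # gradient of the smooth part
        z = x - step * g
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)  # soft-threshold
    return x
```

The dependence of ISTA's iteration count on the condition number is what the statistical measures of the data matrix quantify in the paper's speed-up analysis.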
Boosting First-order Methods by Shifting Objective: New Schemes with Faster Worst Case Rates
We propose a new methodology for designing first-order methods for
unconstrained strongly convex problems: designing for a shifted objective
function. Several technical lemmas are provided as building blocks for the new
methods. Shifting the objective both tightens the analysis, leaving room for
faster rates, and simplifies it. Following this methodology, we derive several
new accelerated schemes for problems equipped with various first-order oracles,
all of which have faster worst-case convergence rates than their existing
counterparts. Experiments on machine learning tasks are conducted to evaluate
the new methods.
Comment: 27 pages, 7 figures.
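One standard instance of such a shift, given as an illustration consistent with the abstract rather than the paper's exact construction: for a mu-strongly convex f, carry out the analysis on the shifted function

```latex
h(x) = f(x) - \tfrac{\mu}{2}\,\lVert x \rVert^{2},
\qquad
\nabla^{2} h(x) = \nabla^{2} f(x) - \mu I \succeq 0,
```

which is convex precisely by the definition of strong convexity; bounds proved for h transfer back to f once the quadratic term is reinstated.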