A Stochastic Quasi-Newton Method with Nesterov's Accelerated Gradient
Incorporating second-order curvature information into gradient-based methods
has been shown to improve convergence drastically despite the added
computational cost. In this paper, we propose a stochastic (online) quasi-Newton method
with Nesterov's accelerated gradient in both its full and limited memory forms
for solving large scale non-convex optimization problems in neural networks.
The performance of the proposed algorithm is evaluated in Tensorflow on
benchmark classification and regression problems. The results show improved
performance compared to the classical second-order oBFGS and oLBFGS methods and
popular first-order stochastic methods such as SGD and Adam. The performance
with different momentum rates and batch sizes has also been illustrated. Comment: Accepted at ECML-PKDD 2019
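As a rough illustration of the idea summarized above, the sketch below combines a Nesterov-style look-ahead gradient with a standard L-BFGS two-loop recursion. This is a minimal sketch under assumed interfaces, not the authors' implementation: the names stochastic_grad, s_list and y_list (the stored curvature pairs), and all hyperparameters are illustrative.

import numpy as np

def two_loop_direction(grad, s_list, y_list):
    # Standard L-BFGS two-loop recursion: approximates H^{-1} grad from stored (s, y) pairs.
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        alphas.append(a)
        q = q - a * y
    if s_list:
        # Initial Hessian scaling gamma = s'y / y'y from the most recent pair.
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        q = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / np.dot(y, s)
        b = rho * np.dot(y, q)
        q = q + (a - b) * s
    return q

def nag_lbfgs_step(w, v, batch, stochastic_grad, s_list, y_list, lr=0.1, mu=0.9):
    # Evaluate the minibatch gradient at the Nesterov look-ahead point w + mu*v,
    # precondition it with the two-loop recursion, then apply the momentum update.
    g = stochastic_grad(w + mu * v, batch)
    d = two_loop_direction(g, s_list, y_list)
    v_new = mu * v - lr * d
    return w + v_new, v_new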
A Linearly-Convergent Stochastic L-BFGS Algorithm
We propose a new stochastic L-BFGS algorithm and prove a linear convergence
rate for strongly convex and smooth functions. Our algorithm draws heavily from
a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as
a recent approach to variance reduction for stochastic gradient descent from
Johnson and Zhang (2013). We demonstrate experimentally that our algorithm
performs well on large-scale convex and non-convex optimization problems,
exhibiting linear convergence and rapidly solving the optimization problems to
high levels of precision. Furthermore, we show that our algorithm performs well
for a wide range of step sizes, often differing by several orders of magnitude. Comment: 10 pages, 3 figures; in International Conference on Artificial
Intelligence and Statistics, 2016
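The key ingredient this abstract combines with stochastic L-BFGS is an SVRG-style variance-reduced gradient. The snippet below is a minimal sketch of that ingredient under assumed names (grad_i computes the gradient of a single component function); the two-loop recursion from the previous sketch would then precondition the returned gradient.

import numpy as np

def svrg_gradient(w, w_snapshot, full_grad_snapshot, batch_idx, grad_i):
    # Variance-reduced gradient: minibatch gradient at w, corrected by the same
    # minibatch evaluated at the snapshot plus the full gradient stored at the snapshot.
    g_new = np.mean([grad_i(w, i) for i in batch_idx], axis=0)
    g_old = np.mean([grad_i(w_snapshot, i) for i in batch_idx], axis=0)
    return g_new - g_old + full_grad_snapshot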
A Stochastic Quasi-Newton Method for Large-Scale Optimization
The question of how to incorporate curvature information in stochastic
approximation methods is challenging. The direct application of classical
quasi-Newton updating techniques for deterministic optimization leads to noisy
curvature estimates that have harmful effects on the robustness of the
iteration. In this paper, we propose a stochastic quasi-Newton method that is
efficient, robust and scalable. It employs the classical BFGS update formula in
its limited memory form, and is based on the observation that it is beneficial
to collect curvature information pointwise, and at regular intervals, through
(sub-sampled) Hessian-vector products. This technique differs from the
classical approach that would compute differences of gradients, and where
controlling the quality of the curvature estimates can be difficult. We present
numerical results on problems arising in machine learning that suggest that the
proposed method shows much promise.
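The observation described above, collecting curvature pairs from (sub-sampled) Hessian-vector products rather than gradient differences, can be sketched as follows. The names (hess_vec, the averaged iterates, the memory size) are assumptions for illustration, not the paper's pseudocode.

import numpy as np

def update_curvature_pairs(w_avg_new, w_avg_old, hess_vec, subsample,
                           s_list, y_list, memory=10):
    # Collect one (s, y) pair at a regular interval from averaged iterates and a
    # sub-sampled Hessian-vector product, instead of differencing noisy gradients.
    s = w_avg_new - w_avg_old                 # displacement between averaged iterates
    y = hess_vec(w_avg_new, s, subsample)     # y = H_S(w) s via a Hessian-vector product
    if np.dot(s, y) > 1e-8 * np.dot(s, s):    # keep only pairs with safely positive curvature
        s_list.append(s)
        y_list.append(y)
        if len(s_list) > memory:              # limited-memory form: drop the oldest pair
            s_list.pop(0)
            y_list.pop(0)
    return s_list, y_list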
Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization
In this paper we study stochastic quasi-Newton methods for nonconvex
stochastic optimization, where we assume that noisy information about the
gradients of the objective function is available via a stochastic first-order
oracle (SFO). We propose a general framework for such methods, for which we
prove almost sure convergence to stationary points and analyze its worst-case
iteration complexity. When a randomly chosen iterate is returned as the output
of such an algorithm, we prove that in the worst-case, the SFO-calls complexity
is $O(\epsilon^{-2})$ to ensure that the expectation of the squared norm of the
gradient is smaller than the given accuracy tolerance $\epsilon$. We also
propose a specific algorithm, namely a stochastic damped L-BFGS (SdLBFGS)
method, that falls under the proposed framework. Moreover, we incorporate the
SVRG variance reduction technique into the proposed SdLBFGS method, and analyze
its SFO-calls complexity. Numerical results on a nonconvex binary
classification problem using SVM, and a multiclass classification problem using
neural networks are reported. Comment: published in SIAM Journal on Optimization
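A central device behind damped quasi-Newton updates of the kind mentioned above is modifying y so that the curvature s'y stays safely positive despite gradient noise. The sketch below uses Powell-style damping with a crude stand-in for B*s from the usual limited-memory initial scaling; it is an illustrative simplification, not the SdLBFGS pseudocode.

import numpy as np

def damped_pair(s, y, gamma=1.0, c=0.2):
    # Return a damped curvature pair (s, y_bar) with s' y_bar >= c * s' B s > 0.
    Bs = s / gamma                      # stand-in for B*s under the initial scaling gamma*I
    sBs = np.dot(s, Bs)
    sy = np.dot(s, y)
    if sy >= c * sBs:
        theta = 1.0                     # curvature already positive enough, keep y unchanged
    else:
        theta = (1.0 - c) * sBs / (sBs - sy)
    y_bar = theta * y + (1.0 - theta) * Bs
    return s, y_bar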
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research: techniques that diminish noise in the stochastic
directions, and methods that make use of second-order derivative approximations.
Statistical Inference for the Population Landscape via Moment Adjusted Stochastic Gradients
Modern statistical inference tasks often require iterative optimization
methods to compute the solution. Convergence analysis from an optimization
viewpoint only informs us how well the solution is approximated numerically but
overlooks the sampling nature of the data. In contrast, recognizing the
randomness in the data, statisticians are keen to provide uncertainty
quantification, or confidence, for the solution obtained using iterative
optimization methods. This paper makes progress along this direction by
introducing the moment-adjusted stochastic gradient descents, a new stochastic
optimization method for statistical inference. We establish non-asymptotic
theory that characterizes the statistical distribution for certain iterative
methods with optimization guarantees. On the statistical front, the theory
allows for model mis-specification, with very mild conditions on the data. For
optimization, the theory is flexible for both convex and non-convex cases.
Remarkably, the moment-adjusting idea motivated from "error standardization" in
statistics achieves a similar effect as acceleration in first-order
optimization methods used to fit generalized linear models. We also demonstrate
this acceleration effect in the non-convex setting through numerical
experiments. Comment: Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 2019, to appear
Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
We present an algorithm for minimizing a sum of functions that combines the
computational efficiency of stochastic gradient descent (SGD) with the second
order curvature information leveraged by quasi-Newton methods. We unify these
disparate approaches by maintaining an independent Hessian approximation for
each contributing function in the sum. We maintain computational tractability
and limit memory requirements even for high dimensional optimization problems
by storing and manipulating these quadratic approximations in a shared, time
evolving, low dimensional subspace. Each update step requires only a single
contributing function or minibatch evaluation (as in SGD), and each step is
scaled using an approximate inverse Hessian and little to no adjustment of
hyperparameters is required (as is typical for quasi-Newton methods). This
algorithm contrasts with earlier stochastic second order techniques that treat
the Hessian of each contributing function as a noisy approximation to the full
Hessian, rather than as a target for direct estimation. We experimentally
demonstrate improved convergence on seven diverse optimization problems. The
algorithm is released as open-source Python and MATLAB packages.
Variable Metric Stochastic Approximation Theory
We provide a variable metric stochastic approximation theory. In doing so, we
provide a convergence theory for a large class of online variable metric
methods including the recently introduced online versions of the BFGS algorithm
and its limited-memory LBFGS variant. We also discuss the implications of our
results for learning from expert advice. Comment: Correction of Theorem 3.4 from the AISTATS 2009 article
Stochastic Trust Region Inexact Newton Method for Large-scale Machine Learning
Nowadays, stochastic approximation methods are one of the major research
directions for dealing with large-scale machine learning problems. From
stochastic first-order methods, the focus is now shifting to stochastic
second-order methods due to their faster convergence and the availability of
computing resources. In this paper, we propose a novel Stochastic Trust RegiOn
Inexact Newton method, called STRON, for solving large-scale learning problems,
which uses conjugate gradient (CG) to inexactly solve the trust-region
subproblem. The method uses progressive subsampling in the calculation of
gradient and Hessian values to take advantage of both stochastic and full-batch
regimes. We extend STRON using existing variance reduction techniques to deal
with noisy gradients, and using preconditioned conjugate gradient (PCG) as the
subproblem solver, and show empirically that these extensions do not work as
expected for large-scale learning problems. Finally, our empirical results
demonstrate the efficacy of the proposed method against existing methods on
benchmark datasets. Comment: 32 figures; accepted in International Journal of
Machine Learning and Cybernetics
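The inexact trust-region solve referred to above is typically done with a Steihaug-Toint CG iteration that needs the Hessian only through (sub-sampled) Hessian-vector products. The sketch below is a generic CG-Steihaug inner solver under that assumption; hess_vec and the tolerances are illustrative, and this is not the STRON implementation itself.

import numpy as np

def steihaug_cg(grad, hess_vec, delta, tol=1e-4, max_iter=50):
    # Approximately minimize g'p + 0.5 p'Hp subject to ||p|| <= delta,
    # using only Hessian-vector products hess_vec(d) = H d.
    p = np.zeros_like(grad)
    r = grad.copy()                           # residual of H p = -g at p = 0
    d = -r
    if np.linalg.norm(r) < tol:
        return p
    for _ in range(max_iter):
        Hd = hess_vec(d)
        dHd = np.dot(d, Hd)
        if dHd <= 0.0:                        # negative curvature: step to the boundary
            return _to_boundary(p, d, delta)
        alpha = np.dot(r, r) / dHd
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= delta:   # step would leave the trust region
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) < tol:
            return p_next
        beta = np.dot(r_next, r_next) / np.dot(r, r)
        d = -r_next + beta * d
        p, r = p_next, r_next
    return p

def _to_boundary(p, d, delta):
    # Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta.
    a = np.dot(d, d)
    b = 2.0 * np.dot(p, d)
    c = np.dot(p, p) - delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p + tau * d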
Stochastic L-BFGS: Improved Convergence Rates and Practical Acceleration Strategies
We revisit the stochastic limited-memory BFGS (L-BFGS) algorithm. By
proposing a new framework for the convergence analysis, we prove improved
convergence rates and computational complexities of the stochastic L-BFGS
algorithms compared to previous works. In addition, we propose several
practical acceleration strategies to speed up the empirical performance of such
algorithms. We also provide theoretical analyses for most of the strategies.
Experiments on large-scale logistic and ridge regression problems demonstrate
that our proposed strategies yield significant improvements vis-à-vis
competing state-of-the-art algorithms.