Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research: techniques that diminish noise in the stochastic
directions, and methods that make use of second-order derivative approximations.
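For readers who want a concrete anchor for the SG method discussed above, a minimal mini-batch stochastic gradient loop might look like the following sketch (NumPy; the data layout and the grad_fn callback are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Minimal mini-batch stochastic gradient method.

    grad_fn(w, batch) is assumed to return an unbiased estimate of the
    full gradient, computed on the sampled mini-batch.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in perm[start:start + batch_size]]
            w -= lr * grad_fn(w, batch)   # step along the sampled direction
    return w
```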
Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks
Progress in deep learning is slowed by the days or weeks it takes to train
large models. The natural solution of using more hardware is limited by
diminishing returns, and leads to inefficient use of additional resources. In
this paper, we present a large batch, stochastic optimization algorithm that is
both faster than widely used algorithms for fixed amounts of computation, and
also scales up substantially better as more computational resources become
available. Our algorithm implicitly computes the inverse Hessian of each
mini-batch to produce descent directions; we do so without either an explicit
approximation to the Hessian or Hessian-vector products. We demonstrate the
effectiveness of our algorithm by successfully training large ImageNet models
(Inception-V3, Resnet-50, Resnet-101 and Inception-Resnet-V2) with mini-batch
sizes of up to 32000 with no loss in validation error relative to current
baselines, and no increase in the total number of steps. At smaller mini-batch
sizes, our optimizer improves the validation error in these models by 0.8-0.9%.
Alternatively, we can trade off this accuracy to reduce the number of training
steps needed by roughly 10-30%. Our work is practical and easily usable by
others -- only one hyperparameter (learning rate) needs tuning, and
furthermore, the algorithm is as computationally cheap as the commonly used
Adam optimizer.
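For context, the optimizer's name evokes the Neumann-series identity for a matrix inverse; as a hedged aside (the abstract states the algorithm avoids explicit Hessians and Hessian-vector products, so this is only the underlying identity, not the method itself):

\[ H^{-1} = \eta \sum_{k=0}^{\infty} (I - \eta H)^{k}, \qquad \text{valid whenever } \|I - \eta H\| < 1, \]

so truncating the series after a few terms yields an approximate inverse-Hessian-times-vector direction without ever forming H^{-1} explicitly.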
Convergence rates of sub-sampled Newton methods
We consider the problem of minimizing a sum of n functions over a convex
parameter set C ⊆ R^p, where n ≫ p ≫ 1. In this
regime, algorithms which utilize sub-sampling techniques are known to be
effective. In this paper, we use sub-sampling techniques together with low-rank
approximation to design a new randomized batch algorithm which possesses a
convergence rate comparable to that of Newton's method, yet has much smaller
per-iteration cost. The proposed algorithm is robust in terms of starting point
and step size, and enjoys a composite convergence rate, namely quadratic
convergence at the start and linear convergence when the iterate is close to the
minimizer. We develop its theoretical analysis which also allows us to select
near-optimal algorithm parameters. Our theoretical results can be used to
obtain convergence rates of previously proposed sub-sampling based algorithms
as well. We demonstrate how our results apply to well-known machine learning
problems. Lastly, we evaluate the performance of our algorithm on several
datasets under various scenarios.
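As a hedged illustration of the general flavour of such methods (the variable names and the eigenvalue-truncation rule below are generic choices, not necessarily the paper's exact construction), a sub-sampled Newton step with a rank-r spectral truncation could be sketched as:

```python
import numpy as np

def subsampled_newton_step(w, grad, hess_i, n, sample_size, rank, rng):
    """One sub-sampled Newton step with a low-rank Hessian approximation.

    grad      : full (or mini-batch) gradient at w
    hess_i(i) : Hessian contribution of the i-th term, a (p, p) array
    The top-`rank` eigenpairs of the sub-sampled Hessian are kept and the
    remaining spectrum is flattened to the (rank+1)-th eigenvalue, a common
    regularised truncation.
    """
    idx = rng.choice(n, size=sample_size, replace=False)
    H = sum(hess_i(i) for i in idx) / sample_size          # sub-sampled Hessian
    vals, vecs = np.linalg.eigh(H)                         # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]                 # sort descending
    floor = vals[rank]                                     # (rank+1)-th eigenvalue
    approx_vals = np.concatenate([vals[:rank], np.full(len(vals) - rank, floor)])
    H_inv = (vecs / approx_vals) @ vecs.T                  # V diag(1/lam) V^T
    return w - H_inv @ grad
```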
A Stochastic Quasi-Newton Method for Large-Scale Optimization
The question of how to incorporate curvature information in stochastic
approximation methods is challenging. The direct application of classical
quasi-Newton updating techniques for deterministic optimization leads to noisy
curvature estimates that have harmful effects on the robustness of the
iteration. In this paper, we propose a stochastic quasi-Newton method that is
efficient, robust and scalable. It employs the classical BFGS update formula in
its limited memory form, and is based on the observation that it is beneficial
to collect curvature information pointwise, and at regular intervals, through
(sub-sampled) Hessian-vector products. This technique differs from the
classical approach, which would compute differences of gradients and for which
controlling the quality of the curvature estimates can be difficult. We present
numerical results on problems arising in machine learning that suggest that the
proposed method shows much promise.
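A minimal sketch of the central idea, assuming a user-supplied sub-sampled Hessian-vector product routine (names are illustrative, not the authors' implementation):

```python
import numpy as np

def curvature_pair(x_bar_prev, x_bar_curr, hvp_subsampled):
    """Curvature pair built from a sub-sampled Hessian-vector product.

    x_bar_prev, x_bar_curr : averages of the iterates over two consecutive
                             windows of SGD steps
    hvp_subsampled(x, v)   : returns  nabla^2 F_S(x) @ v  on a Hessian sample S
    """
    s = x_bar_curr - x_bar_prev          # displacement between averaged iterates
    y = hvp_subsampled(x_bar_curr, s)    # curvature along s, from the sampled Hessian
    return s, y                          # fed to a limited-memory BFGS update
```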
Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods
This paper presents a finite difference quasi-Newton method for the
minimization of noisy functions. The method takes advantage of the scalability
and power of BFGS updating, and employs an adaptive procedure for choosing the
differencing interval based on the noise estimation techniques of Hamming
(2012) and Moré and Wild (2011). This noise estimation procedure and the
selection of this interval are inexpensive but not always accurate, and to prevent
failures the algorithm incorporates a recovery mechanism that takes appropriate
action in the case when the line search procedure is unable to produce an
acceptable point. A novel convergence analysis is presented that considers the
effect of a noisy line search procedure. Numerical experiments comparing the
method to a function-interpolating trust-region method are presented.
Comment: 26 pages, 9 figures
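As a hedged illustration of why the differencing interval matters, a forward-difference gradient whose interval balances truncation error against an estimated noise level could be sketched as follows (the balancing rule is the textbook one, not necessarily the paper's adaptive procedure):

```python
import numpy as np

def fd_gradient(f, x, noise_level, curvature_bound=1.0):
    """Forward-difference gradient with a noise-aware differencing interval.

    Balancing the truncation error (~ h * |f''| / 2) against the noise
    error (~ 2 * eps_f / h) suggests h ~ 2 * sqrt(eps_f / |f''|).
    """
    h = 2.0 * np.sqrt(noise_level / max(curvature_bound, 1e-12))
    g = np.empty_like(x, dtype=float)
    fx = f(x)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - fx) / h     # noisy forward difference along coordinate i
    return g
```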
Preconditioned Stochastic Gradient Descent
Stochastic gradient descent (SGD) is still the workhorse for many practical
problems. However, it converges slowly and can be difficult to tune. It is
possible to precondition SGD to accelerate its convergence remarkably. But many
attempts in this direction either aim at solving specialized problems, or
result in significantly more complicated methods than SGD. This paper proposes
a new method to estimate a preconditioner such that the amplitudes of
perturbations of the preconditioned stochastic gradient match those of the
perturbations of the parameters to be optimized, in a way comparable to
Newton's method for deterministic optimization. Unlike preconditioners based on
secant equation fitting, as done in deterministic quasi-Newton methods, which
assume a positive definite Hessian and approximate its inverse, the new
preconditioner works equally well for both convex and non-convex optimization
with exact or noisy gradients. When stochastic gradients are used, it can
naturally damp the gradient noise to stabilize SGD. Efficient preconditioner
estimation methods are developed, and with reasonable simplifications, they are
applicable to large-scale problems. Experimental results demonstrate that
equipped with the new preconditioner, without any tuning effort, preconditioned
SGD can efficiently solve many challenging problems, such as the training of a
deep neural network or a recurrent neural network requiring extremely long-term
memory.
Comment: 13 pages, 9 figures. To appear in IEEE Transactions on Neural Networks
and Learning Systems. Supplemental materials at
https://sites.google.com/site/lixilinx/home/psg
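A minimal sketch of the update itself, assuming a preconditioner estimate P is already available (the paper's contribution, fitting P to the perturbation-matching criterion described above, is not reproduced here):

```python
import numpy as np

def preconditioned_sgd_step(theta, stochastic_grad, P, lr=0.01):
    """One preconditioned SGD step: theta <- theta - lr * P @ g.

    P is any positive-definite preconditioner estimate; with P = I this
    reduces to plain SGD, and with P close to an inverse Hessian it mimics
    a Newton-like step.
    """
    return theta - lr * (P @ stochastic_grad)
```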
Variable Metric Stochastic Approximation Theory
We provide a variable metric stochastic approximation theory. In doing so, we
provide a convergence theory for a large class of online variable metric
methods including the recently introduced online versions of the BFGS algorithm
and its limited-memory LBFGS variant. We also discuss the implications of our
results for learning from expert advice.
Comment: Correction of Theorem 3.4 from the AISTATS 2009 article
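For reference, the limited-memory BFGS two-loop recursion on which such online variants are built (a standard construction, not specific to this paper; in the online setting the curvature pairs come from stochastic gradients):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: returns an approximation of H^{-1} @ grad.

    s_list, y_list : recent curvature pairs s_k = x_{k+1} - x_k and
                     y_k = g_{k+1} - g_k (or stochastic analogues).
    """
    q = grad.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    if s_list:                                             # initial scaling H0 = gamma * I
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        q *= gamma
    for (s, y), alpha, rho in zip(zip(s_list, y_list),     # oldest pair first
                                  reversed(alphas), reversed(rhos)):
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return q
```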
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance-reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
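As one concrete instance of the variance-reducing stochastic methods mentioned in the tutorial, a minimal SVRG-style loop could be sketched as follows (grad_i is an assumed per-example gradient callback, not something defined in the tutorial itself):

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.1, outer_iters=10, inner_iters=None, seed=0):
    """Minimal SVRG: stochastic gradients recentred around a full gradient.

    grad_i(w, i) returns the gradient of the i-th loss term at w.  The variance
    of the corrected direction vanishes as w approaches the snapshot point,
    which is what restores a linear rate on strongly convex problems.
    """
    rng = np.random.default_rng(seed)
    inner_iters = inner_iters or 2 * n
    w = w0.copy()
    for _ in range(outer_iters):
        w_snap = w.copy()
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner_iters):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + full_grad   # variance-reduced step
            w -= lr * v
    return w
```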
A fast quasi-Newton-type method for large-scale stochastic optimisation
In recent years there has been increased interest in stochastic
adaptations of limited-memory quasi-Newton methods, which, compared to pure
gradient-based routines, can improve convergence by incorporating second-order
information. In this work we propose a direct least-squares approach
conceptually similar to the limited memory quasi-Newton methods, but that
computes the search direction in a slightly different way. This is achieved in
a fast and numerically robust manner by maintaining a Cholesky factor of low
dimension. This is combined with a stochastic line search relying upon
fulfilment of the Wolfe condition in a backtracking manner, where the step
length is adaptively modified with respect to the optimisation progress. We
support our new algorithm by providing several theoretical results guaranteeing
its performance. The performance is demonstrated on real-world benchmark
problems, showing improved results in comparison with established methods.
Comment: arXiv admin note: substantial text overlap with arXiv:1802.0431
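As a hedged sketch of the kind of backtracking Wolfe line search described above (the constants and the stochastic f/grad callbacks are illustrative assumptions, and shrinking the step alone cannot always enforce the curvature condition):

```python
def backtracking_wolfe(f, grad, x, d, step=1.0, c1=1e-4, c2=0.9, shrink=0.5, max_iter=30):
    """Backtracking line search aiming for the (weak) Wolfe conditions.

    f and grad may be mini-batch (stochastic) estimates; d is assumed to be a
    descent direction.  Returns a step length satisfying sufficient decrease,
    and the curvature condition when it can be attained by shrinking alone.
    """
    fx, gxd = f(x), grad(x) @ d
    alpha = step
    for _ in range(max_iter):
        x_new = x + alpha * d
        armijo = f(x_new) <= fx + c1 * alpha * gxd          # sufficient decrease
        curvature = grad(x_new) @ d >= c2 * gxd             # Wolfe curvature
        if armijo and curvature:
            return alpha
        alpha *= shrink                                      # backtrack
    return alpha
```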
Hybrid optimization and Bayesian inference techniques for a non-smooth radiation detection problem
In this investigation, we propose several algorithms to recover the location
and intensity of a radiation source located in a simulated 250 m x 180 m block
in an urban center based on synthetic measurements. Radioactive decay and
detection are Poisson random processes, so we employ likelihood functions based
on this distribution. Due to the domain geometry and the proposed response
model, the negative logarithm of the likelihood is only piecewise continuously
differentiable and has multiple local minima. To address these
difficulties, we investigate three hybrid algorithms comprised of mixed
optimization techniques. For global optimization, we consider Simulated
Annealing (SA), Particle Swarm (PS) and Genetic Algorithm (GA), which rely
solely on objective function evaluations; i.e., they do not evaluate the
gradient of the objective function. By employing early stopping criteria for
the global optimization methods, a pseudo-optimum point is obtained. This is
subsequently utilized as the initial value by the deterministic Implicit
Filtering method (IF), which is able to find local extrema in non-smooth
functions, to finish the search in a narrow domain. These new hybrid techniques
combining global optimization and Implicit Filtering address difficulties
associated with the non-smooth response, and they are shown to significantly
decrease the computational time relative to the global optimization methods
alone. To quantify uncertainties associated with the source location
and intensity, we employ the Delayed Rejection Adaptive Metropolis (DRAM) and
DiffeRential Evolution Adaptive Metropolis (DREAM) algorithms. Marginal
densities of the source properties are obtained, and the means of the chains
compare accurately with the estimates produced by the hybrid algorithms.
Comment: 36 pages, 14 figures
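Since detection is modelled as a Poisson process, the objective being minimized is a Poisson negative log-likelihood; a minimal sketch, assuming a user-supplied response model expected_counts(theta) for the detector means (names are illustrative, not from the paper):

```python
import numpy as np
from scipy.special import gammaln

def poisson_neg_log_likelihood(theta, counts, expected_counts):
    """Negative log-likelihood of Poisson-distributed detector counts.

    counts                 : observed counts y_i at each detector
    expected_counts(theta) : model prediction lambda_i(theta) > 0, e.g. source
                             intensity attenuated by the urban geometry
    """
    lam = expected_counts(theta)
    return np.sum(lam - counts * np.log(lam) + gammaln(counts + 1.0))
```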