Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
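To make the logistic-regression setting referenced above concrete, the following is a minimal sketch of a plain stochastic gradient method for binary logistic regression; the data, stepsize, and epoch count are illustrative assumptions, not part of the tutorial itself.

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=10, seed=0):
    """Minimal SGD sketch for binary logistic regression.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    Minimizes the average logistic loss with plain stochastic gradient steps.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # predicted probability
            w -= lr * (p - y[i]) * X[i]           # gradient of the per-sample loss
    return w

# toy usage on synthetic, linearly separable data
X = np.random.randn(200, 3)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w_hat = sgd_logistic_regression(X, y)
```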
Primal-Dual Active-Set Methods for Isotonic Regression and Trend Filtering
Isotonic regression (IR) is a non-parametric calibration method used in
supervised learning. For performing large-scale IR, we propose a primal-dual
active-set (PDAS) algorithm which, in contrast to the state-of-the-art Pool
Adjacent Violators (PAV) algorithm, can be parallelized and is easily
warm-started, making it well-suited to online settings. We prove that, like the
PAV algorithm, our PDAS algorithm for IR is convergent and has a work
complexity of O(n), though our numerical experiments suggest that our PDAS
algorithm is often faster than PAV. In addition, we propose PDAS variants (with
safeguarding to ensure convergence) for solving related trend filtering (TF)
problems, providing the results of experiments to illustrate their
effectiveness.
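For reference, here is a minimal sketch of the Pool Adjacent Violators (PAV) baseline mentioned above, for unweighted isotonic regression under squared error; this is the textbook algorithm, not the PDAS method proposed in the paper.

```python
def pav(y):
    """Pool Adjacent Violators for unweighted isotonic regression.

    Returns the nondecreasing fit minimizing the sum of squared deviations from y.
    """
    # each block stores [sum of values, count]; blocks are merged when monotonicity is violated
    blocks = []
    for v in y:
        blocks.append([v, 1])
        # merge while the last block's mean is below the previous block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

print(pav([1.0, 3.0, 2.0, 4.0, 3.5]))  # -> [1.0, 2.5, 2.5, 3.75, 3.75]
```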
Regional Complexity Analysis of Algorithms for Nonconvex Smooth Optimization
A strategy is proposed for characterizing the worst-case performance of
algorithms for solving nonconvex smooth optimization problems. Contemporary
analyses characterize worst-case performance by providing, under certain
assumptions on an objective function, an upper bound on the number of
iterations (or function or derivative evaluations) required until a pth-order
stationarity condition is approximately satisfied. This arguably leads to
conservative characterizations based on anomalous objectives rather than on
ones that are typically encountered in practice. By contrast, the strategy
proposed in this paper characterizes worst-case performance separately over
regions comprising a search space. These regions are defined generically based
on properties of derivative values. In this manner, one can analyze the
worst-case performance of an algorithm independently from any particular class
of objectives. Then, once given a class of objectives, one can obtain an
informative, fine-tuned complexity analysis merely by delineating the types of
regions that comprise the search spaces for functions in the class. Regions
defined by first- and second-order derivatives are discussed in detail and
example complexity analyses are provided for a few fundamental first- and
second-order algorithms when employed to minimize convex and nonconvex
objectives of interest. It is also explained how the strategy can be
generalized to regions defined by higher-order derivatives and for analyzing
the behavior of higher-order algorithms.
Exploiting Negative Curvature in Deterministic and Stochastic Optimization
This paper addresses the question of whether it can be beneficial for an
optimization algorithm to follow directions of negative curvature. Although
prior work has established convergence results for algorithms that integrate
both descent and negative curvature steps, there has not yet been extensive
numerical evidence showing that such methods offer consistent performance
improvements. In this paper, we present new frameworks for combining descent
and negative curvature directions: alternating two-step approaches and dynamic
step approaches. The aspect that distinguishes our approaches from ones
previously proposed is that they make algorithmic decisions based on
(estimated) upper-bounding models of the objective function. A consequence of
this aspect is that our frameworks can, in theory, employ fixed stepsizes,
which makes the methods readily translatable from deterministic to stochastic
settings. For deterministic problems, we show that instances of our dynamic
framework yield gains in performance compared to related methods that only
follow descent steps. We also show that gains can be made in a stochastic
setting in cases when a standard stochastic-gradient-type method might make
slow progress.
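A rough sketch of the alternating two-step idea described above, assuming exact gradients and Hessians: take a gradient descent step, then a step along a direction of negative curvature whenever the Hessian has a negative eigenvalue. The stepsize rules and the eigenvalue computation here are illustrative choices, not the upper-bounding-model conditions used in the paper.

```python
import numpy as np

def descent_negative_curvature(grad, hess, x0, alpha=0.1, beta=0.1, iters=100):
    """Alternate a gradient descent step with a negative curvature step (sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)                  # descent step
        lam, V = np.linalg.eigh(hess(x))         # eigen-decomposition of the Hessian
        if lam[0] < 0:                           # leftmost eigenvalue is negative
            d = V[:, 0]
            if grad(x) @ d > 0:                  # orient d to be a non-ascent direction
                d = -d
            x = x + beta * abs(lam[0]) * d       # negative curvature step
    return x

# toy usage: f(x) = x0^2 - x1^2 + x1^4 has a saddle point at the origin
grad = lambda x: np.array([2 * x[0], -2 * x[1] + 4 * x[1]**3])
hess = lambda x: np.array([[2.0, 0.0], [0.0, -2.0 + 12 * x[1]**2]])
print(descent_negative_curvature(grad, hess, [1.0, 0.01]))
```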
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research: techniques that diminish noise in the stochastic
directions and methods that make use of second-order derivative approximations.
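As an illustration of the noise-diminishing stream of research mentioned above, the following sketches an SVRG-style variance-reduced gradient loop for a finite-sum objective; SVRG is used here as a generic example, and the stepsize and iteration counts are assumptions rather than recommendations from the paper.

```python
import numpy as np

def svrg(grad_i, n, w0, lr=0.02, outer_iters=20, inner_iters=None, seed=0):
    """SVRG-style variance-reduced stochastic gradient sketch.

    grad_i(w, i): gradient of the i-th component function at w.
    n: number of component functions in the finite sum.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    m = inner_iters or n
    for _ in range(outer_iters):
        w_ref = w.copy()
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n  # full gradient at reference point
        for _ in range(m):
            i = rng.integers(n)
            # unbiased direction whose variance shrinks as w approaches w_ref
            g = grad_i(w, i) - grad_i(w_ref, i) + full_grad
            w -= lr * g
    return w

# toy usage: least squares, f_i(w) = 0.5 * (a_i @ w - b_i)^2
A = np.random.randn(100, 5)
x_true = np.random.randn(5)
b = A @ x_true
grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
print(np.linalg.norm(svrg(grad_i, 100, np.zeros(5)) - x_true))
```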
ADMM for Multiaffine Constrained Optimization
We expand the scope of the alternating direction method of multipliers
(ADMM). Specifically, we show that ADMM, when employed to solve problems with
multiaffine constraints that satisfy certain verifiable assumptions, converges
to the set of constrained stationary points if the penalty parameter in the
augmented Lagrangian is sufficiently large. When the Kurdyka-\L{}ojasiewicz
(K-\L{}) property holds, this is strengthened to convergence to a single
constrained stationary point. Our analysis applies under assumptions that we
have endeavored to make as weak as possible. It applies to problems that
involve nonconvex and/or nonsmooth objective terms, in addition to the
multiaffine constraints that can involve multiple (three or more) blocks of
variables. To illustrate the applicability of our results, we describe examples
including nonnegative matrix factorization, sparse learning, risk parity
portfolio selection, nonconvex formulations of convex problems, and neural
network training. In each case, our ADMM approach encounters only subproblems
that have closed-form solutions.
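For background, the classical two-block ADMM iteration that the multiaffine analysis generalizes can be sketched on the lasso problem, where both subproblems have closed-form solutions; the multiaffine setting of the paper involves additional blocks and coupled constraints not shown here, and the penalty parameter below is an illustrative choice.

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """Two-block ADMM sketch for the lasso: min 0.5||Ax - b||^2 + lam*||z||_1 s.t. x = z."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)       # u is the scaled dual variable
    AtA = A.T @ A; Atb = A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))                 # factor once, reuse every iteration
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))                        # x-update: ridge-type linear solve
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # z-update: soft-thresholding
        u = u + x - z                                        # dual update on the constraint x - z = 0
    return z
```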
A Stochastic Trust Region Algorithm Based on Careful Step Normalization
An algorithm is proposed for solving stochastic and finite sum minimization
problems. Based on a trust region methodology, the algorithm employs normalized
steps, at least as long as the norms of the stochastic gradient estimates are
within a specified interval. The complete algorithm---which dynamically chooses
whether or not to employ normalized steps---is proved to have convergence
guarantees that are similar to those possessed by a traditional stochastic
gradient approach under various sets of conditions related to the accuracy of
the stochastic gradient estimates and choice of stepsize sequence. The results
of numerical experiments are presented when the method is employed to minimize
convex and nonconvex machine learning test problems. These results illustrate
that the method can outperform a traditional stochastic gradient approach.
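A loose sketch of the kind of update described above, under the assumption that the method normalizes the step whenever the stochastic gradient norm falls in a prescribed interval and otherwise takes a plain stochastic gradient step; the interval and the switching rule here are illustrative, not the paper's exact conditions.

```python
import numpy as np

def normalized_step_sg(grad_est, w0, alphas, low=1e-3, high=1e3):
    """Stochastic method sketch that normalizes steps based on the gradient norm.

    grad_est(w, k): stochastic gradient estimate at iterate w, iteration k.
    alphas: sequence of stepsizes.
    """
    w = np.array(w0, dtype=float)
    for k, alpha in enumerate(alphas):
        g = grad_est(w, k)
        norm = np.linalg.norm(g)
        if low <= norm <= high:
            w -= alpha * g / norm      # normalized (trust-region-like) step of length alpha
        else:
            w -= alpha * g             # fall back to a plain stochastic gradient step
    return w
```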
A Reduced-Space Algorithm for Minimizing $\ell_1$-Regularized Convex Functions
We present a new method for minimizing the sum of a differentiable convex
function and an $\ell_1$-norm regularizer. The main features of the new method
include: an evolving set of indices corresponding to variables that are
predicted to be nonzero at a solution (i.e., the support); a
reduced-space subproblem defined in terms of the predicted support;
conditions that determine how accurately each subproblem must be solved, which
allow for Newton, Newton-CG, and coordinate-descent techniques to be employed;
a computationally practical condition that determines when the predicted
support should be updated; and a reduced proximal gradient step that
ensures sufficient decrease in the objective function when it is decided that
variables should be added to the predicted support. We prove a convergence
guarantee for our method and demonstrate its efficiency on a large set of model
prediction problems.
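To illustrate the proximal-gradient ingredient and the notion of a predicted support, here is a minimal full-space proximal gradient (soft-thresholding) sketch for an $\ell_1$-regularized objective; the reduced-space subproblems and accuracy conditions of the paper are not reproduced, and the stepsize is an assumption.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_l1(grad, w0, lam, lr=0.1, iters=200):
    """Proximal gradient sketch for min f(w) + lam*||w||_1, reporting the predicted support.

    grad: gradient of the smooth convex term f.  The reduced-space method goes further
    by solving subproblems restricted to the predicted support; this sketch only shows
    the full-space proximal step and the support it identifies.
    """
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = soft_threshold(w - lr * grad(w), lr * lam)
    support = np.nonzero(w)[0]   # indices predicted to be nonzero at a solution
    return w, support
```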
Complexity Analysis of a Trust Funnel Algorithm for Equality Constrained Optimization
A method is proposed for solving equality constrained nonlinear optimization
problems involving twice continuously differentiable functions. The method
employs a trust funnel approach consisting of two phases: a first phase to
locate an $\epsilon$-feasible point and a second phase to seek optimality while
maintaining at least $\epsilon$-feasibility. A two-phase approach of this kind
based on a cubic regularization methodology was recently proposed along with a
supporting worst-case iteration complexity analysis. Unfortunately, however, in
that approach, the objective function is completely ignored in the first phase
when $\epsilon$-feasibility is sought. The main contribution of the method
proposed in this paper is that the same worst-case iteration complexity is
achieved, but with a first phase that also accounts for improvements in the
objective function. As such, the method typically requires fewer iterations in
the second phase, as the results of numerical experiments demonstrate.
An Inexact Regularized Newton Framework with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization
An algorithm for solving smooth nonconvex optimization problems is proposed
that, in the worst case, takes $\mathcal{O}(\epsilon^{-3/2})$ iterations to
drive the norm of the gradient of the objective function below a prescribed
positive real number $\epsilon$ and can take $\mathcal{O}(\epsilon^{-3})$
iterations to drive the leftmost eigenvalue of the Hessian of the objective
above $-\epsilon$. The proposed algorithm is a general framework that covers a
wide range of techniques including quadratically and cubically regularized
Newton methods, such as the Adaptive Regularisation using Cubics (ARC) method
and the recently proposed Trust-Region Algorithm with Contractions and
Expansions (TRACE). The generality of our method is achieved through the
introduction of generic conditions that each trial step is required to satisfy,
which in particular allow for inexact regularized Newton steps to be used.
These conditions center around a new subproblem that can be approximately
solved to obtain trial steps that satisfy the conditions. A new instance of the
framework, distinct from ARC and TRACE, is described that may be viewed as a
hybrid between quadratically and cubically regularized Newton methods.
Numerical results demonstrate that our hybrid algorithm outperforms a cubically
regularized Newton method.
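As a minimal illustration of the basic building block involved, a single quadratically regularized Newton step can be sketched as follows; the framework's inexactness conditions and its cubic-regularization variants are not reflected here, and the regularization value is an illustrative assumption.

```python
import numpy as np

def regularized_newton_step(g, H, sigma):
    """One quadratically regularized Newton step: d solves (H + sigma*I) d = -g.

    A basic building block of quadratically/cubically regularized Newton methods;
    regularized Newton frameworks of this kind may allow such steps to be computed
    only inexactly, subject to conditions not reproduced here.
    """
    n = g.shape[0]
    return np.linalg.solve(H + sigma * np.eye(n), -g)

# toy usage: gradient and Hessian of f(x) = x0^2 - x1^2 + x1^4 at x = (1, 1)
g = np.array([2.0, 2.0])
H = np.array([[2.0, 0.0], [0.0, 10.0]])
print(regularized_newton_step(g, H, sigma=1.0))
```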