4,547 research outputs found
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance-reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
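The logistic-regression setting this tutorial starts from can be sketched in a few lines. The following is a hypothetical minimal example (plain gradient descent on the average log-loss, synthetic data), not code from the tutorial:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, iters=500):
    """Plain gradient descent on the average logistic loss."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        p = sigmoid(X @ w)            # predicted probabilities
        grad = X.T @ (p - y) / n      # gradient of the average log-loss
        w -= lr * grad
    return w

# Toy data: the label is 1 exactly when the single feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)
w = train_logreg(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
```

The same gradient expression is the building block for the stochastic and variance-reduced variants the tutorial surveys; they differ only in which rows of X enter each gradient evaluation.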
Stochastic Trust Region Methods with Trust Region Radius Depending on Probabilistic Models
We present a stochastic trust-region model-based framework in which the
trust-region radius is tied to the probabilistic models. In particular, we
propose a specific algorithm, termed STRME, in which the trust-region radius
depends linearly on the norm of the latest model gradient. The complexity of
the STRME method is analyzed in the nonconvex, convex, and strongly convex
settings, and matches that of existing algorithms based on probabilistic
models. In addition, several numerical experiments are carried out to show the
benefits of the proposed method compared to existing stochastic trust-region
methods and other relevant stochastic gradient methods.
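The core idea, a trust-region radius proportional to the latest gradient norm with accept/reject updates, can be illustrated with a toy deterministic loop. This is only an illustrative sketch of the radius rule, not the authors' STRME algorithm (which uses probabilistic models and a stochastic analysis):

```python
import numpy as np

def tr_loop(f, grad, x, eta=1.0, iters=100):
    """Toy deterministic trust-region loop whose radius is linear in the
    current gradient norm (the idea behind the STRME radius rule)."""
    for _ in range(iters):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-12:
            break
        radius = eta * gnorm                 # radius ~ latest gradient norm
        step = -(radius / gnorm) * g         # Cauchy-like step to the boundary
        pred = radius * gnorm                # decrease predicted by the linear model
        rho = (f(x) - f(x + step)) / pred    # actual vs. predicted decrease
        if rho > 0.1:
            x = x + step                     # accept the step
        else:
            eta *= 0.5                       # reject: shrink the radius multiplier
    return x

f = lambda x: 0.5 * float(np.sum(x ** 2))    # simple strongly convex test problem
grad = lambda x: x
x = tr_loop(f, grad, np.array([3.0, -4.0]))
```

In the stochastic setting of the paper, grad would return a model gradient built from sampled function values, and the acceptance test would hold only with some probability.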
A Semismooth Newton Method for Support Vector Classification and Regression
The support vector machine is an important and fundamental technique in machine
learning. In this paper, we apply a semismooth Newton method to solve two
typical SVM models: the L2-loss SVC model and the \epsilon-L2-loss SVR model.
The semismooth Newton method is widely used in the optimization community; it
is commonly believed to enjoy a fast convergence rate but to suffer from high
computational complexity. Our contribution in this paper is that, by exploiting
the sparse structure of the models, we significantly reduce the computational
complexity while keeping the quadratic convergence rate. Extensive numerical
experiments demonstrate the outstanding performance of the semismooth Newton
method, especially on problems with very large sample sizes (for the
news20.binary problem with 19996 features and 1355191 samples, it takes only
three seconds). In particular, for the \epsilon-L2-loss SVR model, the
semismooth Newton method significantly outperforms leading solvers, including
DCD and TRON.
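A minimal sketch of a semismooth Newton iteration for the L2-loss SVC objective follows, assuming the usual formulation f(w) = 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*x_i.w)^2. Only samples violating the margin enter the gradient and generalized Hessian, which is the kind of structure the abstract's sparsity argument refers to; this toy loop omits the paper's sparse linear algebra and globalization safeguards:

```python
import numpy as np

def l2svc_newton(X, y, C=1.0, iters=20):
    """Semismooth Newton sketch for the L2-loss SVC
    f(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * x_i.w)^2.
    Only the active set A (margin violated) contributes to the
    gradient and the generalized Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        m = 1.0 - y * (X @ w)                      # margins
        A = m > 0                                  # active set
        grad = w - 2.0 * C * X[A].T @ (y[A] * m[A])
        H = np.eye(d) + 2.0 * C * X[A].T @ X[A]    # generalized Hessian
        w = w - np.linalg.solve(H, grad)
    return w

# Toy separable data: the label is the sign of the single feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = l2svc_newton(X, y)
acc = float(np.mean(np.sign(X @ w) == y))
```

Because max(0, .)^2 is once but not twice differentiable, H is an element of the generalized Jacobian rather than a classical Hessian, which is what makes the method "semismooth" Newton.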
A Unified Batch Online Learning Framework for Click Prediction
We present a unified framework for Batch Online Learning (OL) for click
prediction in search advertisement. Machine learning models, once deployed,
show non-trivial accuracy and calibration degradation over time due to model
staleness. It is therefore necessary to update models regularly and
automatically. This paper presents two paradigms of Batch Online Learning: one
incrementally updates the model parameters via an early-stopping mechanism, and
the other does so through proximal regularization. We argue that both schemes
naturally trade off between old and new data. We then show, theoretically and
empirically, that these two seemingly different schemes are closely related.
Through extensive experiments, we demonstrate the utility of our OL framework,
how the two OL schemes relate to each other, and how they trade off between new
and historical data. We then compare batch OL to full model retrains and show
that online learning is more robust to data issues. We also examine the
long-term impact of online learning, the role of the initial model in OL, and
the impact of delays in the update, and conclude with implementation details
and challenges in deploying a real-world online learning system in production.
While this paper focuses mostly on click prediction for search advertisement,
we hope that the lessons learned here carry over to other problem domains.
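The proximal-regularization paradigm can be sketched as follows. This is a hypothetical minimal example with a logistic-loss model and synthetic data, not the paper's production system: the new batch is fit while a quadratic penalty anchors the parameters to the deployed model, and the penalty weight is exactly the old-versus-new trade-off the abstract describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def proximal_batch_update(w_old, X, y, lam, lr=0.1, iters=500):
    """Update a deployed model on a new batch while staying close to it:
    minimize  logloss(w; X, y) + (lam / 2) * ||w - w_old||^2.
    Small lam trusts the new data; large lam trusts the old model."""
    w = w_old.copy()
    n = len(y)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + lam * (w - w_old)
        w -= lr * grad
    return w

# New batch of toy data; the deployed model is all-zeros.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
y = (X[:, 0] > 0).astype(float)
w_old = np.zeros(1)
w_loose = proximal_batch_update(w_old, X, y, lam=0.01)           # follows new data
w_tight = proximal_batch_update(w_old, X, y, lam=10.0, lr=0.01)  # stays near old model
```

The early-stopping paradigm achieves a similar anchoring implicitly: truncating the inner optimization keeps the iterate near its warm start, which is one way to see why the paper finds the two schemes closely related.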
Parallel Coordinate Descent Newton Method for Efficient L1-Regularized Minimization
Recent years have witnessed advances in parallel algorithms for large-scale
optimization problems. Notwithstanding demonstrated success, existing
algorithms that parallelize over features are usually limited by divergence
issues under high parallelism or require data preprocessing to alleviate these
problems. In this work, we propose a Parallel Coordinate Descent Newton
algorithm using multidimensional approximate Newton steps (PCDN), where the
off-diagonal elements of the Hessian are set to zero to enable parallelization.
It randomly partitions the feature set into bundles of size P and processes
each bundle sequentially, first computing the descent direction for each
feature in the bundle in parallel and then conducting a P-dimensional line
search to obtain the step size. We show that (1) PCDN is guaranteed to converge
globally despite increasing parallelism, and (2) PCDN converges to the
specified accuracy within a bounded number of iterations, and this bound
decreases with increasing parallelism (bundle size P). Using the implementation
technique of maintaining intermediate quantities, we minimize the data transfer
and synchronization cost of the P-dimensional line search. For concreteness,
the proposed PCDN algorithm is applied to L1-regularized logistic regression
and L2-loss SVM. Experimental evaluations on six benchmark datasets show that
the proposed PCDN algorithm exploits parallelism well and outperforms
state-of-the-art methods in speed without losing accuracy.
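The bundle-wise scheme can be sketched on a toy unregularized logistic-regression problem. This is a hypothetical serial simulation (the per-coordinate Newton directions below are merely vectorized, not truly parallel, and the L1 term and its shrinkage step are omitted), not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcdn_like(X, y, bundle=2, epochs=30):
    """For each bundle of coordinates: compute every coordinate's 1-D
    Newton direction independently (off-diagonal Hessian terms are
    ignored, which is what makes the directions parallelizable), then
    run one backtracking line search along the combined direction."""
    n, d = X.shape
    w = np.zeros(d)

    def loss(w):
        s = 2.0 * y - 1.0                           # labels in {-1, +1}
        return float(np.mean(np.log1p(np.exp(-s * (X @ w)))))

    for _ in range(epochs):
        for start in range(0, d, bundle):
            idx = np.arange(start, min(start + bundle, d))
            p = sigmoid(X @ w)
            g = X[:, idx].T @ (p - y) / n               # bundle gradients
            h = (X[:, idx] ** 2).T @ (p * (1 - p)) / n  # diagonal Hessian entries
            direction = np.zeros(d)
            direction[idx] = -g / (h + 1e-12)
            t, base = 1.0, loss(w)                      # backtracking line search
            while loss(w + t * direction) > base and t > 1e-8:
                t *= 0.5
            w = w + t * direction
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(float)
w = pcdn_like(X, y)
acc = float(np.mean((X @ w > 0) == (y == 1)))
```

The shared line search over the combined direction is what restores global convergence despite the zeroed off-diagonal Hessian entries; without it, large bundles can diverge, which is the failure mode the abstract attributes to prior feature-parallel methods.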
Vandalism Detection in Wikipedia: a Bag-of-Words Classifier Approach
A bag-of-words based probabilistic classifier is trained using regularized
logistic regression to detect vandalism in the English Wikipedia. Isotonic
regression is used to calibrate the class-membership probabilities.
Learning-curve, reliability, ROC, and cost analyses are performed.
Comment: 15 pages, 5 figures
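Isotonic-regression calibration reduces to the pool-adjacent-violators (PAV) algorithm applied to the 0/1 labels sorted by classifier score. A minimal stdlib sketch with made-up scores and labels, not the paper's code:

```python
def pav(values):
    """Pool-adjacent-violators: the least-squares monotone (isotonic) fit."""
    blocks = []                       # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Calibration: sort examples by raw classifier score and run PAV on the
# labels; the fitted values are monotone probability estimates.
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
order = sorted(range(len(scores)), key=lambda i: scores[i])
fitted = pav([labels[i] for i in order])
```

Each fitted value is the mean label within a pooled block, so the output is automatically a non-decreasing step function of the score, which is exactly the calibration map isotonic regression provides.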
Fast Black-box Variational Inference through Stochastic Trust-Region Optimization
We introduce TrustVI, a fast second-order algorithm for black-box variational
inference based on trust-region optimization and the reparameterization trick.
At each iteration, TrustVI proposes and assesses a step based on minibatches of
draws from the variational distribution. The algorithm provably converges to a
stationary point. We implemented TrustVI in the Stan framework and compared it
to two alternatives: Automatic Differentiation Variational Inference (ADVI) and
Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is
based on stochastic first-order optimization. The latter uses second-order
information, but lacks convergence guarantees. TrustVI typically converged at
least one order of magnitude faster than ADVI, demonstrating the value of
stochastic second-order information. TrustVI often found substantially better
variational distributions than HFSGVI, demonstrating that our convergence
theory can matter in practice.
Comment: NIPS 2017 camera-ready
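The reparameterization trick that TrustVI builds on can be sketched for a Gaussian variational family. This toy stochastic-gradient loop has no trust-region safeguards and only illustrates the gradient estimator, not TrustVI itself; the target density and step sizes are made up for the example:

```python
import numpy as np

def elbo_grad(mu, log_sigma, dlogp, rng, n_draws=64):
    """Reparameterization-trick gradient of the ELBO for q = N(mu, sigma^2):
    draw eps ~ N(0, 1), set z = mu + sigma * eps, and differentiate
    E[log p(z)] through z; the Gaussian entropy term is exact (the +1)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_draws)
    g = dlogp(mu + sigma * eps)                 # d log p / dz at the draws
    return g.mean(), (g * eps).mean() * sigma + 1.0

# Fit q to the target p = N(2, 1), for which d log p / dz = -(z - 2).
rng = np.random.default_rng(3)
mu, log_sigma = 0.0, -1.0
for _ in range(2000):
    g_mu, g_ls = elbo_grad(mu, log_sigma, lambda z: -(z - 2.0), rng)
    mu += 0.05 * g_mu
    log_sigma += 0.05 * g_ls
```

ADVI uses this estimator with first-order updates; TrustVI wraps second-order steps on the same minibatch draws inside a trust region, which is where its convergence guarantee comes from.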
Data augmentation for non-Gaussian regression models using variance-mean mixtures
We use the theory of normal variance-mean mixtures to derive a
data-augmentation scheme for a class of common regularization problems. This
generalizes existing theory on normal variance mixtures for priors in
regression and classification. It also allows variants of the
expectation-maximization algorithm to be brought to bear on a wider range of
models than previously appreciated. We demonstrate the method on several
examples, including sparse quantile regression and binary logistic regression.
We also show that quasi-Newton acceleration can substantially improve the speed
of the algorithm without compromising its robustness.
Comment: Added a discussion of quasi-Newton acceleration
A comparison of linear and non-linear calibrations for speaker recognition
In recent work on both generative and discriminative score to
log-likelihood-ratio calibration, it was shown that linear transforms give good
accuracy only for a limited range of operating points. Moreover, these methods
required tailoring of the calibration training objective functions in order to
target the desired region of best accuracy. Here, we generalize the linear
recipes to non-linear ones. We experiment with a non-linear, non-parametric,
discriminative PAV solution, as well as parametric, generative,
maximum-likelihood solutions that use Gaussian, Student's T and
normal-inverse-Gaussian score distributions. Experiments on NIST SRE'12 scores
suggest that the non-linear methods provide wider ranges of optimal accuracy
and can be trained without having to resort to objective-function tailoring.
Comment: accepted for Odyssey 2014: The Speaker and Language Recognition Workshop
Indefinite Kernel Logistic Regression with Concave-inexact-convex Procedure
In kernel methods, the kernels are often required to be positive definite,
which rules out many indefinite kernels. To accommodate non-positive-definite
kernels, in this paper we build an indefinite kernel learning framework for
kernel logistic regression. The proposed indefinite kernel logistic regression
(IKLR) model is analysed in Reproducing Kernel Kreĭn Spaces (RKKS) and is
therefore non-convex. Using the positive decomposition of a non-positive-definite
kernel, the derived IKLR model can be decomposed into the difference of two
convex functions. Accordingly, a concave-convex procedure is introduced to
solve the non-convex optimization problem. Since the concave-convex procedure
has to solve a sub-problem in each iteration, we propose a
concave-inexact-convex procedure (CCICP) with an inexact solving scheme to
accelerate the solving process. In addition, we propose a stochastic variant of
CCICP that efficiently obtains a proximal solution, serving a similar purpose
to the inexact scheme in CCICP. We provide convergence analyses for both
variants, so the method works effectively not only in a deterministic setting
but also in a stochastic setting. Experimental results on several benchmarks
suggest that the proposed IKLR model performs favorably against standard
(positive-definite) kernel logistic regression and other competitive
indefinite-learning-based algorithms.
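The concave-convex procedure underlying CCICP can be illustrated on a one-dimensional difference-of-convex toy problem. This sketch uses an exactly solvable subproblem for clarity; the paper's CCICP instead solves its subproblems inexactly, and its stochastic variant samples them:

```python
def cccp(u_prime_inv, v_prime, x0, iters=60):
    """Concave-convex procedure for f = u - v with u, v convex:
    replace v by its linearization at x_k and solve the convex
    subproblem u'(x) = v'(x_k), here via a closed-form inverse of u'."""
    x = x0
    for _ in range(iters):
        x = u_prime_inv(v_prime(x))   # argmin_x  u(x) - v'(x_k) * x
    return x

# Toy DC objective f(x) = x**4 - 2*x**2, split as
# u(x) = x**4 (so u'(x) = 4*x**3) and v(x) = 2*x**2 (so v'(x) = 4*x).
def u_prime_inv(g):                   # solve 4*x**3 = g for x
    return (g / 4.0) ** (1.0 / 3.0) if g >= 0 else -((-g / 4.0) ** (1.0 / 3.0))

x_star = cccp(u_prime_inv, lambda x: 4.0 * x, x0=0.1)
```

Each subproblem majorizes f, so the objective is non-increasing along the iterates; here they converge to the local minimizer x = 1 of x**4 - 2*x**2. In IKLR, u and v come from the positive decomposition of the indefinite kernel, and each subproblem is a convex kernel logistic regression.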