4,547 research outputs found
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance-reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
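The logistic-regression setting this tutorial starts from can be sketched in a few lines. The following is a hypothetical minimal example (plain gradient descent on the average log-loss, synthetic data), not code from the tutorial:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, iters=500):
    """Plain gradient descent on the average logistic loss."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        p = sigmoid(X @ w)            # predicted probabilities
        grad = X.T @ (p - y) / n      # gradient of the average log-loss
        w -= lr * grad
    return w

# Toy data: the label is 1 exactly when the single feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)
w = train_logreg(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
```

The same gradient expression is the building block for the stochastic and variance-reduced variants the tutorial surveys; they differ only in which rows of X enter each gradient evaluation.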
Stochastic Trust Region Methods with Trust Region Radius Depending on Probabilistic Models
We present a stochastic trust-region model-based framework in which the
trust-region radius is tied to the probabilistic models. In particular, we
propose a specific algorithm, termed STRME, in which the trust-region radius
depends linearly on the norm of the latest model gradient. The complexity of
the STRME method is analyzed in the nonconvex, convex, and strongly convex
settings, and matches that of existing algorithms based on probabilistic
models. In addition, several numerical experiments are carried out to show the
benefits of the proposed method compared to existing stochastic trust-region
methods and other relevant stochastic gradient methods.
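The core idea, a trust-region radius proportional to the latest gradient norm with accept/reject updates, can be illustrated with a toy deterministic loop. This is only an illustrative sketch of the radius rule, not the authors' STRME algorithm (which uses probabilistic models and a stochastic analysis):

```python
import numpy as np

def tr_loop(f, grad, x, eta=1.0, iters=100):
    """Toy deterministic trust-region loop whose radius is linear in the
    current gradient norm (the idea behind the STRME radius rule)."""
    for _ in range(iters):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-12:
            break
        radius = eta * gnorm                 # radius ~ latest gradient norm
        step = -(radius / gnorm) * g         # Cauchy-like step to the boundary
        pred = radius * gnorm                # decrease predicted by the linear model
        rho = (f(x) - f(x + step)) / pred    # actual vs. predicted decrease
        if rho > 0.1:
            x = x + step                     # accept the step
        else:
            eta *= 0.5                       # reject: shrink the radius multiplier
    return x

f = lambda x: 0.5 * float(np.sum(x ** 2))    # simple strongly convex test problem
grad = lambda x: x
x = tr_loop(f, grad, np.array([3.0, -4.0]))
```

In the stochastic setting of the paper, grad would return a model gradient built from sampled function values, and the acceptance test would hold only with some probability.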
A Semismooth Newton Method for Support Vector Classification and Regression
The support vector machine is an important and fundamental technique in machine
learning. In this paper, we apply a semismooth Newton method to solve two
typical SVM models: the L2-loss SVC model and the \epsilon-L2-loss SVR model.
The semismooth Newton method is widely used in the optimization community; it
is commonly believed to enjoy a fast convergence rate but to suffer from high
computational complexity. Our contribution in this paper is that, by exploiting
the sparse structure of the models, we significantly reduce the computational
complexity while keeping the quadratic convergence rate. Extensive numerical
experiments demonstrate the outstanding performance of the semismooth Newton
method, especially on problems with very large sample sizes (for the
news20.binary problem with 19996 features and 1355191 samples, it takes only
three seconds). In particular, for the \epsilon-L2-loss SVR model, the
semismooth Newton method significantly outperforms leading solvers, including
DCD and TRON.
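A minimal sketch of a semismooth Newton iteration for the L2-loss SVC objective follows, assuming the usual formulation f(w) = 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*x_i.w)^2. Only samples violating the margin enter the gradient and generalized Hessian, which is the kind of structure the abstract's sparsity argument refers to; this toy loop omits the paper's sparse linear algebra and globalization safeguards:

```python
import numpy as np

def l2svc_newton(X, y, C=1.0, iters=20):
    """Semismooth Newton sketch for the L2-loss SVC
    f(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * x_i.w)^2.
    Only the active set A (margin violated) contributes to the
    gradient and the generalized Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        m = 1.0 - y * (X @ w)                      # margins
        A = m > 0                                  # active set
        grad = w - 2.0 * C * X[A].T @ (y[A] * m[A])
        H = np.eye(d) + 2.0 * C * X[A].T @ X[A]    # generalized Hessian
        w = w - np.linalg.solve(H, grad)
    return w

# Toy separable data: the label is the sign of the single feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = l2svc_newton(X, y)
acc = float(np.mean(np.sign(X @ w) == y))
```

Because max(0, .)^2 is once but not twice differentiable, H is an element of the generalized Jacobian rather than a classical Hessian, which is what makes the method "semismooth" Newton.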
A Unified Batch Online Learning Framework for Click Prediction
We present a unified framework for Batch Online Learning (OL) for click
prediction in search advertisement. Machine learning models, once deployed,
show non-trivial accuracy and calibration degradation over time due to model
staleness. It is therefore necessary to update models regularly and
automatically. This paper presents two paradigms of Batch Online Learning: one
incrementally updates the model parameters via an early-stopping mechanism, and
the other does so through proximal regularization. We argue that both schemes
naturally trade off between old and new data. We then show, theoretically and
empirically, that these two seemingly different schemes are closely related.
Through extensive experiments, we demonstrate the utility of our OL framework,
how the two OL schemes relate to each other, and how they trade off between new
and historical data. We then compare batch OL to full model retrains and show
that online learning is more robust to data issues. We also examine the
long-term impact of online learning, the role of the initial model in OL, and
the impact of delays in the update, and conclude with implementation details
and challenges in deploying a real-world online learning system in production.
While this paper focuses mostly on click prediction for search advertisement,
we hope that the lessons learned here carry over to other problem domains.
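The proximal-regularization paradigm can be sketched as follows. This is a hypothetical minimal example with a logistic-loss model and synthetic data, not the paper's production system: the new batch is fit while a quadratic penalty anchors the parameters to the deployed model, and the penalty weight is exactly the old-versus-new trade-off the abstract describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def proximal_batch_update(w_old, X, y, lam, lr=0.1, iters=500):
    """Update a deployed model on a new batch while staying close to it:
    minimize  logloss(w; X, y) + (lam / 2) * ||w - w_old||^2.
    Small lam trusts the new data; large lam trusts the old model."""
    w = w_old.copy()
    n = len(y)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + lam * (w - w_old)
        w -= lr * grad
    return w

# New batch of toy data; the deployed model is all-zeros.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
y = (X[:, 0] > 0).astype(float)
w_old = np.zeros(1)
w_loose = proximal_batch_update(w_old, X, y, lam=0.01)           # follows new data
w_tight = proximal_batch_update(w_old, X, y, lam=10.0, lr=0.01)  # stays near old model
```

The early-stopping paradigm achieves a similar anchoring implicitly: truncating the inner optimization keeps the iterate near its warm start, which is one way to see why the paper finds the two schemes closely related.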
Parallel Coordinate Descent Newton Method for Efficient L1-Regularized Minimization
Recent years have witnessed advances in parallel algorithms for large-scale
optimization problems. Notwithstanding demonstrated success, existing
algorithms that parallelize over features are usually limited by divergence
issues under high parallelism or require data preprocessing to alleviate these
problems. In this work, we propose a Parallel Coordinate Descent Newton
algorithm using multidimensional approximate Newton steps (PCDN), where the
off-diagonal elements of the Hessian are set to zero to enable parallelization.
It randomly partitions the feature set into bundles of size P and processes
each bundle sequentially, first computing the descent direction for each
feature in the bundle in parallel and then conducting a P-dimensional line
search to obtain the step size. We show that (1) PCDN is guaranteed to converge
globally despite increasing parallelism, and (2) PCDN converges to the
specified accuracy within a bounded number of iterations, and this bound
decreases with increasing parallelism (bundle size P). Using the implementation
technique of maintaining intermediate quantities, we minimize the data transfer
and synchronization cost of the P-dimensional line search. For concreteness,
the proposed PCDN algorithm is applied to L1-regularized logistic regression
and L2-loss SVM. Experimental evaluations on six benchmark datasets show that
the proposed PCDN algorithm exploits parallelism well and outperforms
state-of-the-art methods in speed without losing accuracy.
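The bundle-wise scheme can be sketched on a toy unregularized logistic-regression problem. This is a hypothetical serial simulation (the per-coordinate Newton directions below are merely vectorized, not truly parallel, and the L1 term and its shrinkage step are omitted), not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcdn_like(X, y, bundle=2, epochs=30):
    """For each bundle of coordinates: compute every coordinate's 1-D
    Newton direction independently (off-diagonal Hessian terms are
    ignored, which is what makes the directions parallelizable), then
    run one backtracking line search along the combined direction."""
    n, d = X.shape
    w = np.zeros(d)

    def loss(w):
        s = 2.0 * y - 1.0                           # labels in {-1, +1}
        return float(np.mean(np.log1p(np.exp(-s * (X @ w)))))

    for _ in range(epochs):
        for start in range(0, d, bundle):
            idx = np.arange(start, min(start + bundle, d))
            p = sigmoid(X @ w)
            g = X[:, idx].T @ (p - y) / n               # bundle gradients
            h = (X[:, idx] ** 2).T @ (p * (1 - p)) / n  # diagonal Hessian entries
            direction = np.zeros(d)
            direction[idx] = -g / (h + 1e-12)
            t, base = 1.0, loss(w)                      # backtracking line search
            while loss(w + t * direction) > base and t > 1e-8:
                t *= 0.5
            w = w + t * direction
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -1.0]) > 0).astype(float)
w = pcdn_like(X, y)
acc = float(np.mean((X @ w > 0) == (y == 1)))
```

The shared line search over the combined direction is what restores global convergence despite the zeroed off-diagonal Hessian entries; without it, large bundles can diverge, which is the failure mode the abstract attributes to prior feature-parallel methods.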
Vandalism Detection in Wikipedia: a Bag-of-Words Classifier Approach
A bag-of-words based probabilistic classifier is trained using regularized
logistic regression to detect vandalism in the English Wikipedia. Isotonic
regression is used to calibrate the class-membership probabilities.
Learning-curve, reliability, ROC, and cost analyses are performed.
Comment: 15 pages, 5 figures
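Isotonic-regression calibration reduces to the pool-adjacent-violators (PAV) algorithm applied to the 0/1 labels sorted by classifier score. A minimal stdlib sketch with made-up scores and labels, not the paper's code:

```python
def pav(values):
    """Pool-adjacent-violators: the least-squares monotone (isotonic) fit."""
    blocks = []                       # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Calibration: sort examples by raw classifier score and run PAV on the
# labels; the fitted values are monotone probability estimates.
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
order = sorted(range(len(scores)), key=lambda i: scores[i])
fitted = pav([labels[i] for i in order])
```

Each fitted value is the mean label within a pooled block, so the output is automatically a non-decreasing step function of the score, which is exactly the calibration map isotonic regression provides.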
Fast Black-box Variational Inference through Stochastic Trust-Region Optimization
We introduce TrustVI, a fast second-order algorithm for black-box variational
inference based on trust-region optimization and the reparameterization trick.
At each iteration, TrustVI proposes and assesses a step based on minibatches of
draws from the variational distribution. The algorithm provably converges to a
stationary point. We implemented TrustVI in the Stan framework and compared it
to two alternatives: Automatic Differentiation Variational Inference (ADVI) and
Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is
based on stochastic first-order optimization. The latter uses second-order
information, but lacks convergence guarantees. TrustVI typically converged at
least one order of magnitude faster than ADVI, demonstrating the value of
stochastic second-order information. TrustVI often found substantially better
variational distributions than HFSGVI, demonstrating that our convergence
theory can matter in practice.
Comment: NIPS 2017 camera-ready
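The reparameterization trick that TrustVI builds on can be sketched for a Gaussian variational family. This toy stochastic-gradient loop has no trust-region safeguards and only illustrates the gradient estimator, not TrustVI itself; the target density and step sizes are made up for the example:

```python
import numpy as np

def elbo_grad(mu, log_sigma, dlogp, rng, n_draws=64):
    """Reparameterization-trick gradient of the ELBO for q = N(mu, sigma^2):
    draw eps ~ N(0, 1), set z = mu + sigma * eps, and differentiate
    E[log p(z)] through z; the Gaussian entropy term is exact (the +1)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_draws)
    g = dlogp(mu + sigma * eps)                 # d log p / dz at the draws
    return g.mean(), (g * eps).mean() * sigma + 1.0

# Fit q to the target p = N(2, 1), for which d log p / dz = -(z - 2).
rng = np.random.default_rng(3)
mu, log_sigma = 0.0, -1.0
for _ in range(2000):
    g_mu, g_ls = elbo_grad(mu, log_sigma, lambda z: -(z - 2.0), rng)
    mu += 0.05 * g_mu
    log_sigma += 0.05 * g_ls
```

ADVI uses this estimator with first-order updates; TrustVI wraps second-order steps on the same minibatch draws inside a trust region, which is where its convergence guarantee comes from.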
Data augmentation for non-Gaussian regression models using variance-mean mixtures
We use the theory of normal variance-mean mixtures to derive a
data-augmentation scheme for a class of common regularization problems. This
generalizes existing theory on normal variance mixtures for priors in
regression and classification. It also allows variants of the
expectation-maximization algorithm to be brought to bear on a wider range of
models than previously appreciated. We demonstrate the method on several
examples, including sparse quantile regression and binary logistic regression.
We also show that quasi-Newton acceleration can substantially improve the speed
of the algorithm without compromising its robustness.
Comment: Added a discussion of quasi-Newton acceleration
A comparison of linear and non-linear calibrations for speaker recognition
In recent work on both generative and discriminative score to
log-likelihood-ratio calibration, it was shown that linear transforms give good
accuracy only for a limited range of operating points. Moreover, these methods
required tailoring of the calibration training objective functions in order to
target the desired region of best accuracy. Here, we generalize the linear
recipes to non-linear ones. We experiment with a non-linear, non-parametric,
discriminative PAV solution, as well as parametric, generative,
maximum-likelihood solutions that use Gaussian, Student's T and
normal-inverse-Gaussian score distributions. Experiments on NIST SRE'12 scores
suggest that the non-linear methods provide wider ranges of optimal accuracy
and can be trained without having to resort to objective-function tailoring.
Comment: accepted for Odyssey 2014: The Speaker and Language Recognition Workshop
Indefinite Kernel Logistic Regression with Concave-inexact-convex Procedure
In kernel methods, the kernels are often required to be positive definite,
which rules out many indefinite kernels. To accommodate non-positive-definite
kernels, in this paper we build an indefinite kernel learning framework for
kernel logistic regression. The proposed indefinite kernel logistic regression
(IKLR) model is analysed in Reproducing Kernel Kreĭn Spaces (RKKS) and is
therefore non-convex. Using the positive decomposition of a non-positive-definite
kernel, the derived IKLR model can be decomposed into the difference of two
convex functions. Accordingly, a concave-convex procedure is introduced to
solve the non-convex optimization problem. Since the concave-convex procedure
has to solve a sub-problem in each iteration, we propose a
concave-inexact-convex procedure (CCICP) with an inexact solving scheme to
accelerate the solving process. In addition, we propose a stochastic variant of
CCICP that efficiently obtains a proximal solution, serving a similar purpose
to the inexact scheme in CCICP. We provide convergence analyses for both
variants, so the method works effectively not only in a deterministic setting
but also in a stochastic setting. Experimental results on several benchmarks
suggest that the proposed IKLR model performs favorably against standard
(positive-definite) kernel logistic regression and other competitive
indefinite-learning-based algorithms.
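The concave-convex procedure underlying CCICP can be illustrated on a one-dimensional difference-of-convex toy problem. This sketch uses an exactly solvable subproblem for clarity; the paper's CCICP instead solves its subproblems inexactly, and its stochastic variant samples them:

```python
def cccp(u_prime_inv, v_prime, x0, iters=60):
    """Concave-convex procedure for f = u - v with u, v convex:
    replace v by its linearization at x_k and solve the convex
    subproblem u'(x) = v'(x_k), here via a closed-form inverse of u'."""
    x = x0
    for _ in range(iters):
        x = u_prime_inv(v_prime(x))   # argmin_x  u(x) - v'(x_k) * x
    return x

# Toy DC objective f(x) = x**4 - 2*x**2, split as
# u(x) = x**4 (so u'(x) = 4*x**3) and v(x) = 2*x**2 (so v'(x) = 4*x).
def u_prime_inv(g):                   # solve 4*x**3 = g for x
    return (g / 4.0) ** (1.0 / 3.0) if g >= 0 else -((-g / 4.0) ** (1.0 / 3.0))

x_star = cccp(u_prime_inv, lambda x: 4.0 * x, x0=0.1)
```

Each subproblem majorizes f, so the objective is non-increasing along the iterates; here they converge to the local minimizer x = 1 of x**4 - 2*x**2. In IKLR, u and v come from the positive decomposition of the indefinite kernel, and each subproblem is a convex kernel logistic regression.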