
    A Quasi-Newton Method for Large Scale Support Vector Machines

    This paper adapts a recently developed regularized stochastic version of the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton method for the solution of support vector machine classification problems. The proposed method is shown to converge almost surely to the optimal classifier at a rate that is linear in expectation. Numerical results show that the proposed method exhibits a convergence rate that degrades smoothly with the dimensionality of the feature vectors. Comment: 5 pages, to appear in International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 201
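
    The update below is a minimal sketch of the idea behind a regularized stochastic BFGS step applied to an L2-regularized hinge loss; it is not the paper's exact algorithm, and the hinge_grad helper, the step size eta, and the regularization constant delta are illustrative assumptions.

        import numpy as np

        def hinge_grad(w, X, y, lam):
            # Subgradient of (lam/2)*||w||^2 + mean(max(0, 1 - y * (X @ w))) on a batch.
            margins = y * (X @ w)
            active = margins < 1.0
            return lam * w - (X[active].T @ y[active]) / X.shape[0]

        def stochastic_bfgs_step(w, Binv, X_batch, y_batch, lam, eta=0.1, delta=1e-3):
            # One regularized stochastic BFGS step on a mini-batch (a sketch, not the paper's RES method).
            g = hinge_grad(w, X_batch, y_batch, lam)
            w_new = w - eta * (Binv @ g + delta * g)           # delta*I term keeps the step well conditioned
            g_new = hinge_grad(w_new, X_batch, y_batch, lam)   # same batch for the secant pair
            s, yv = w_new - w, g_new - g
            sy = float(s @ yv)
            if sy > 1e-10:                                     # only update on positive curvature
                rho = 1.0 / sy
                I = np.eye(len(w))
                V = I - rho * np.outer(s, yv)
                Binv = V @ Binv @ V.T + rho * np.outer(s, s)   # standard BFGS inverse-Hessian update
            return w_new, Binv

    Skipping the update when the secant pair does not indicate positive curvature keeps the inverse-Hessian estimate positive definite, which is the usual safeguard in stochastic BFGS variants.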

    Sparse Bayesian Learning with Diagonal Quasi-Newton Method for Large Scale Classification

    Sparse Bayesian Learning (SBL) constructs an extremely sparse probabilistic model with very competitive generalization. However, SBL needs to invert a large covariance matrix, with complexity O(M^3) (M: feature size), to update the regularization priors, which makes it difficult to use in practice. SBL has three issues: 1) inverting the covariance matrix may yield singular solutions in some cases, which prevents SBL from converging; 2) it scales poorly to problems with high-dimensional feature spaces or large data sizes; 3) it easily runs out of memory on large-scale data. This paper addresses these issues with a newly proposed diagonal quasi-Newton (DQN) method for SBL, called DQN-SBL, in which the inversion of the large covariance matrix is avoided so that complexity and memory storage are reduced to O(M). DQN-SBL is thoroughly evaluated on non-linear classifiers and linear feature selection using various benchmark datasets of different sizes. Experimental results verify that DQN-SBL achieves competitive generalization with a very sparse model and scales well to large-scale problems. Comment: 11 pages, 5 figures
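
    As a rough illustration of why a diagonal quasi-Newton update keeps cost and memory at O(M), the sketch below maintains only a length-M vector of per-coordinate curvature estimates. It is not the DQN-SBL update itself; grad_fn, the clipping bounds, and the step size are assumptions made for the example.

        import numpy as np

        def diagonal_qn_step(w, d, grad_fn, eta=1.0, d_min=1e-4, d_max=1e4):
            # One diagonal quasi-Newton step: only the length-M vector d of
            # per-coordinate curvature estimates is stored and updated.
            g = grad_fn(w)
            w_new = w - eta * g / d                   # element-wise scaling, no matrix inverse
            s, y = w_new - w, grad_fn(w_new) - g      # secant pair
            mask = np.abs(s) > 1e-12
            d_new = d.copy()
            d_new[mask] = np.clip(y[mask] / s[mask], d_min, d_max)  # keep curvature positive and bounded
            return w_new, d_new

    Because only the vector d is stored and updated, the per-iteration cost and memory match the O(M) scaling highlighted in the abstract.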

    Efficient Methods For Large-Scale Empirical Risk Minimization

    Empirical risk minimization (ERM) problems express optimal classifiers as solutions of optimization problems in which the objective is the sum of a very large number of sample costs. An evident obstacle to using traditional descent algorithms for this class of problems is their prohibitive computational complexity when the number of component functions in the ERM problem is large. The main goal of this thesis is to study different approaches to solving these large-scale ERM problems.

    We begin by focusing on incremental and stochastic methods, which split the training samples into smaller sets across time to lower the computational burden of traditional descent algorithms. We develop and analyze convergent stochastic variants of quasi-Newton methods which do not require computation of the objective Hessian and approximate the curvature using only gradient information. We show that the curvature approximation in stochastic quasi-Newton methods leads to faster convergence relative to first-order stochastic methods when the problem is ill-conditioned. We culminate with the introduction of an incremental method that exploits memory to achieve a superlinear convergence rate, which is the best known convergence rate for an incremental method.

    An alternative strategy for lowering the prohibitive cost of solving large-scale ERM problems is decentralized optimization, whereby samples are separated not across time but across multiple nodes of a network. In this regime, the main contribution of this thesis is in incorporating second-order information of the aggregate risk corresponding to the samples of all nodes in the network in a way that can be implemented in a distributed fashion. We also explore the separation of samples across both time and space to reduce the computational and communication cost of solving large-scale ERM problems. We study this path by introducing a decentralized stochastic method which incorporates the idea of stochastic averaging gradients, leading to a low computational complexity method with a fast linear convergence rate.

    We then introduce a rethinking of ERM in which we consider not a partition of the training set, as in the case of stochastic and distributed optimization, but a nested collection of subsets that we grow geometrically. The key insight is that the optimal argument associated with a training subset of a certain size is not far from the optimal argument associated with a larger training subset. Based on this insight, we present adaptive sample size schemes which start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically, and the solution of the previous ERM problem is used as a warm start for the new one. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. We further show that if we couple the adaptive sample size scheme with Newton's method, it is possible to double the training set in successive stages and perform a single Newton iteration in between. This is possible because of the interplay between the statistical accuracy and the quadratic convergence region of these problems, and it yields a method that is guaranteed to solve an ERM problem by performing just two passes over the dataset.
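
    The adaptive sample size scheme described in the last part of the abstract can be sketched as follows; the inner gradient-descent solver, the 1/n proxy for statistical accuracy, and the parameters n0 and eta are illustrative assumptions rather than the thesis's exact algorithm.

        import numpy as np

        def adaptive_sample_size_erm(X, y, grad_fn, w0, n0=128, eta=0.1):
            # Solve ERM on a small subset to (roughly) its statistical accuracy,
            # then double the subset and warm-start from the previous solution.
            N = X.shape[0]
            n, w = min(n0, N), w0
            while True:
                Xs, ys = X[:n], y[:n]
                target = 1.0 / n                     # statistical accuracy proxy ~ 1/n
                while True:
                    g = grad_fn(w, Xs, ys)
                    if np.linalg.norm(g) <= target:  # subset solved to its accuracy
                        break
                    w = w - eta * g                  # inner solver: plain gradient descent
                if n == N:
                    return w                         # full dataset solved to its accuracy
                n = min(2 * n, N)                    # grow the training subset geometrically

    Replacing the inner gradient-descent loop with a single Newton step per doubling corresponds to the two-pass behavior described at the end of the abstract.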