4 research outputs found
A Quasi-Newton Method for Large Scale Support Vector Machines
This paper adapts a recently developed regularized stochastic version of the
Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton method for the
solution of support vector machine classification problems. The proposed method
is shown to converge almost surely to the optimal classifier at a rate that is
linear in expectation. Numerical results show that the proposed method exhibits
a convergence rate that degrades smoothly with the dimensionality of the
feature vectors.Comment: 5 pages, To appear in International Conference on Acoustics, Speech,
and Signal Processing (ICASSP) 201
Sparse Bayesian Learning with Diagonal Quasi-Newton Method for Large Scale Classification
Sparse Bayesian Learning (SBL) constructs an extremely sparse probabilistic
model with very competitive generalization. However, SBL needs to invert a big
covariance matrix with complexity O(M^3 ) (M: feature size) for updating the
regularization priors, making it difficult for practical use. There are three
issues in SBL: 1) Inverting the covariance matrix may obtain singular solutions
in some cases, which hinders SBL from convergence; 2) Poor scalability to
problems with high dimensional feature space or large data size; 3) SBL easily
suffers from memory overflow for large-scale data. This paper addresses these
issues with a newly proposed diagonal Quasi-Newton (DQN) method for SBL called
DQN-SBL where the inversion of big covariance matrix is ignored so that the
complexity and memory storage are reduced to O(M). The DQN-SBL is thoroughly
evaluated on non-linear classifiers and linear feature selection using various
benchmark datasets of different sizes. Experimental results verify that DQN-SBL
receives competitive generalization with a very sparse model and scales well to
large-scale problems.Comment: 11 pages,5 figure
Efficient Methods For Large-Scale Empirical Risk Minimization
Empirical risk minimization (ERM) problems express optimal classifiers as solutions of optimization problems in which the objective is the sum of a very large number of sample costs. An evident obstacle in using traditional descent algorithms for solving this class of problems is their prohibitive computational complexity when the number of component functions in the ERM problem is large. The main goal of this thesis is to study different approaches to solve these large-scale ERM problems.
We begin by focusing on incremental and stochastic methods which split the training samples into smaller sets across time to lower the computation burden of traditional descent algorithms. We develop and analyze convergent stochastic variants of quasi-Newton methods which do not require computation of the objective Hessian and approximate the curvature using only gradient information. We show that the curvature approximation in stochastic quasi-Newton methods leads to faster convergence relative to first-order stochastic methods when the problem is ill-conditioned. We culminate with the introduction of an incremental method that exploits memory to achieve a superlinear convergence rate. This is the best known convergence rate for an incremental method.
An alternative strategy for lowering the prohibitive cost of solving large-scale ERM problems is decentralized optimization whereby samples are separated not across time but across multiple nodes of a network. In this regime, the main contribution of this thesis is in incorporating second-order information of the aggregate risk corresponding to samples of all nodes in the network in a way that can be implemented in a distributed fashion. We also explore the separation of samples across both, time and space, to reduce the computational and communication cost for solving large-scale ERM problems. We study this path by introducing a decentralized stochastic method which incorporates the idea of stochastic averaging gradient leading to a low computational complexity method with a fast linear convergence rate.
We then introduce a rethinking of ERM in which we consider not a partition of the training set as in the case of stochastic and distributed optimization, but a nested collection of subsets that we grow geometrically. The key insight is that the optimal argument associated with a training subset of a certain size is not that far from the optimal argument associated with a larger training subset. Based on this insight, we present adaptive sample size schemes which start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically and use the solution of the previous ERM as a warm start for the new ERM. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. We further show that if we couple the adaptive sample size scheme with Newton\u27s method, it is possible to consider subsequent doubling of the training set and perform a single Newton iteration in between. This is possible because of the interplay between the statistical accuracy and the quadratic convergence region of these problems and yields a method that is guaranteed to solve an ERM problem by performing just two passes over the dataset