869 research outputs found
Scalable Kernel Methods via Doubly Stochastic Gradients
The general perception is that kernel methods are not scalable, and neural
nets are the methods of choice for nonlinear learning problems. Or have we
simply not tried hard enough for kernel methods? Here we propose an approach
that scales up kernel methods using a novel concept called "doubly stochastic
functional gradients". Our approach relies on the fact that many kernel methods
can be expressed as convex optimization problems, and we solve the problems by
making two unbiased stochastic approximations to the functional gradient, one
using random training points and another using random functions associated with
the kernel, and then descending using this noisy functional gradient. We show
that a function produced by this procedure after iterations converges to
the optimal function in the reproducing kernel Hilbert space in rate ,
and achieves a generalization performance of . This doubly
stochasticity also allows us to avoid keeping the support vectors and to
implement the algorithm in a small memory footprint, which is linear in number
of iterations and independent of data dimension. Our approach can readily scale
kernel methods up to the regimes which are dominated by neural nets. We show
that our method can achieve competitive performance to neural nets in datasets
such as 8 million handwritten digits from MNIST, 2.3 million energy materials
from MolecularSpace, and 1 million photos from ImageNet.Comment: 32 pages, 22 figure
Hyperparameter optimization with approximate gradient
Most models in machine learning contain at least one hyperparameter to
control for model complexity. Choosing an appropriate set of hyperparameters is
both crucial in terms of model accuracy and computationally challenging. In
this work we propose an algorithm for the optimization of continuous
hyperparameters using inexact gradient information. An advantage of this method
is that hyperparameters can be updated before model parameters have fully
converged. We also give sufficient conditions for the global convergence of
this method, based on regularity conditions of the involved functions and
summability of errors. Finally, we validate the empirical performance of this
method on the estimation of regularization constants of L2-regularized logistic
regression and kernel Ridge regression. Empirical benchmarks indicate that our
approach is highly competitive with respect to state of the art methods.Comment: Proceedings of the International conference on Machine Learning
(ICML
Predicting diabetes-related hospitalizations based on electronic health records
OBJECTIVE: To derive a predictive model to identify patients likely to be hospitalized during the following year due to complications attributed to Type II diabetes. METHODS: A variety of supervised machine learning classification methods were tested and a new method that discovers hidden patient clusters in the positive class (hospitalized) was developed while, at the same time, sparse linear support vector machine classifiers were derived to separate positive samples from the negative ones (non-hospitalized). The convergence of the new method was established and theoretical guarantees were proved on how the classifiers it produces generalize to a test set not seen during training. RESULTS: The methods were tested on a large set of patients from the Boston Medical Center - the largest safety net hospital in New England. It is found that our new joint clustering/classification method achieves an accuracy of 89% (measured in terms of area under the ROC Curve) and yields informative clusters which can help interpret the classification results, thus increasing the trust of physicians to the algorithmic output and providing some guidance towards preventive measures. While it is possible to increase accuracy to 92% with other methods, this comes with increased computational cost and lack of interpretability. The analysis shows that even a modest probability of preventive actions being effective (more than 19%) suffices to generate significant hospital care savings. CONCLUSIONS: Predictive models are proposed that can help avert hospitalizations, improve health outcomes and drastically reduce hospital expenditures. The scope for savings is significant as it has been estimated that in the USA alone, about $5.8 billion are spent each year on diabetes-related hospitalizations that could be prevented.Accepted manuscrip
Recurrent kernel machines : computing with infinite echo state networks
Echo state networks (ESNs) are large, random recurrent neural networks with a single trained linear readout layer. Despite the untrained nature of the recurrent weights, they are capable of performing universal computations on temporal input data, which makes them interesting for both theoretical research and practical applications. The key to their success lies in the fact that the network computes a broad set of nonlinear, spatiotemporal mappings of the input data, on which linear regression or classification can easily be performed. One could consider the reservoir as a spatiotemporal kernel, in which the mapping to a high-dimensional space is computed explicitly. In this letter, we build on this idea and extend the concept of ESNs to infinite-sized recurrent neural networks, which can be considered recursive kernels that subsequently can be used to create recursive support vector machines. We present the theoretical framework, provide several practical examples of recursive kernels, and apply them to typical temporal tasks
Variational Hamiltonian Monte Carlo via Score Matching
Traditionally, the field of computational Bayesian statistics has been
divided into two main subfields: variational methods and Markov chain Monte
Carlo (MCMC). In recent years, however, several methods have been proposed
based on combining variational Bayesian inference and MCMC simulation in order
to improve their overall accuracy and computational efficiency. This marriage
of fast evaluation and flexible approximation provides a promising means of
designing scalable Bayesian inference methods. In this paper, we explore the
possibility of incorporating variational approximation into a state-of-the-art
MCMC method, Hamiltonian Monte Carlo (HMC), to reduce the required gradient
computation in the simulation of Hamiltonian flow, which is the bottleneck for
many applications of HMC in big data problems. To this end, we use a {\it
free-form} approximation induced by a fast and flexible surrogate function
based on single-hidden layer feedforward neural networks. The surrogate
provides sufficiently accurate approximation while allowing for fast
exploration of parameter space, resulting in an efficient approximate inference
algorithm. We demonstrate the advantages of our method on both synthetic and
real data problems
- …