Spectrum-Revealing Cholesky Factorization for Kernel Methods
Kernel methods represent some of the most popular machine learning tools for
data analysis. Since exact kernel methods can be prohibitively expensive for
large problems, reliable low-rank matrix approximations and high-performance
implementations have become indispensable for practical applications of kernel
methods. In this work, we introduce spectrum-revealing Cholesky factorization,
a reliable low-rank matrix factorization, for kernel matrix approximation. We
also develop an efficient and effective randomized algorithm for computing this
factorization. Our numerical experiments demonstrate that this algorithm is as
effective as other Cholesky factorization based kernel methods on machine
learning problems, but significantly faster.
Comment: 7 pages, 8 figures, accepted by 2016 IEEE 16th International Conference on Data Mining
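To make the low-rank idea concrete, here is a minimal sketch of a greedy, diagonally pivoted Cholesky approximation of a kernel matrix. It is not the paper's spectrum-revealing algorithm (which adds extra pivot validation for stronger spectral guarantees); the function name, tolerance, and toy kernel are illustrative only.

```python
import numpy as np

def pivoted_cholesky(K, rank, tol=1e-10):
    """Greedy diagonally pivoted Cholesky: K ~= L @ L.T with L of shape (n, rank).

    A generic low-rank kernel approximation; the spectrum-revealing variant in
    the paper adds further pivoting for stronger guarantees (not shown here).
    """
    n = K.shape[0]
    d = np.diag(K).astype(float)             # residual diagonal of K - L @ L.T
    L = np.zeros((n, rank))
    pivots = []
    for j in range(rank):
        i = int(np.argmax(d))                 # greedily pick the largest residual
        if d[i] <= tol:                       # already numerically exact
            return L[:, :j], pivots
        pivots.append(i)
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(d[i])
        d -= L[:, j] ** 2
    return L, pivots

# toy check on an RBF kernel matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
L, piv = pivoted_cholesky(K, rank=30)
print(np.linalg.norm(K - L @ L.T) / np.linalg.norm(K))   # relative approximation error
```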
Exponential Families for Conditional Random Fields
In this paper we define conditional random fields in reproducing kernel Hilbert
spaces and show connections to Gaussian Process classification. More
specifically, we prove decomposition results for undirected graphical models and
we give constructions for kernels. Finally we present efficient means of solving
the optimization problem using reduced rank decompositions and we show how
stationarity can be exploited efficiently in the optimization process.
Comment: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)
Explicit Approximations of the Gaussian Kernel
We investigate training and using Gaussian kernel SVMs by approximating the
kernel with an explicit finite-dimensional polynomial feature representation
based on the Taylor expansion of the exponential. Although not as efficient as
the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms
of the number of features, we show how this polynomial representation can
provide a better approximation in terms of the computational cost involved.
This makes our "Taylor features" especially attractive for use on very large
data sets, in conjunction with online or stochastic training.
Comment: 11 pages, 2 tables, 2 figures
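As a rough illustration of the idea, the sketch below builds explicit Taylor features for the Gaussian kernel by expanding exp(<x,y>/sigma^2) and tensorizing the monomials. This naive construction is feasible only for small input dimension and low degree; the paper's actual feature construction and truncation strategy may differ.

```python
import itertools
import math
import numpy as np

def taylor_features(X, sigma=1.0, degree=3):
    """Explicit features whose inner products approximate the Gaussian kernel.

    Uses k(x, y) = e^{-|x|^2/2s^2} e^{-|y|^2/2s^2} e^{<x,y>/s^2} and truncates the
    Taylor series of the last factor at `degree`; each power <x,y>^k is realized
    as an inner product of k-fold tensor products, so keep d and degree small.
    """
    n, d = X.shape
    Z = X / sigma
    prefactor = np.exp(-0.5 * np.sum(Z ** 2, axis=1))
    blocks = [np.ones((n, 1))]                                  # degree-0 term
    for k in range(1, degree + 1):
        cols = [np.prod(Z[:, list(idx)], axis=1)
                for idx in itertools.product(range(d), repeat=k)]
        blocks.append(np.stack(cols, axis=1) / math.sqrt(math.factorial(k)))
    return prefactor[:, None] * np.hstack(blocks)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Phi = taylor_features(X, sigma=1.5, degree=4)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * 1.5 ** 2))
print(np.abs(Phi @ Phi.T - K).max())        # approximation error shrinks as degree grows
```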
Scalable Multilabel Prediction via Randomized Methods
Modeling the dependence between outputs is a fundamental challenge in
multilabel classification. In this work we show that a generic regularized
nonlinearity mapping independent predictions to joint predictions is sufficient
to achieve state-of-the-art performance on a variety of benchmark problems.
Crucially, we compute the joint predictions without ever obtaining any
independent predictions, while incorporating low-rank and smoothness
regularization. We achieve this by leveraging randomized algorithms for matrix
decomposition and kernel approximation. Furthermore, our techniques are
applicable to the multiclass setting. We apply our method to a variety of
multiclass and multilabel data sets, obtaining state-of-the-art results.
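The abstract does not spell out the construction, but one plausible instantiation of "randomized matrix decomposition plus kernel approximation" is sketched below: random Fourier features for the input kernel, a randomized range finder for a low-rank label embedding, and joint label scores read off directly from the low-rank fit. All function names, ranks, and regularization values here are assumptions, not the paper's method.

```python
import numpy as np

def rff(X, n_features=500, gamma=0.1, seed=0):
    """Random Fourier features approximating an RBF kernel (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def fit_lowrank_multilabel(X, Y, rank=20, lam=1e-2):
    """Ridge regression from random features onto a randomized low-rank label basis."""
    Phi = rff(X)
    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(Y @ rng.normal(size=(Y.shape[1], rank)))   # randomized range finder
    V, _ = np.linalg.qr(Y.T @ Q)              # approximate right singular subspace, L x rank
    T = Y @ V                                  # low-rank label targets
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    W = np.linalg.solve(A, Phi.T @ T)          # map features -> low-rank label space
    return W, V

def predict_scores(X_new, W, V):
    """Joint scores for all labels at once, never forming per-label predictions."""
    return rff(X_new) @ W @ V.T
```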
Compact Nonlinear Maps and Circulant Extensions
Kernel approximation via nonlinear random feature maps is widely used in
speeding up kernel machines. There are two main challenges for the conventional
kernel approximation methods. First, before performing kernel approximation, a
good kernel has to be chosen. Picking a good kernel is a very challenging
problem in itself. Second, high-dimensional maps are often required in order to
achieve good performance. This leads to high computational cost in both
generating the nonlinear maps, and in the subsequent learning and prediction
process. In this work, we propose to optimize the nonlinear maps directly with
respect to the classification objective in a data-dependent fashion. The
proposed approach achieves kernel approximation and kernel learning in a joint
framework. This leads to much more compact maps without hurting the
performance. As a by-product, the same framework can also be used to achieve
more compact kernel maps to approximate a known kernel. We also introduce
Circulant Nonlinear Maps, which use a circulant-structured projection matrix
to speed up the nonlinear maps for high-dimensional data.
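As an illustration of the circulant idea (not the learned, classification-driven maps the paper actually optimizes), the sketch below applies a circulant projection via the FFT so the projection costs O(d log d) per example instead of O(d^2); the cosine nonlinearity and random sign flipping are illustrative stand-ins.

```python
import numpy as np

def circulant_map(X, seed=0):
    """Nonlinear feature map with a circulant projection, applied via the FFT.

    Multiplying by a circulant matrix (defined by one vector c) is a cyclic
    convolution, so it can be done in O(d log d) with the FFT; random sign
    flipping decorrelates the rows before projecting.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    c = rng.normal(size=d)                        # defining vector of the circulant matrix
    signs = rng.choice([-1.0, 1.0], size=d)
    b = rng.uniform(0, 2 * np.pi, size=d)
    # circulant multiply: C x = ifft(fft(c) * fft(x)), done for all rows at once
    proj = np.fft.ifft(np.fft.fft(c) * np.fft.fft(X * signs, axis=1), axis=1).real
    return np.sqrt(2.0 / d) * np.cos(proj + b)

X = np.random.default_rng(1).normal(size=(4, 1024))
print(circulant_map(X).shape)                     # (4, 1024): one feature per circulant row
```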
Approximating the Quadratic Transportation Metric in Near-Linear Time
Computing the quadratic transportation metric (also called the 2-Wasserstein
distance or root mean square distance) between two point clouds, or, more
generally, two discrete distributions, is a fundamental problem in machine
learning, statistics, computer graphics, and theoretical computer science. A
long line of work has culminated in a sophisticated geometric algorithm due to
Agarwal and Sharathkumar in 2014, which runs in time $\tilde{O}(n^{3/2})$, where
$n$ is the number of points. However, obtaining faster algorithms has proven
difficult since the 2-Wasserstein distance is known to have poor sketching and
embedding properties, which limits the effectiveness of geometric approaches. In
this paper, we give an extremely simple deterministic algorithm with near-linear
runtime by using a completely different approach based on entropic
regularization, approximate Sinkhorn scaling, and low-rank approximations of
Gaussian kernel matrices. We give explicit dependence of our algorithm on the
dimension and precision of the approximation.
Comment: unchanged from v1; this article now superseded by arXiv:1812.0518
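A minimal sketch of the computational core (entropic regularization plus Sinkhorn scaling with a low-rank Gibbs kernel) follows, assuming a generic rank-r factorization of the kernel; the paper's specific approximation scheme, error analysis, and runtime guarantees are not reproduced.

```python
import numpy as np

def sinkhorn_lowrank(L, R, mu, nu, iters=500):
    """Sinkhorn scaling with the Gibbs kernel given in low-rank form K ~= L @ R.T.

    Each iteration costs O(n r) instead of O(n^2); this is only the scaling loop,
    without rank selection or convergence diagnostics.
    """
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(iters):
        u = mu / (L @ (R.T @ v))        # K v   via the low-rank factors
        v = nu / (R @ (L.T @ u))        # K^T u via the low-rank factors
    return u, v

# toy usage: squared-Euclidean cost, Gibbs kernel K = exp(-C / eta), rank-r factors via SVD
rng = np.random.default_rng(0)
x, y = rng.normal(size=(300, 2)), rng.normal(size=(300, 2)) + 1.0
C = ((x[:, None] - y[None, :]) ** 2).sum(-1)
eta, r = 1.0, 50
K = np.exp(-C / eta)
U, s, Vt = np.linalg.svd(K)
L, R = U[:, :r] * s[:r], Vt[:r].T       # rank-r surrogate for K
mu = np.full(300, 1 / 300); nu = np.full(300, 1 / 300)
u, v = sinkhorn_lowrank(L, R, mu, nu)
print(np.einsum('i,ij,j,ij->', u, K, v, C))   # <diag(u) K diag(v), C>: regularized transport cost
```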
Gradient-based kernel dimension reduction for supervised learning
This paper proposes a novel kernel approach to linear dimension reduction for
supervised learning. The purpose of the dimension reduction is to find
directions in the input space to explain the output as effectively as possible.
The proposed method uses an estimator of the gradient of the regression
function, based on covariance operators on reproducing kernel Hilbert spaces. In
comparison with other existing methods, the proposed one has wide applicability
without strong assumptions on the distributions or the type of variables, and
uses a computationally simple eigendecomposition. Experimental results show that
the proposed method successfully finds the effective directions with efficient
computation.
Comment: 21 pages
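A hedged sketch of the flavor of such an estimator is given below: kernel ridge quantities built from Gram matrices, an analytic kernel gradient, and a single eigendecomposition of the resulting d x d matrix. The exact estimator, normalizations, and hyperparameter choices in the paper may differ.

```python
import numpy as np

def gaussian_gram(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def gkdr_directions(X, Y, dim, sigma_x=1.0, sigma_y=1.0, lam=1e-3):
    """Gradient-based kernel dimension reduction, sketched from the abstract.

    Builds an estimate of M ~ E[grad m(X) grad m(X)^T] from kernel covariance
    quantities and returns the top `dim` eigenvectors of M as the effective
    directions. Placeholder hyperparameters, not the paper's choices.
    """
    n, d = X.shape
    Kx = gaussian_gram(X, sigma_x)
    Ky = gaussian_gram(Y.reshape(n, -1), sigma_y)
    B = np.linalg.solve(Kx + n * lam * np.eye(n), np.eye(n))   # (Kx + n*lam*I)^-1
    F = B @ Ky @ B
    M = np.zeros((d, d))
    for i in range(n):
        # gradient of k(X_j, x) at x = X_i, for all j: shape (n, d)
        G = Kx[:, i, None] * (X - X[i]) / sigma_x ** 2
        M += G.T @ F @ G
    M /= n
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, ::-1][:, :dim]            # columns are the top-`dim` directions
```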
Dynamic Mode Decomposition based feature for Image Classification
Although machine learning has produced groundbreaking results, it demands an
enormous amount of data to do so. Even though data production is at an all-time
high, almost all of this data is unlabelled, which makes it unsuitable for
training the algorithms. This paper proposes a novel method of extracting
features using Dynamic Mode Decomposition (DMD). The experiment is performed
using data samples from ImageNet. Learning is done using a linear SVM, an RBF
SVM, and the Random Kitchen Sinks (RKS) approach. The results show that DMD
features with RKS give competitive results.
Comment: Selected for Spotlight presentation at TENCON 201
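The abstract does not describe how DMD is applied to still images, so the sketch below is only one plausible construction: treat an image's columns as a snapshot sequence, fit the standard rank-truncated exact-DMD operator, and use its eigenvalue magnitudes as a fixed-length descriptor.

```python
import numpy as np

def dmd_features(image, rank=10):
    """DMD-based feature vector for a single image (hypothetical construction).

    Columns of the image play the role of successive snapshots; the reduced DMD
    operator A_tilde = U^T X2 V S^{-1} is built from a rank-truncated SVD of X1.
    """
    X1, X2 = image[:, :-1].astype(float), image[:, 1:].astype(float)
    U, s, Vt = np.linalg.svd(X1, full_matrices=False)
    r = min(rank, int((s > 1e-10).sum()))
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    A_tilde = U.T @ X2 @ Vt.T / s              # r x r reduced operator
    feats = np.abs(np.linalg.eigvals(A_tilde))  # eigenvalue magnitudes as features
    return np.pad(feats, (0, rank - r))         # fixed length even if rank deficient

feat = dmd_features(np.random.default_rng(0).random((64, 64)), rank=10)
print(feat.shape)   # (10,)
```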
GP-select: Accelerating EM using adaptive subspace preselection
We propose a nonparametric procedure to achieve fast inference in generative
graphical models when the number of latent states is very large. The approach
is based on iterative latent variable preselection, where we alternate between
learning a 'selection function' to reveal the relevant latent variables and
using it to obtain a compact approximation of the posterior distribution for EM;
this can make inference possible where the number of possible latent states is,
e.g., exponential in the number of latent variables, whereas an exact approach
would be computationally infeasible. We learn the selection function entirely
from the observed data and the current EM state via Gaussian process regression.
This is in contrast to earlier approaches, where selection functions were
manually designed for each problem setting. We show that our approach performs
as well as these bespoke selection functions on a wide variety of inference
problems: in particular, for the challenging case of a hierarchical model for
object localization with occlusion, we achieve results that match a customized
state-of-the-art selection method, at a far lower computational cost.
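The selection-function idea can be sketched with off-the-shelf Gaussian process regression, as below using scikit-learn; the regression targets (per-latent relevance scores derived from the current EM state), the kernel choice, and how the selection feeds back into EM are all simplifying assumptions here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def train_selection_function(Y_obs, relevance):
    """Fit a GP mapping an observation to per-latent-variable relevance scores.

    `relevance` (n_points x n_latents) stands in for scores obtained from the
    current EM state, e.g. approximate posterior marginals.
    """
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(Y_obs, relevance)
    return gp

def preselect(gp, Y_new, k):
    """For each new observation, return the indices of the k highest-scoring latents."""
    scores = gp.predict(Y_new)
    return np.argsort(-scores, axis=1)[:, :k]
```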
Random Maxout Features
In this paper, we propose and study random maxout features, which are
constructed by first projecting the input data onto sets of randomly generated
vectors with Gaussian elements, and then outputting the maximum projection value
for each set. We show that the resulting random feature map, when used in
conjunction with linear models, allows for the locally linear estimation of the
function of interest in classification tasks, and for the locally linear
embedding of points when used for dimensionality reduction or data
visualization. We derive generalization bounds for learning that assess the
error in approximating locally linear functions by linear functions in the
maxout feature space, and empirically evaluate the efficacy of the approach on
the MNIST and TIMIT classification tasks.
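Since the construction is stated directly in the abstract (take the maximum over sets of random Gaussian projections), a small sketch is easy to give; the number of sets, set size, and scaling below are illustrative choices rather than the paper's.

```python
import numpy as np

def random_maxout_features(X, n_sets=200, set_size=4, seed=0):
    """Random maxout features: the max of several random Gaussian projections per set.

    Each of the `n_sets` output coordinates is max_j <w_j, x> over `set_size`
    independent Gaussian vectors w_j.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_sets, set_size, X.shape[1]))
    proj = np.einsum('nd,ksd->nks', X, W)   # projections: (n, n_sets, set_size)
    return proj.max(axis=2)                  # max over each set of projections

X = np.random.default_rng(1).normal(size=(5, 10))
print(random_maxout_features(X).shape)       # (5, 200)
```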