
    Spectrum-Revealing Cholesky Factorization for Kernel Methods

    Kernel methods represent some of the most popular machine learning tools for data analysis. Since exact kernel methods can be prohibitively expensive for large problems, reliable low-rank matrix approximations and high-performance implementations have become indispensable for practical applications of kernel methods. In this work, we introduce spectrum-revealing Cholesky factorization, a reliable low-rank matrix factorization, for kernel matrix approximation. We also develop an efficient and effective randomized algorithm for computing this factorization. Our numerical experiments demonstrate that this algorithm is as effective as other Cholesky-factorization-based kernel methods on machine learning problems, but significantly faster. Comment: 7 pages, 8 figures, accepted by 2016 IEEE 16th International Conference on Data Minin
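
To make the idea of a low-rank Cholesky approximation of a kernel matrix concrete, here is a minimal sketch of the classical diagonal-pivoted partial Cholesky factorization. It is not the spectrum-revealing or randomized variant from the abstract; the function names, the 1-D Gaussian kernel, and the tolerance `tol` are assumptions chosen for illustration.

```python
import math


def gaussian_kernel_matrix(points, sigma=1.0):
    """Dense Gaussian kernel matrix for a list of 1-D points (illustrative)."""
    return [[math.exp(-((p - q) ** 2) / (2.0 * sigma ** 2)) for q in points]
            for p in points]


def pivoted_cholesky(K, rank, tol=1e-12):
    """Return an n-by-rank factor L with K approximately equal to L L^T.

    Classical greedy diagonal pivoting, NOT the spectrum-revealing variant.
    """
    n = len(K)
    residual_diag = [K[i][i] for i in range(n)]  # diagonal of K - L L^T
    L = [[0.0] * rank for _ in range(n)]
    for k in range(rank):
        # Pick the column with the largest remaining diagonal entry.
        p = max(range(n), key=lambda i: residual_diag[i])
        if residual_diag[p] <= tol:
            break  # residual is numerically zero; remaining columns stay 0
        pivot = math.sqrt(residual_diag[p])
        for i in range(n):
            correction = sum(L[i][j] * L[p][j] for j in range(k))
            L[i][k] = (K[i][p] - correction) / pivot
        for i in range(n):
            residual_diag[i] -= L[i][k] ** 2
    return L


def reconstruct(L):
    """Form L L^T so the approximation error can be inspected."""
    n = len(L)
    return [[sum(a * b for a, b in zip(L[i], L[j])) for j in range(n)]
            for i in range(n)]
```

Calling `pivoted_cholesky(K, rank=r)` with `r` much smaller than `n` touches only `r` columns of `K`, which is where the savings over an exact factorization come from.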

    Exponential Families for Conditional Random Fields

    In this paper we define conditional random fields in reproducing kernel Hilbert spaces and show connections to Gaussian Process classification. More specifically, we prove decomposition results for undirected graphical models and we give constructions for kernels. Finally we present efficient means of solving the optimization problem using reduced rank decompositions and we show how stationarity can be exploited efficiently in the optimization process. Comment: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004

    Explicit Approximations of the Gaussian Kernel

    We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction with online or stochastic training. Comment: 11 pages, 2 tables, 2 figure
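
The Taylor-expansion idea can be sketched as follows, assuming the kernel is written as k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) = exp(-||x||^2 / (2 sigma^2)) exp(-||y||^2 / (2 sigma^2)) exp(<x, y> / sigma^2) and the last factor is truncated after `degree` terms. The function names are hypothetical, and this version enumerates all d^k index tuples per degree rather than merging repeated monomials, which a practical implementation would likely do.

```python
import itertools
import math


def taylor_features(x, sigma=1.0, degree=4):
    """Explicit polynomial features whose inner product approximates the
    Gaussian kernel (illustrative sketch, not the paper's exact construction)."""
    scale = math.exp(-sum(v * v for v in x) / (2.0 * sigma ** 2))
    features = []
    for k in range(degree + 1):
        # Degree-k block: sum over index tuples of prod(x) * prod(y)
        # equals <x, y>^k, so dividing by sigma^k * sqrt(k!) on each side
        # yields the k-th Taylor term <x, y>^k / (sigma^(2k) * k!).
        coeff = scale / (sigma ** k * math.sqrt(math.factorial(k)))
        for idx in itertools.product(range(len(x)), repeat=k):
            features.append(coeff * math.prod(x[j] for j in idx))
    return features


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel value, used only for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def approx_kernel(x, y, sigma=1.0, degree=4):
    """Inner product of the two explicit feature vectors."""
    fx = taylor_features(x, sigma, degree)
    fy = taylor_features(y, sigma, degree)
    return sum(a * b for a, b in zip(fx, fy))
```

Because the features are explicit, a linear model trained on `taylor_features(x)` stands in for a kernel machine, which is what makes online or stochastic training straightforward.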

    Scalable Multilabel Prediction via Randomized Methods

    Modeling the dependence between outputs is a fundamental challenge in multilabel classification. In this work we show that a generic regularized nonlinearity mapping independent predictions to joint predictions is sufficient to achieve state-of-the-art performance on a variety of benchmark problems. Crucially, we compute the joint predictions without ever obtaining any independent predictions, while incorporating low-rank and smoothness regularization. We achieve this by leveraging randomized algorithms for matrix decomposition and kernel approximation. Furthermore, our techniques are applicable to the multiclass setting. We apply our method to a variety of multiclass and multilabel data sets, obtaining state-of-the-art results.

    Compact Nonlinear Maps and Circulant Extensions

    Kernel approximation via nonlinear random feature maps is widely used in speeding up kernel machines. There are two main challenges for the conventional kernel approximation methods. First, before performing kernel approximation, a good kernel has to be chosen. Picking a good kernel is a very challenging problem in itself. Second, high-dimensional maps are often required in order to achieve good performance. This leads to high computational cost both in generating the nonlinear maps and in the subsequent learning and prediction process. In this work, we propose to optimize the nonlinear maps directly with respect to the classification objective in a data-dependent fashion. The proposed approach achieves kernel approximation and kernel learning in a joint framework. This leads to much more compact maps without hurting the performance. As a by-product, the same framework can also be used to achieve more compact kernel maps to approximate a known kernel. We also introduce Circulant Nonlinear Maps, which use a circulant-structured projection matrix to speed up the nonlinear maps for high-dimensional data.

    Approximating the Quadratic Transportation Metric in Near-Linear Time

    Computing the quadratic transportation metric (also called the 2-Wasserstein distance or root mean square distance) between two point clouds, or, more generally, two discrete distributions, is a fundamental problem in machine learning, statistics, computer graphics, and theoretical computer science. A long line of work has culminated in a sophisticated geometric algorithm due to Agarwal and Sharathkumar in 2014, which runs in time $\tilde{O}(n^{3/2})$, where $n$ is the number of points. However, obtaining faster algorithms has proven difficult since the 2-Wasserstein distance is known to have poor sketching and embedding properties, which limits the effectiveness of geometric approaches. In this paper, we give an extremely simple deterministic algorithm with $\tilde{O}(n)$ runtime by using a completely different approach based on entropic regularization, approximate Sinkhorn scaling, and low-rank approximations of Gaussian kernel matrices. We give explicit dependence of our algorithm on the dimension and precision of the approximation. Comment: unchanged from v1; this article now superseded by arXiv:1812.0518
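
As a rough illustration of entropic regularization with Sinkhorn scaling, the sketch below runs plain Sinkhorn iterations on a dense Gaussian kernel matrix for 1-D points. It deliberately omits the low-rank kernel approximation that gives the abstract its near-linear runtime, so each iteration here costs O(n^2); the function names, the regularization parameter `eta`, and the fixed iteration count are assumptions.

```python
import math


def sinkhorn_plan(xs, ys, a, b, eta=2.0, n_iters=200):
    """Entropy-regularized transport plan between weighted 1-D point sets.

    xs, ys: point locations; a, b: positive weights summing to 1.
    With squared-distance cost, exp(-cost / eta) is a Gaussian kernel matrix.
    """
    n, m = len(xs), len(ys)
    K = [[math.exp(-((x - y) ** 2) / eta) for y in ys] for x in xs]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, xs, ys):
    """Squared-distance cost of a plan; an upper bound on the squared
    2-Wasserstein distance when the plan's marginals are (near) exact."""
    return sum(plan[i][j] * (xs[i] - ys[j]) ** 2
               for i in range(len(xs)) for j in range(len(ys)))
```

The only operations on `K` are matrix-vector products, which is why replacing `K` with a low-rank factorization speeds up every iteration.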

    Gradient-based kernel dimension reduction for supervised learning

    This paper proposes a novel kernel approach to linear dimension reduction for supervised learning. The purpose of the dimension reduction is to find directions in the input space to explain the output as effectively as possible. The proposed method uses an estimator for the gradient of the regression function, based on the covariance operators on reproducing kernel Hilbert spaces. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the distributions or the type of variables, and uses computationally simple eigendecomposition. Experimental results show that the proposed method successfully finds the effective directions with efficient computation. Comment: 21 page

    Dynamic Mode Decomposition based feature for Image Classification

    Although machine learning has produced groundbreaking results, it demands an enormous amount of data to do so. Even though data production is at an all-time high, almost all of the data is unlabelled, making it unsuitable for training the algorithms. This paper proposes a novel method of extracting features using Dynamic Mode Decomposition (DMD). The experiment is performed using data samples from Imagenet. The learning is done using SVM-linear, SVM-RBF, and the Random Kitchen Sink (RKS) approach. The results show that DMD features with RKS give competitive results. Comment: Selected for Spotlight presentation at TENCON 201

    GP-select: Accelerating EM using adaptive subspace preselection

    We propose a nonparametric procedure to achieve fast inference in generative graphical models when the number of latent states is very large. The approach is based on iterative latent variable preselection, where we alternate between learning a 'selection function' to reveal the relevant latent variables, and using it to obtain a compact approximation of the posterior distribution for EM; this can make inference possible where the number of possible latent states is e.g. exponential in the number of latent variables, whereas an exact approach would be computationally infeasible. We learn the selection function entirely from the observed data and current EM state via Gaussian process regression. This contrasts with earlier approaches, where selection functions were manually designed for each problem setting. We show that our approach performs as well as these bespoke selection functions on a wide variety of inference problems: in particular, for the challenging case of a hierarchical model for object localization with occlusion, we achieve results that match a customized state-of-the-art selection method, at a far lower computational cost.

    Random Maxout Features

    In this paper, we propose and study random maxout features, which are constructed by first projecting the input data onto sets of randomly generated vectors with Gaussian elements, and then outputting the maximum projection value for each set. We show that the resulting random feature map, when used in conjunction with linear models, allows for the locally linear estimation of the function of interest in classification tasks, and for the locally linear embedding of points when used for dimensionality reduction or data visualization. We derive generalization bounds for learning that assess the error in approximating locally linear functions by linear functions in the maxout feature space, and empirically evaluate the efficacy of the approach on the MNIST and TIMIT classification tasks.
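
The construction described in this abstract can be sketched directly: draw several sets of Gaussian vectors, project the input onto each vector, and keep the maximum projection per set. The function name, the parameters `n_sets` and `set_size`, and the use of standard normal entries without any scaling are assumptions made for illustration.

```python
import random


def make_random_maxout_map(dim, n_sets, set_size, seed=0):
    """Build a random maxout feature map from `dim` inputs to `n_sets` outputs.

    Each output is the maximum of `set_size` random Gaussian projections.
    """
    rng = random.Random(seed)
    # weight_sets[s][t] is the t-th random vector of the s-th set.
    weight_sets = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                    for _ in range(set_size)]
                   for _ in range(n_sets)]

    def feature_map(x):
        features = []
        for vectors in weight_sets:
            projections = [sum(w_i * x_i for w_i, x_i in zip(w, x))
                           for w in vectors]
            features.append(max(projections))
        return features

    return feature_map
```

Each feature is a maximum of linear functions and therefore piecewise linear in the input, which is consistent with the locally linear behaviour the abstract describes when a linear model is trained on top.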