
    Parsimonious Mahalanobis Kernel for the Classification of High Dimensional Data

    The classification of high dimensional data with kernel methods is considered in this article. Exploiting the emptiness property of high dimensional spaces, a kernel based on the Mahalanobis distance is proposed. The computation of the Mahalanobis distance requires the inversion of a covariance matrix. In high dimensional spaces, the estimated covariance matrix is ill-conditioned and its inversion is unstable or impossible. Using a parsimonious statistical model, namely the High Dimensional Discriminant Analysis model, the specific signal and noise subspaces are estimated for each considered class, making the inverse of the class-specific covariance matrix explicit and stable, and leading to the definition of a parsimonious Mahalanobis kernel. An SVM-based framework is used for selecting the hyperparameters of the parsimonious Mahalanobis kernel by optimizing the so-called radius-margin bound. Experimental results on three high dimensional data sets show that the proposed kernel is suitable for classifying high dimensional data, providing better classification accuracies than the conventional Gaussian kernel.
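
    As a rough illustration of the idea (not the paper's exact HDDA estimator), the sketch below builds a stable precision matrix by keeping the top-d "signal" eigenvalues and pooling the rest into a single noise level, then plugs the resulting Mahalanobis distance into a Gaussian-type kernel; the signal dimension d and the scale gamma stand in for the hyperparameters the paper tunes via the radius-margin bound.

```python
import numpy as np

def parsimonious_precision(X, d):
    """Stable inverse covariance: keep the top-d eigenvalues (signal)
    and replace the remaining ones by their mean (pooled noise floor)."""
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)            # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # descending order
    noise = max(vals[d:].mean(), 1e-12)       # guard against a singular estimate
    inv_vals = np.concatenate([1.0 / vals[:d],
                               np.full(len(vals) - d, 1.0 / noise)])
    return (vecs * inv_vals) @ vecs.T         # V diag(1/lambda) V'

def mahalanobis_kernel(X, Y, P, gamma=1.0):
    """Gaussian-type kernel on the Mahalanobis distance (x-y)' P (x-y)."""
    D = X[:, None, :] - Y[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', D, P, D)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))                # n < p: a plain inverse would fail
P = parsimonious_precision(X, d=5)
K = mahalanobis_kernel(X, X, P)
print(K.shape, np.allclose(K, K.T))
```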

    Sparse multinomial kernel discriminant analysis (sMKDA)

    Dimensionality reduction via canonical variate analysis (CVA) is important for pattern recognition and has been extended variously to permit more flexibility, e.g. by "kernelizing" the formulation. This can lead to over-fitting, usually ameliorated by regularization. Here, a method for sparse, multinomial kernel discriminant analysis (sMKDA) is proposed, using a sparse basis to control complexity. It is based on the connection between CVA and least-squares, and uses forward selection via orthogonal least-squares to approximate a basis, generalizing a similar approach for binomial problems. Classification can be performed directly via minimum Mahalanobis distance in the canonical variates. sMKDA achieves state-of-the-art performance in terms of accuracy and sparseness on 11 benchmark datasets.
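
    A toy sketch of the flavour of the method follows. It substitutes a simple residual-based forward selection for the paper's orthogonal least-squares recursion, and an argmax over least-squares scores for the minimum Mahalanobis rule, so it should be read as illustrative only.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def forward_select_basis(K, T, n_basis):
    """Greedy forward selection: pick the kernel columns that most reduce
    the least-squares residual to the class-indicator matrix T."""
    selected, resid, W = [], T.copy(), None
    for _ in range(n_basis):
        scores = [(np.linalg.norm(K[:, j] @ resid) ** 2 / (K[:, j] @ K[:, j]), j)
                  for j in range(K.shape[1]) if j not in selected]
        _, best = max(scores)
        selected.append(best)
        B = K[:, selected]
        W, *_ = np.linalg.lstsq(B, T, rcond=None)  # refit on the chosen basis
        resid = T - B @ W
    return selected, W

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10)); y = rng.integers(0, 3, 120)
X[y == 1] += 2.0; X[y == 2] -= 2.0
T = np.eye(3)[y]                      # one-hot class indicators
K = rbf(X, X)
basis, W = forward_select_basis(K, T, n_basis=10)
Z = K[:, basis] @ W                   # canonical-variate-style scores
pred = Z.argmax(1)                    # Euclidean stand-in for the paper's
                                      # minimum Mahalanobis decision rule
print("train accuracy:", (pred == y).mean())
```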

    An Emergent Space for Distributed Data with Hidden Internal Order through Manifold Learning

    Manifold-learning techniques are routinely used in mining complex spatiotemporal data to extract useful, parsimonious data representations/parametrizations; these are, in turn, useful in nonlinear model identification tasks. We focus here on the case of time series data that can ultimately be modelled as a spatially distributed system (e.g. a partial differential equation, PDE), but where we do not know the space in which this PDE should be formulated. Hence, even the spatial coordinates of the distributed system themselves need to be identified, that is, to emerge from the data mining process. We will first validate this emergent space reconstruction for time series sampled without space labels in known PDEs; this brings up the issue of observability of physical space from temporal observation data, and the transition from spatially resolved to lumped (order-parameter-based) representations by tuning the scale of the data mining kernels. We will then present actual emergent space discovery illustrations. Our illustrative examples include chimera states (states of coexisting coherent and incoherent dynamics), and chaotic as well as quasiperiodic spatiotemporal dynamics, arising in partial differential equations and/or in heterogeneous networks. We also discuss how data-driven spatial coordinates can be extracted in ways invariant to the nature of the measuring instrument. Such gauge-invariant data mining can go beyond the fusion of heterogeneous observations of the same system, to the possible matching of apparently different systems.
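
    The following minimal sketch conveys the spirit of emergent-space reconstruction, with scikit-learn's SpectralEmbedding standing in for the paper's manifold-learning pipeline and a simple traveling wave standing in for the PDE data; only the ordering of the recovered coordinate is meaningful.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Time series from a traveling wave u(x, t) = sin(x - t), with the spatial
# labels deliberately shuffled: each row is one (unlabeled) sensor's series.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 2 * np.pi, 200))
t = np.linspace(0.0, 20.0, 500)
U = np.sin(x[:, None] - t[None, :])
U = U[rng.permutation(len(x))]        # forget where each series came from

# Embed the series; phases on a ring embed as a circle, so two components
# are needed and the angle serves as the emergent spatial coordinate.
emb = SpectralEmbedding(n_components=2, affinity="rbf",
                        gamma=0.01).fit_transform(U)
x_emergent = np.arctan2(emb[:, 1], emb[:, 0])  # recovers x up to
                                               # rotation/reflection of the ring
print(x_emergent.shape)
```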

    Regression modelling with I-priors

    We introduce the I-prior methodology as a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, and longitudinal models, and models with functional covariates and responses. It can also be used for multi-class classification, with low or high dimensional covariates. The I-prior is generally defined as a maximum entropy prior. For a regression function, the I-prior is Gaussian with covariance kernel proportional to the Fisher information on the regression function, which is estimated by its posterior distribution under the I-prior. The I-prior has the intuitively appealing property that the more information is available on a linear functional of the regression function, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution. Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies model comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected. Using a number of widely analyzed real data sets, we show that the predictive performance of our methodology is competitive. An R package implementing the methodology is available (Jamil, 2019).
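
    A minimal numpy sketch of the posterior-mean computation under iid noise follows. It takes the prior covariance of the regression function at the data points to be proportional to the squared Gram matrix (the form the Fisher information takes in this setting, per the abstract's description), with the scale lambda and noise precision psi fixed by hand rather than estimated by the paper's EM algorithm; the intercept is handled crudely by centring.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

K = rbf(X, X)
lam, psi = 1.0, 100.0           # scale and noise precision; EM-estimated
                                # in the paper, fixed by hand here
C = lam**2 * psi * K @ K        # I-prior covariance: proportional to the
                                # Fisher information on f in this setting
# Posterior mean of f given y, under f ~ N(0, C) and y | f ~ N(f, I/psi):
yc = y - y.mean()
f_hat = C @ np.linalg.solve(C + np.eye(len(y)) / psi, yc) + y.mean()
print("mean abs error:", np.abs(f_hat - np.sin(X[:, 0])).mean())
```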

    Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications

    Food authenticity studies are concerned with determining whether food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labelled and unlabelled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
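
    The sketch below illustrates the headlong idea only: it scans variables in order and accepts the first one that improves the score (rather than evaluating all candidates and taking the best), but it uses cross-validated LDA accuracy on a stock dataset as the score rather than the paper's criterion on a model-based classifier fitted semi-supervised.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def headlong_forward(X, y, eps=0.005):
    """Headlong-style greedy search: accept the FIRST variable that improves
    the score by more than eps, repeating until a full pass adds nothing."""
    def score(cols):
        return cross_val_score(LinearDiscriminantAnalysis(),
                               X[:, cols], y, cv=5).mean()
    selected, best, improved = [], -np.inf, True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            if j in selected:
                continue
            s = score(selected + [j])
            if s > best + eps:
                selected.append(j); best = s; improved = True
                break               # headlong: take the first improvement
    return selected, best

X, y = load_wine(return_X_y=True)
cols, acc = headlong_forward(X, y)
print("selected variables:", cols, "cv accuracy: %.3f" % acc)
```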

    An Illustration of New Methods in Machine Condition Monitoring, Part II: Adaptive outlier detection

    There have been many recent developments in the application of data-based methods to machine condition monitoring. A powerful methodology based on machine learning has emerged, where diagnostics are based on a two-step procedure: extraction of damage-sensitive features, followed by unsupervised learning (novelty detection) or supervised learning (classification). The objective of the current pair of papers is simply to illustrate one state-of-the-art procedure for each step, using synthetic data representative of reality in terms of size and complexity. The second paper in the pair will deal with novelty detection. Although there has been considerable progress in the use of outlier analysis for novelty detection, most of the papers produced so far have suffered from the fact that simple algorithms break down if multiple outliers are present or if damage is already present in a training set. The objective of the current paper is to illustrate the use of phase-space thresholding, an algorithm that has the ability to detect multiple outliers inclusively in a data set.
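
    As a generic illustration of why multiple outliers defeat naive detectors (not the phase-space thresholding algorithm itself), the sketch below uses scikit-learn's Minimum Covariance Determinant estimator: a cluster of outliers inflates a naive sample covariance and masks itself, whereas robust Mahalanobis distances to the MCD fit expose the whole cluster at once.

```python
import numpy as np
from sklearn.covariance import MinCovDet
from scipy.stats import chi2

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
X[:20] += 6.0                      # a cluster of outliers: enough to mask
                                   # each other under a naive sample covariance

# Robust location/scatter from the Minimum Covariance Determinant estimator;
# squared Mahalanobis distances to the robust fit flag all 20 outliers.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)
threshold = chi2.ppf(0.999, df=X.shape[1])   # chi-squared cutoff for inliers
print("flagged:", np.flatnonzero(d2 > threshold)[:25])
```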