Parsimonious Mahalanobis Kernel for the Classification of High Dimensional Data
The classification of high dimensional data with kernel methods is considered
in this article. Exploiting the emptiness property of high dimensional
spaces, a kernel based on the Mahalanobis distance is proposed. The computation
of the Mahalanobis distance requires the inversion of a covariance matrix. In
high dimensional spaces, the estimated covariance matrix is ill-conditioned and
its inversion is unstable or impossible. Using a parsimonious statistical
model, namely the High Dimensional Discriminant Analysis model, the specific
signal and noise subspaces are estimated for each considered class making the
inverse of the class specific covariance matrix explicit and stable, leading to
the definition of a parsimonious Mahalanobis kernel. An SVM-based framework is
used for selecting the hyperparameters of the parsimonious Mahalanobis kernel
by optimizing the so-called radius-margin bound. Experimental results on three
high dimensional data sets show that the proposed kernel is suitable for
classifying high dimensional data, providing better classification accuracies
than the conventional Gaussian kernel.
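The idea of a stable, explicit inverse covariance can be illustrated with a minimal sketch. It assumes a single class, a signal subspace spanned by the `d` leading eigenvectors of the sample covariance, and one shared noise variance for all remaining directions. The function names, the single `gamma` hyperparameter and the toy data are illustrative assumptions, not the formulation or settings used in the article.

```python
import numpy as np

def subspace_inverse_covariance(X, d):
    """Explicit inverse covariance: d signal eigenpairs plus one shared noise variance.

    Hypothetical helper for illustration; X has shape (n_samples, n_features).
    """
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    cov = Xc.T @ Xc / n
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Q = eigvecs[:, :d]                               # signal subspace basis
    a = eigvals[:d]                                  # signal variances
    b = (np.trace(cov) - a.sum()) / (p - d)          # average variance of the noise subspace
    # The inverse acts as 1/a_i on the signal directions and 1/b on their complement.
    return Q @ np.diag(1.0 / a) @ Q.T + (np.eye(p) - Q @ Q.T) / b

def mahalanobis_kernel(X, Z, inv_cov, gamma=0.5):
    """Gaussian-type kernel exp(-gamma * squared Mahalanobis distance)."""
    diff = X[:, None, :] - Z[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, inv_cov, diff)
    return np.exp(-gamma * d2)

# Toy data with fewer samples than features, so the sample covariance is singular.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
inv_cov = subspace_inverse_covariance(X, d=3)
K = mahalanobis_kernel(X, X, inv_cov)
```

Because the inverse is assembled from eigenpairs rather than computed by inverting the sample covariance, it remains well defined even when there are fewer samples than dimensions.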
Sparse multinomial kernel discriminant analysis (sMKDA)
Dimensionality reduction via canonical variate analysis (CVA) is important for pattern recognition and has been extended in various ways to permit more flexibility, e.g. by "kernelizing" the formulation. This can lead to over-fitting, usually ameliorated by regularization. Here, a method for sparse, multinomial kernel discriminant analysis (sMKDA) is proposed, using a sparse basis to control complexity. It is based on the connection between CVA and least-squares, and uses forward selection via orthogonal least-squares to approximate a basis, generalizing a similar approach for binomial problems. Classification can be performed directly via minimum Mahalanobis distance in the canonical variates. sMKDA achieves state-of-the-art performance in terms of accuracy and sparseness on 11 benchmark datasets.
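The final classification step, minimum Mahalanobis distance, can be sketched as follows. The sketch assumes the canonical variate scores `Z` have already been computed by some other routine and uses a pooled within-class covariance; the function names and the synthetic scores are hypothetical and are not taken from the sMKDA paper.

```python
import numpy as np

def fit_min_mahalanobis(Z, y):
    """Estimate class means and the inverse pooled within-class covariance."""
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]   # subtract each sample's class mean
    pooled = resid.T @ resid / (len(Z) - len(classes))
    return classes, means, np.linalg.inv(pooled)

def predict_min_mahalanobis(Z_new, classes, means, inv_cov):
    """Assign each row to the class whose mean is nearest in Mahalanobis distance."""
    diff = Z_new[:, None, :] - means[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, inv_cov, diff)
    return classes[np.argmin(d2, axis=1)]

# Synthetic stand-in for canonical variate scores: three well-separated classes in 2-D.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y = np.repeat(np.arange(3), 30)
Z = centres[y] + rng.normal(size=(90, 2))
model = fit_min_mahalanobis(Z, y)
pred = predict_min_mahalanobis(Z, *model)
```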
An Emergent Space for Distributed Data with Hidden Internal Order through Manifold Learning
Manifold-learning techniques are routinely used in mining complex
spatiotemporal data to extract useful, parsimonious data
representations/parametrizations; these are, in turn, useful in nonlinear model
identification tasks. We focus here on the case of time series data that can
ultimately be modelled as a spatially distributed system (e.g. a partial
differential equation, PDE), but where we do not know the space in which this
PDE should be formulated. Hence, even the spatial coordinates for the
distributed system themselves need to be identified - to emerge from - the data
mining process. We will first validate this emergent space reconstruction for
time series sampled without space labels in known PDEs; this brings up the
issue of observability of physical space from temporal observation data, and
the transition from spatially resolved to lumped (order-parameter-based)
representations by tuning the scale of the data mining kernels. We will then
present actual emergent space discovery illustrations. Our illustrative
examples include chimera states (states of coexisting coherent and incoherent
dynamics), and chaotic as well as quasiperiodic spatiotemporal dynamics,
arising in partial differential equations and/or in heterogeneous networks. We
also discuss how data-driven spatial coordinates can be extracted in ways
invariant to the nature of the measuring instrument. Such gauge-invariant data
mining can go beyond the fusion of heterogeneous observations of the same
system, to the possible matching of apparently different systems.
Regression modelling with I-priors
We introduce the I-prior methodology as a unifying framework for estimating a
variety of regression models, including varying coefficient, multilevel, and
longitudinal models, as well as models with functional covariates and responses. It
can also be used for multi-class classification, with low or high dimensional
covariates.
The I-prior is generally defined as a maximum entropy prior. For a regression
function, the I-prior is Gaussian with covariance kernel proportional to the
Fisher information on the regression function, which is estimated by its
posterior distribution under the I-prior. The I-prior has the intuitively
appealing property that the more information is available on a linear
functional of the regression function, the larger the prior variance, and the
smaller the influence of the prior mean on the posterior distribution.
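As a rough illustration of the statement that the prior covariance is proportional to the Fisher information, the construction can be written out under assumed notation: a regression function $f$ in a reproducing kernel Hilbert space with kernel $h$, observations $y_i = f(x_i) + \varepsilon_i$, and Gaussian errors with precision matrix $\Psi = (\psi_{ij})$. These symbols are introduced here for illustration and do not appear in the abstract.

```latex
% Assumed model: y_i = f(x_i) + \varepsilon_i, with
% (\varepsilon_1, \dots, \varepsilon_n)^\top \sim N(0, \Psi^{-1}).
\begin{align*}
  \mathcal{I}\bigl[f(x), f(x')\bigr]
    &= \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_{ij}\, h(x, x_i)\, h(x', x_j), \\
  f(x)
    &= f_0(x) + \sum_{i=1}^{n} h(x, x_i)\, w_i,
    \qquad (w_1, \dots, w_n)^\top \sim N(0, \Psi), \\
  \operatorname{Cov}\bigl[f(x), f(x')\bigr]
    &= \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_{ij}\, h(x, x_i)\, h(x', x_j)
     = \mathcal{I}\bigl[f(x), f(x')\bigr].
\end{align*}
```

Here $f_0$ denotes the prior mean. Under these assumptions the prior covariance of $f$ coincides with its Fisher information, so the prior variance is larger exactly where the data carry more information about $f$.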
Advantages compared to competing methods, such as Gaussian process regression
or Tikhonov regularization, are ease of estimation and model comparison. In
particular, we develop an EM algorithm with a simple E and M step for
estimating hyperparameters, facilitating estimation for complex models. We also
propose a novel parsimonious model formulation, requiring a single scale
parameter for each (possibly multidimensional) covariate and no further
parameters for interaction effects. This simplifies estimation because fewer
hyperparameters need to be estimated, and also simplifies model comparison of
models with the same covariates but different interaction effects; in this
case, the model with the highest estimated likelihood can be selected.
Using a number of widely analyzed real data sets we show that predictive
performance of our methodology is competitive. An R-package implementing the
methodology is available (Jamil, 2019).
Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications
Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
An Illustration of New Methods in Machine Condition Monitoring, Part II: Adaptive outlier detection
There have been many recent developments in the application of data-based
methods to machine condition monitoring. A powerful methodology based on machine learning
has emerged, where diagnostics are based on a two-step procedure: extraction of damage-sensitive
features, followed by unsupervised learning (novelty detection) or supervised learning
(classification). The objective of the current pair of papers is simply to illustrate one state-of-the-art
procedure for each step, using synthetic data representative of reality in terms of size
and complexity. The second paper in the pair will deal with novelty detection. Although there
has been considerable progress in the use of outlier analysis for novelty detection, most of the
papers produced so far have suffered from the fact that simple algorithms break down if multiple
outliers are present or if damage is already present in a training set. The objective of the current
paper is to illustrate the use of phase-space thresholding, an algorithm that can
detect multiple outliers inclusively in a data set.
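A heavily simplified sketch of the phase-space idea is given below. It assumes a one-dimensional signal, a two-dimensional phase space formed by the signal and its first difference, robust (median-based) scale estimates, and an elliptical boundary sized by the universal threshold sqrt(2 ln n). All names are hypothetical, and the version described in the paper may differ, for example by using further differences, rotating the axes or iterating.

```python
import numpy as np

def robust_scale(v):
    """Median absolute deviation, scaled to match the standard deviation of a Gaussian."""
    return 1.4826 * np.median(np.abs(v - np.median(v)))

def phase_space_outliers(u):
    """Return indices of samples lying outside an ellipse in (signal, first difference) space.

    Simplified, non-iterative illustration; not the algorithm as published.
    """
    u = np.asarray(u, dtype=float)
    centred = u - np.median(u)
    du = np.gradient(centred)                        # central first difference
    lam = np.sqrt(2.0 * np.log(len(u)))              # universal threshold
    score = (centred / (lam * robust_scale(centred))) ** 2 \
          + (du / (lam * robust_scale(du))) ** 2
    return np.flatnonzero(score > 1.0)

# Synthetic example: a clean sine wave, then the same wave with one injected spike.
t = np.linspace(0.0, 4.0 * np.pi, 400)
clean = np.sin(t)
spiked = clean.copy()
spiked[200] = 10.0
flagged_clean = phase_space_outliers(clean)
flagged_spiked = phase_space_outliers(spiked)
```

Because the scales are estimated from medians rather than means and standard deviations, a few contaminated samples have little influence on where the boundary is drawn. In this sketch the neighbours of a spike may also be flagged, since their first differences are large.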