Feature Selection for Complex Systems Monitoring: an Application using Data Fusion
The emergence of automated and flexible production means creates a need for robust monitoring systems. Such systems aim to estimate the state of the production process by deriving it as a function of critical variables, called features, that characterize the process condition. The problem of feature selection, which consists in finding, from an original set of features, a subset for which the estimation accuracy of the monitoring system is as high as possible, is therefore of major importance for sensor-based monitoring applications. In real-world applications, feature selection can be difficult because the available data collections are imperfect: depending on the data acquisition conditions and the operating conditions of the monitored process, they can be heterogeneous, incomplete, imprecise, contradictory, or erroneous. Classical feature selection techniques lack solutions for dealing with uncertain data coming from different collections. Data fusion provides ways to process these data collections together in order to achieve coherent feature selection, even in difficult cases involving imperfect data. In this work, condition monitoring of the tool in industrial drilling systems serves as a basis to demonstrate how data fusion techniques can be used to perform feature selection in such difficult cases.
An Open Source Pattern Recognition Toolbox for MATLAB
Pattern recognition and machine learning are becoming integral parts of
algorithms in a wide range of applications. Different algorithms and approaches
for machine learning include different tradeoffs between performance and
computation, so during algorithm development it is often necessary to explore a
variety of different approaches to a given task. A toolbox with a unified
framework across multiple pattern recognition techniques enables algorithm
developers to rapidly evaluate different choices prior to
deployment. MATLAB is a widely used environment for algorithm development and
prototyping, and although several MATLAB toolboxes for pattern recognition are
currently available, these are either incomplete, expensive, or restrictively
licensed. In this work we describe a MATLAB toolbox for pattern recognition and
machine learning known as the PRT (Pattern Recognition Toolbox), licensed under
the permissive MIT license. The PRT includes many popular techniques for data
preprocessing, supervised learning, clustering, regression and feature
selection, as well as a methodology for combining these components using a
simple, uniform syntax. The resulting algorithms can be evaluated using
cross-validation and a variety of scoring metrics to ensure robust performance
when the algorithm is deployed. This paper presents an overview of the PRT as
well as an example of usage on Fisher's Iris dataset.
Distribution of Mutual Information from Complete and Incomplete Data
Mutual information is widely used, in a descriptive way, to measure the
stochastic dependence of categorical random variables. In order to address
questions such as the reliability of the descriptive value, one must consider
sample-to-population inferential approaches. This paper deals with the
posterior distribution of mutual information, as obtained in a Bayesian
framework by a second-order Dirichlet prior distribution. The exact analytical
expression for the mean, and analytical approximations for the variance,
skewness and kurtosis are derived. These approximations have a guaranteed
accuracy level of the order O(1/n^3), where n is the sample size. Leading order
approximations for the mean and the variance are derived in the case of
incomplete samples. The derived analytical expressions allow the distribution
of mutual information to be approximated reliably and quickly. In fact, the
derived expressions can be computed with the same order of complexity needed
for descriptive mutual information. This makes the distribution of mutual
information a concrete alternative to descriptive mutual information in
many applications which would benefit from moving to the inductive side. Some
of these prospective applications are discussed, and one of them, namely
feature selection, is shown to perform significantly better when inductive
mutual information is used.
Comment: 26 pages, LaTeX, 5 figures, 4 tables
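For orientation, the quantities this abstract contrasts can be written out as follows. This is a sketch under stated assumptions, not a transcription of the paper: the symbols $n_{ij}$ (sample counts), $\alpha_{ij}$ (prior counts), $\tilde{n}_{ij}$ and $\psi$ (the digamma function) are notation chosen here, and the closed form for the posterior mean is recalled from the literature on this setting rather than stated in the abstract, so it should be checked against the paper before use.

```latex
% Descriptive (empirical) mutual information of an r x s contingency table
% with sample counts n_{ij}, marginals n_{i+}, n_{+j} and sample size n.
\[
  I(\hat{\pi}) = \sum_{i,j} \frac{n_{ij}}{n}
                 \log \frac{n_{ij}\, n}{n_{i+}\, n_{+j}}
\]

% Bayesian treatment: a Dirichlet prior with counts \alpha_{ij} gives a
% Dirichlet posterior over the joint chances \pi_{ij}, with updated counts
% \tilde{n}_{ij} = n_{ij} + \alpha_{ij} (assumed notation).
\[
  p(\pi \mid n) \propto \prod_{i,j} \pi_{ij}^{\,\tilde{n}_{ij} - 1}
\]

% Posterior mean of I(\pi), with \tilde{n}_{i+}, \tilde{n}_{+j}, \tilde{n}
% the marginals and total of the updated counts.
\[
  E[I] = \frac{1}{\tilde{n}} \sum_{i,j} \tilde{n}_{ij}
         \left[ \psi(\tilde{n}_{ij} + 1) - \psi(\tilde{n}_{i+} + 1)
              - \psi(\tilde{n}_{+j} + 1) + \psi(\tilde{n} + 1) \right]
\]
```

As a consistency check, substituting $\psi(x+1) \approx \log x + 1/(2x)$ for large counts turns the last expression into the plug-in mutual information of the updated counts plus $(r-1)(s-1)/(2\tilde{n})$, which is the familiar leading-order bias of the plug-in estimate.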
Robust Feature Selection by Mutual Information Distributions
Mutual information is widely used in artificial intelligence, in a
descriptive way, to measure the stochastic dependence of discrete random
variables. In order to address questions such as the reliability of the
empirical value, one must consider sample-to-population inferential approaches.
This paper deals with the distribution of mutual information, as obtained in a
Bayesian framework by a second-order Dirichlet prior distribution. The exact
analytical expression for the mean and an analytical approximation of the
variance are reported. Asymptotic approximations of the distribution are
proposed. The results are applied to the problem of selecting features for
incremental learning and classification of the naive Bayes classifier. A fast,
newly defined method is shown to outperform the traditional approach based on
empirical mutual information on a number of real data sets. Finally, a
theoretical development is reported that allows one to efficiently extend the
above methods to incomplete samples in an easy and effective way.
Comment: 8 two-column pages
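The idea of filtering features by the distribution of mutual information, rather than by its empirical value, can be illustrated with a minimal Python sketch. It is not the paper's method: the paper uses analytical expressions, whereas this sketch substitutes plain Monte Carlo sampling from the Dirichlet posterior. The function names, the uniform prior count of 1 per cell, the threshold `eps`, and the `credibility` level are all assumptions made for illustration.

```python
import math
import random


def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table given as a list of rows."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return sum(
        p * math.log(p / (rows[i] * cols[j]))
        for i, r in enumerate(joint)
        for j, p in enumerate(r)
        if p > 0
    )


def contingency(feature, labels):
    """Count table n_ij for a discrete feature (rows) against class labels (columns)."""
    f_vals = sorted(set(feature))
    c_vals = sorted(set(labels))
    table = [[0] * len(c_vals) for _ in f_vals]
    for f, c in zip(feature, labels):
        table[f_vals.index(f)][c_vals.index(c)] += 1
    return table


def prob_mi_exceeds(counts, eps, prior=1.0, draws=2000, seed=0):
    """Monte Carlo estimate of P(I > eps | data) under a Dirichlet posterior.

    Assumption: a uniform prior count `prior` is added to every cell.
    """
    rng = random.Random(seed)
    width = len(counts[0])
    alphas = [c + prior for row in counts for c in row]
    hits = 0
    for _ in range(draws):
        # A Dirichlet draw is a vector of independent gamma draws, normalised.
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(g)
        joint = [[x / total for x in g[k:k + width]] for k in range(0, len(g), width)]
        if mutual_information(joint) > eps:
            hits += 1
    return hits / draws


def select_features(columns, labels, eps=0.01, credibility=0.95):
    """Keep a feature only if its mutual information with the class exceeds eps with high posterior probability."""
    return [
        name
        for name, values in columns.items()
        if prob_mi_exceeds(contingency(values, labels), eps) >= credibility
    ]


# Hypothetical toy data: one feature copies the class, the other is independent of it.
labels = [0, 1] * 50
columns = {"informative": list(labels), "noise": [0, 0, 1, 1] * 25}
selected = select_features(columns, labels)
```

A filter based on the empirical value alone would compare a single number with `eps`; the posterior probability also takes the sample size into account, so a feature observed on few samples needs stronger apparent dependence to be kept.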
Shape Interaction Matrix Revisited and Robustified: Efficient Subspace Clustering with Corrupted and Incomplete Data
The Shape Interaction Matrix (SIM) is one of the earliest approaches to
performing subspace clustering (i.e., separating points drawn from a union of
subspaces). In this paper, we revisit the SIM and reveal its connections to
several recent subspace clustering methods. Our analysis lets us derive a
simple, yet effective algorithm to robustify the SIM and make it applicable to
realistic scenarios where the data is corrupted by noise. We justify our method
by intuitive examples and the matrix perturbation theory. We then show how this
approach can be extended to handle missing data, thus yielding an efficient and
general subspace clustering algorithm. We demonstrate the benefits of our
approach over state-of-the-art subspace clustering methods on several
challenging motion segmentation and face clustering problems, where the data
includes corrupted and missing measurements.
Comment: This is an extended version of our ICCV15 paper
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.