
    Feature Selection for Complex Systems Monitoring: an Application using Data Fusion

    The emergence of automated and flexible means of production creates a need for robust monitoring systems. Such systems aim to estimate the state of the production process by deriving it as a function of critical variables, called features, that characterize the process condition. The problem of feature selection, which consists, given an original set of features, in finding a subset such that the estimation accuracy of the monitoring system is the highest possible, is therefore of major importance for sensor-based monitoring applications. In real-world applications, feature selection can be tricky because of imperfections in the available data collections: depending on the data acquisition conditions and the operating conditions of the monitored process, they can be heterogeneous, incomplete, imprecise, contradictory, or erroneous. Classical feature selection techniques lack solutions for dealing with uncertain data coming from different collections. Data fusion provides ways to process these data collections together in order to achieve coherent feature selection, even in difficult cases involving imperfect data. In this work, condition monitoring of the tool in industrial drilling systems serves as a basis to demonstrate how data fusion techniques can be used to perform feature selection in such difficult cases.

    An Open Source Pattern Recognition Toolbox for MATLAB

    Pattern recognition and machine learning are becoming integral parts of algorithms in a wide range of applications. Different algorithms and approaches for machine learning involve different tradeoffs between performance and computation, so during algorithm development it is often necessary to explore a variety of approaches to a given task. A toolbox with a unified framework across multiple pattern recognition techniques gives algorithm developers the ability to rapidly evaluate different choices prior to deployment. MATLAB is a widely used environment for algorithm development and prototyping, and although several MATLAB toolboxes for pattern recognition are currently available, these are either incomplete, expensive, or restrictively licensed. In this work we describe a MATLAB toolbox for pattern recognition and machine learning known as the PRT (Pattern Recognition Toolbox), licensed under the permissive MIT license. The PRT includes many popular techniques for data preprocessing, supervised learning, clustering, regression and feature selection, as well as a methodology for combining these components using a simple, uniform syntax. The resulting algorithms can be evaluated using cross-validation and a variety of scoring metrics to ensure robust performance when the algorithm is deployed. This paper presents an overview of the PRT as well as an example of usage on Fisher's Iris dataset.

    Distribution of Mutual Information from Complete and Incomplete Data

    Mutual information is widely used, in a descriptive way, to measure the stochastic dependence of categorical random variables. In order to address questions such as the reliability of the descriptive value, one must consider sample-to-population inferential approaches. This paper deals with the posterior distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean, and analytical approximations for the variance, skewness and kurtosis are derived. These approximations have a guaranteed accuracy level of the order O(1/n^3), where n is the sample size. Leading order approximations for the mean and the variance are derived in the case of incomplete samples. The derived analytical expressions allow the distribution of mutual information to be approximated reliably and quickly. In fact, the derived expressions can be computed with the same order of complexity needed for descriptive mutual information. This makes the distribution of mutual information a concrete alternative to descriptive mutual information in many applications which would benefit from moving to the inductive side. Some of these prospective applications are discussed, and one of them, namely feature selection, is shown to perform significantly better when inductive mutual information is used. Comment: 26 pages, LaTeX, 5 figures, 4 tables
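The abstract above mentions an exact analytical expression for the posterior mean of mutual information under a Dirichlet prior. The listing does not reproduce that expression, so the following is only a minimal sketch under an assumption: that the mean takes the commonly quoted digamma form E[I] = (1/n) Σ_ij n_ij [ψ(n_ij + 1) − ψ(n_i+ + 1) − ψ(n_+j + 1) + ψ(n + 1)], where n_ij are the observed counts plus the prior pseudo-counts. The function names `digamma` and `posterior_mean_mi` and the `prior` parameter are hypothetical, and the formula should be checked against the paper before any real use.

```python
import math


def digamma(x):
    """Digamma function psi(x) for x > 0, via upward recurrence plus an asymptotic series."""
    if x <= 0:
        raise ValueError("digamma is only implemented here for x > 0")
    result = 0.0
    # psi(x) = psi(x + 1) - 1/x: shift x upward until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def posterior_mean_mi(counts, prior=1.0):
    """Posterior mean of mutual information (in nats) for a contingency table.

    ASSUMPTION: uses the digamma closed form stated in the lead-in, with
    `prior` pseudo-counts added to every cell (a symmetric Dirichlet prior).
    """
    n_ij = [[c + prior for c in row] for row in counts]
    row_tot = [sum(row) for row in n_ij]
    col_tot = [sum(col) for col in zip(*n_ij)]
    n = sum(row_tot)
    total = 0.0
    for i, row in enumerate(n_ij):
        for j, v in enumerate(row):
            total += v * (
                digamma(v + 1)
                - digamma(row_tot[i] + 1)
                - digamma(col_tot[j] + 1)
                + digamma(n + 1)
            )
    return total / n


dependent = posterior_mean_mi([[50, 0], [0, 50]])
independent = posterior_mean_mi([[25, 25], [25, 25]])
```

In this sketch a strongly dependent table yields a mean well above that of a balanced, independent-looking table, while the latter stays slightly positive, reflecting that the posterior mean accounts for sampling uncertainty rather than reporting exactly zero.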

    Robust Feature Selection by Mutual Information Distributions

    Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported. Asymptotic approximations of the distribution are proposed. The results are applied to the problem of selecting features for incremental learning and classification of the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that allows one to efficiently extend the above methods to incomplete samples in an easy and effective way. Comment: 8 two-column pages
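For context, the "traditional approach based on empirical mutual information" that the abstract compares against can be sketched as a plug-in estimate followed by a ranking. This is not the paper's new method; it is a minimal illustration of the baseline, and the function names `empirical_mi` and `rank_features` and the toy data are hypothetical.

```python
import math
from collections import Counter


def empirical_mi(xs, ys):
    """Plug-in (descriptive) mutual information in nats between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return (feature name, MI with labels) pairs, highest MI first."""
    scores = {name: empirical_mi(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: "copy" determines the class exactly, "noise" is unrelated to it.
labels = [0, 0, 1, 1]
columns = {"noise": [0, 1, 0, 1], "copy": [0, 0, 1, 1]}
ranking = rank_features(columns, labels)
```

A filter of this kind keeps the top-ranked features or those above a threshold. Because the plug-in value is a point estimate, it says nothing about its own reliability on small samples, which is the gap the distribution-based approach described above is meant to address.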

    Shape Interaction Matrix Revisited and Robustified: Efficient Subspace Clustering with Corrupted and Incomplete Data

    The Shape Interaction Matrix (SIM) is one of the earliest approaches to performing subspace clustering (i.e., separating points drawn from a union of subspaces). In this paper, we revisit the SIM and reveal its connections to several recent subspace clustering methods. Our analysis lets us derive a simple, yet effective algorithm to robustify the SIM and make it applicable to realistic scenarios where the data is corrupted by noise. We justify our method by intuitive examples and the matrix perturbation theory. We then show how this approach can be extended to handle missing data, thus yielding an efficient and general subspace clustering algorithm. We demonstrate the benefits of our approach over state-of-the-art subspace clustering methods on several challenging motion segmentation and face clustering problems, where the data includes corrupted and missing measurements. Comment: This is an extended version of our ICCV15 paper

    Machine Learning and Integrative Analysis of Biomedical Big Data

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.