15 research outputs found

    Missing Value Imputation With Unsupervised Backpropagation

    Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real-world data sets, however, often contain unknown values. Even many classification algorithms that are designed to operate with missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this paper, we present a technique for unsupervised learning called Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron to fit the manifold sampled by a set of observed point-vectors. We evaluate UBP on the task of imputing missing values in datasets, and show that UBP is able to predict missing values with significantly lower sum-squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 datasets and 9 supervised learning algorithms that classification accuracy is usually higher when randomly withheld values are imputed using UBP rather than with other methods.

    Ship machinery condition monitoring using vibration data through supervised learning

    This paper presents an integrated methodology for monitoring marine machinery using vibration data. Monitoring of machinery is a crucial aspect of maintenance optimisation, which is required for vessel operation to remain sustainable and profitable. The proposed methodology trains models using pre-classified (healthy/faulty) data and then classifies new data points using the models developed. For this, vibration data points are first acquired, appropriately processed and stored in a database. Specific features are then extracted from the data and stored. These features are then used to train supervised models pertinent to specific machinery components. Finally, new data are compared against the models developed in order to evaluate the condition of the machinery. The above provides a flexible but robust framework for the early detection of emerging machinery faults. This will lead to minimisation of ship downtime and an increase in the ship’s operability and income through operational enhancement.
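
The pipeline described in this abstract (extract features from vibration signals, train on pre-classified healthy/faulty data, then classify new data) can be illustrated with a minimal sketch. The abstract does not name the features or the model, so everything here is an assumption made for illustration: the two features (RMS and peak amplitude), the nearest-centroid classifier, and the function names `extract_features`, `train_centroids` and `classify` are hypothetical and are not taken from the paper.

```python
import math

def extract_features(signal):
    """Return (RMS, peak amplitude) for one vibration signal.
    These two features are an illustrative assumption, not the paper's."""
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    peak = max(abs(x) for x in signal)
    return (rms, peak)

def train_centroids(signals, labels):
    """Average the feature vectors of each class (e.g. 'healthy', 'faulty')."""
    sums, counts = {}, {}
    for signal, label in zip(signals, labels):
        features = extract_features(signal)
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def classify(model, signal):
    """Assign the label whose class centroid is closest in feature space."""
    features = extract_features(signal)
    return min(model, key=lambda label: math.dist(features, model[label]))
```

A model trained this way is a dictionary mapping each label to its mean feature vector; a real system would likely use richer features and a stronger classifier, one per machinery component.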

    K-Clustering Methods for Investigating Social-Environmental and Natural-Environmental Features Based on Air Quality Index

    Air pollution has caused environmental and health hazards across the globe, particularly in emerging countries such as China. In this article, we propose the use of the air quality index and the development of advanced data processing, analysis, and visualization techniques based on the AI-based k-clustering method. We analyze the air quality data based on seven key attributes and discuss the implications. Our results provide meaningful values and contributions to current research. Our future work will include the use of advanced AI algorithms and big data techniques to ensure better performance, accuracy and real-time checks.

    Preprocessing of missing values using robust association rules


    Differentially Private Data Generation with Missing Data

    Although several works succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the problems of DP synthetic data with missing values and propose three effective adaptive strategies that significantly improve the utility of the synthetic data on four real-world datasets with different types and levels of missing data and privacy requirements. We also identify the relationship between the privacy impact on the complete ground truth data and on the incomplete data for these DP synthetic data generation algorithms. We model the missing mechanisms as a sampling process to obtain tighter upper bounds for the privacy guarantees to the ground truth data. Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms in the presence of missing data.
    Comment: 18 pages, 9 figures, 2 tables

    Physics-inspired Replica Approaches to Computer Science Problems

    We study machine learning classification problems and combinatorial optimization problems using physics-inspired replica approaches. In the current work, we focus on the traveling salesman problem, which is one of the most famous problems in the entire field of combinatorial optimization. Our approach is specifically motivated by the desire to avoid trapping in metastable local minima, a common occurrence in hard problems with multiple extrema. Our method involves (i) coupling otherwise independent simulations of a system (“replicas”) via geometrical distances as well as (ii) probabilistic inference applied to the solutions found by individual replicas. In particular, we apply our method to the well-known “k-opt” algorithm and examine two particular cases: k = 2 and k = 3. With the aid of geometrical coupling alone, we are able to determine the optimum tour length on systems of up to 280 cities (an order of magnitude larger than the largest systems typically solved by the bare k = 3 opt). The probabilistic replica-based inference approach improves k-opt even further and determines the optimal solution of a problem with 318 cities. In this work, we also formulate a supervised machine learning algorithm for classification problems, called the “Stochastic Replica Voting Machine” (SRVM). The method is based on representations of known data via multiple linear expansions in terms of various stochastic functions. The algorithm is developed, implemented and applied to binary and 3-class classification problems in materials science. Here, we employ SRVM to predict candidate compounds capable of forming the cubic perovskite structure and further classify binary (AB) solids. We demonstrate that our SRVM method exceeds the well-known Support Vector Machine (SVM) in terms of accuracy when predicting the cubic perovskite structure. The algorithm has also been tested on 8 diverse training data sets of various types and feature space dimensions from the UCI machine learning repository. It has been shown to consistently match or exceed the accuracy of existing algorithms, while simultaneously avoiding many of their pitfalls.
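
For readers unfamiliar with the “k-opt” local search mentioned in this abstract, the following is a minimal sketch of the bare k = 2 case only. It does not implement the replica coupling or the probabilistic inference that the abstract describes; the function names `tour_length` and `two_opt`, and the choice of Euclidean distances between 2-D points, are assumptions made for illustration.

```python
import math

def tour_length(points, tour):
    """Total Euclidean length of a closed tour over the given points."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]])
               for i in range(n))

def two_opt(points, tour):
    """Plain 2-opt: swap pairs of edges while any swap shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    # Reversing the segment between the edges applies the swap.
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

The search stops at a tour that no single 2-edge swap can shorten, which is a local minimum and not necessarily the optimum; this is the trapping behaviour that the replica approaches in the abstract aim to avoid.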