Missing Value Imputation With Unsupervised Backpropagation
Many data mining and data analysis techniques operate on dense matrices or
complete tables of data. Real-world data sets, however, often contain unknown
values. Even many classification algorithms that are designed to operate with
missing values still exhibit deteriorated accuracy. One approach to handling
missing values is to fill in (impute) the missing values. In this paper, we
present a technique for unsupervised learning called Unsupervised
Backpropagation (UBP), which trains a multi-layer perceptron to fit to the
manifold sampled by a set of observed point-vectors. We evaluate UBP with the
task of imputing missing values in datasets, and show that UBP is able to
predict missing values with significantly lower sum-squared error than other
collaborative filtering and imputation techniques. We also demonstrate with 24
datasets and 9 supervised learning algorithms that classification accuracy is
usually higher when randomly withheld values are imputed using UBP rather than
with other methods.
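The imputation idea above can be illustrated with a deliberately simplified, single-layer (linear) variant: learn a latent vector per row and a weight matrix by gradient descent on the observed entries only, then read the missing cells off the reconstruction. This is essentially matrix factorization rather than the multi-layer perceptron the paper trains, and the function name and hyperparameters are illustrative:

```python
import numpy as np

def ubp_impute(X, rank=2, lr=0.01, epochs=500, seed=0):
    """Fill NaN entries of X by fitting latent row-vectors V and weights W
    so that V @ W matches the observed entries.  A single-layer (linear)
    sketch of the UBP idea; the paper trains a multi-layer perceptron,
    and all hyperparameters here are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = ~np.isnan(X)                       # True where a value was observed
    V = rng.normal(scale=0.1, size=(n, rank))
    W = rng.normal(scale=0.1, size=(rank, d))
    for _ in range(epochs):
        err = np.where(mask, V @ W - X, 0.0)  # error on observed cells only
        V, W = V - lr * (err @ W.T), W - lr * (V.T @ err)
    return np.where(mask, X, V @ W)           # keep observed values, fill the rest
```

For example, on a rank-1 matrix with one withheld cell, a rank-1 fit recovers the missing value from the structure of the observed entries.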
Ship machinery condition monitoring using vibration data through supervised learning
This paper aims to present an integrated methodology for the monitoring of marine machinery using vibration data. Monitoring of machinery is a crucial aspect of maintenance optimisation, required for vessel operation to remain sustainable and profitable. The proposed methodology trains models on pre-classified (healthy/faulty) data and then classifies new data points using the models developed. For this, vibration points are first acquired, appropriately processed and stored in a database. Specific features are then extracted from the data and stored. These data are then used to train supervised models pertinent to specific machinery components. Finally, new data are compared against the models developed in order to evaluate the machinery's condition. Together, these steps provide a flexible yet robust framework for the early detection of emerging machinery faults, leading to minimised ship downtime and an increase in the ship's operability and income through operational enhancement.
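The pipeline described above (acquire vibration signals, extract features, train a supervised model on healthy/faulty examples, classify new data) can be sketched as follows. The specific features (RMS, kurtosis, crest factor) and the nearest-centroid classifier are common illustrative choices, not the paper's actual models:

```python
import numpy as np

def extract_features(signal):
    """Time-domain features often used in vibration monitoring:
    RMS, kurtosis, and crest factor (the feature set is illustrative)."""
    x = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    kurt = np.mean((x - x.mean()) ** 4) / (x.var() ** 2 + 1e-12)
    crest = np.max(np.abs(x)) / (rms + 1e-12)
    return np.array([rms, kurt, crest])

class NearestCentroid:
    """Minimal supervised classifier: one centroid per condition label."""
    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        self.centroids_ = {c: np.mean([x for x, l in zip(X, y) if l == c], axis=0)
                           for c in self.labels_}
        return self

    def predict(self, X):
        return [min(self.labels_,
                    key=lambda c: np.linalg.norm(x - self.centroids_[c]))
                for x in X]
```

Impulsive faults raise kurtosis and crest factor sharply, so even this simple classifier separates a clean rotating-machinery signal from one with periodic impacts.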
K-Clustering Methods for Investigating Social-Environmental and Natural-Environmental Features Based on Air Quality Index
Air pollution has caused environmental and health hazards across the globe, particularly in emerging countries such as China. In this article, we propose the use of the air quality index and the development of advanced data processing, analysis, and visualization techniques based on an AI-based k-clustering method. We analyze air quality data based on seven key attributes and discuss the implications. Our results provide meaningful values and contributions to current research. Future work will include the use of advanced AI algorithms and big data techniques to ensure better performance, accuracy, and real-time checks.
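The article does not spell out its k-clustering algorithm; a standard Lloyd's k-means over air-quality attribute vectors is a reasonable minimal sketch of the idea (the data values in the usage below are invented):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()          # naive init: first k points (k-means++ is better)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # leave a centroid in place if its cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1), centroids
```

The same call applies unchanged to seven-attribute air-quality records; in practice the attributes would be standardized first and k chosen by, e.g., the elbow method.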
Differentially Private Data Generation with Missing Data
Despite several works that succeed in generating synthetic data with
differential privacy (DP) guarantees, they are inadequate for generating
high-quality synthetic data when the input data has missing values. In this
work, we formalize the problems of DP synthetic data with missing values and
propose three effective adaptive strategies that significantly improve the
utility of the synthetic data on four real-world datasets with different types
and levels of missing data and privacy requirements. We also identify the
relationship between privacy impact for the complete ground truth data and
incomplete data for these DP synthetic data generation algorithms. We model the
missing mechanisms as a sampling process to obtain tighter upper bounds for the
privacy guarantees to the ground truth data. Overall, this study contributes to
a better understanding of the challenges and opportunities for using private
synthetic data generation algorithms in the presence of missing data.
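The paper's adaptive strategies are more involved, but the basic tension it studies, differential privacy when some entries are missing, can be illustrated with a textbook Laplace-mechanism mean that must privatize both the sum and the count of observed entries. The function below is an illustrative sketch, not one of the paper's algorithms:

```python
import numpy as np

def dp_mean_with_missing(x, eps, lo=0.0, hi=1.0, seed=0):
    """Differentially private mean of a vector that may contain NaNs,
    via the Laplace mechanism.  Values are clipped to [lo, hi]; the
    privacy budget eps is split between a noisy sum and a noisy count
    of observed entries, since the number observed is itself private."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(x, dtype=float)
    obs = np.clip(obs[~np.isnan(obs)], lo, hi)
    # One record changes the clipped sum by at most (hi - lo) and the count by 1.
    noisy_sum = obs.sum() + rng.laplace(scale=(hi - lo) / (eps / 2))
    noisy_cnt = len(obs) + rng.laplace(scale=1.0 / (eps / 2))
    return noisy_sum / max(noisy_cnt, 1.0)
```

Note how missingness degrades utility for free: the noise scales are fixed by eps, while the signal shrinks with the number of observed entries.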
Physics-inspired Replica Approaches to Computer Science Problems
We study machine learning classification problems and combinatorial optimization problems using physics-inspired replica approaches. In the current work, we focus on the traveling salesman problem, one of the most famous problems in the entire field of combinatorial optimization. Our approach is specifically motivated by the desire to avoid trapping in metastable local minima, a common occurrence in hard problems with multiple extrema. Our method involves (i) coupling otherwise independent simulations of a system (“replicas”) via geometrical distances as well as (ii) probabilistic inference applied to the solutions found by individual replicas. In particular, we apply our method to the well-known “k-opt” algorithm and examine two particular cases, k = 2 and k = 3. With the aid of geometrical coupling alone, we are able to determine the optimum tour length on systems of up to 280 cities (an order of magnitude larger than the largest systems typically solved by the bare k = 3 opt). The probabilistic replica-based inference approach improves k-opt even further and determines the optimal solution of a problem with 318 cities. In this work, we also formulate a supervised machine learning algorithm for classification problems called the “Stochastic Replica Voting Machine” (SRVM). The method is based on representations of known data via multiple linear expansions in terms of various stochastic functions. The algorithm is developed, implemented, and applied to a binary and a 3-class classification problem in materials science. Here, we employ SRVM to predict candidate compounds capable of forming the cubic Perovskite structure and to further classify binary (AB) solids. We demonstrate that our SRVM method exceeds the well-known Support Vector Machine (SVM) in accuracy when predicting the cubic Perovskite structure.
The algorithm has also been tested on 8 diverse training data sets of various types and feature-space dimensions from the UCI Machine Learning Repository, and has been shown to consistently match or exceed the accuracy of existing algorithms while avoiding many of their pitfalls.
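The “bare” k-opt baseline that the replica method improves on can be sketched for k = 2: repeatedly reverse a segment of the tour whenever the reversal shortens it, until no improving reversal exists. A single uncoupled run (no replicas, no inference step) looks like this:

```python
import math
import random

def tour_length(tour, pts):
    """Total length of a closed tour over the given city coordinates."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(pts, seed=0):
    """Bare 2-opt local search from a random starting tour.  The paper
    couples many such runs ('replicas'); this is one independent run."""
    n = len(pts)
    rng = random.Random(seed)
    tour = list(range(n))
    rng.shuffle(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if a == d:                       # same edge pair via wrap-around
                    continue
                # Change in length from replacing edges (a,b),(c,d) by (a,c),(b,d).
                delta = (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                         - math.dist(pts[a], pts[b]) - math.dist(pts[c], pts[d]))
                if delta < -1e-9:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

Such a run only reaches a local minimum of the tour length; the geometrical coupling and replica voting described in the abstract are precisely what push the search past those metastable states.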