Practical approaches to principal component analysis in the presence of missing values
Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of the data which retain the maximal amount of variance. We study the case where some of the data values are missing, and show that this problem has many features usually associated with nonlinear models, such as overfitting and poor locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we introduce formulas for doing so. In the case of high-dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and of modeling the posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem.
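The core idea of handling missing values inside PCA can be sketched with a simple EM-like alternation: reconstruct the matrix at a fixed rank, re-fill only the missing entries, and repeat. This is a minimal illustration of the principle, not the paper's probabilistic or variational-Bayes algorithm (which also regularizes and models posterior variance); the function name and defaults are mine.

```python
import numpy as np

def pca_impute(X, rank, n_iter=50):
    """Fill missing entries (NaN) of X by alternating a rank-`rank` SVD
    reconstruction with re-imputation of only the missing cells.
    A simplified EM-style sketch, without the regularization the paper adds."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    # start from column means of the observed entries
    col_means = np.nanmean(X, axis=0)
    filled = np.where(mask, col_means, X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, recon, X)  # observed values stay fixed
    return filled
```

On sparse high-dimensional data this naive scheme overfits, which is exactly the failure mode that motivates the probabilistic treatment in the paper.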
Validation of nonlinear PCA
Linear principal component analysis (PCA) can be extended to a nonlinear PCA by using artificial neural networks. The benefit of curved components, however, requires careful control of the model complexity. Moreover, standard techniques for model selection, including cross-validation and, more generally, the use of an independent test set, fail when applied to nonlinear PCA because of its inherently unsupervised character. This paper presents a new approach for validating the complexity of nonlinear PCA models by using the error in missing-data estimation as a criterion for model selection. It is motivated by the idea that only the model of optimal complexity is able to predict missing values with the highest accuracy. While standard test-set validation usually favours over-fitted nonlinear PCA models, the proposed validation approach correctly selects the optimal model complexity.
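The validation criterion itself is easy to illustrate: hold out a random subset of entries, fit a model of each candidate complexity on the rest, and choose the complexity that predicts the held-out entries best. The sketch below uses linear PCA rank as the complexity knob purely for illustration; the paper applies the same criterion to neural-network-based nonlinear PCA. Function names and the holdout fraction are my choices.

```python
import numpy as np

def impute_rank_r(X, rank, n_iter=30):
    """Iterative rank-r SVD imputation of NaN entries (helper for the demo)."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, recon, X)
    return filled

def select_complexity(X, ranks, holdout_frac=0.1, seed=0):
    """Pick the model complexity (here: PCA rank) that best predicts
    deliberately held-out entries -- the missing-data validation idea."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    holdout = rng.random(X.shape) < holdout_frac
    X_train = np.where(holdout, np.nan, X)
    errs = []
    for r in ranks:
        recon = impute_rank_r(X_train, r)
        errs.append(float(np.mean((recon[holdout] - X[holdout]) ** 2)))
    return ranks[int(np.argmin(errs))], errs
```

Unlike ordinary test-set error, which keeps falling as complexity grows, the held-out-entry error turns back up once the model starts fitting noise, which is what makes it usable for unsupervised model selection.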
Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood
We discuss the problem of estimating the number of principal components in Principal Components Analysis (PCA). Despite the importance of the problem and the multitude of solutions proposed in the literature, it comes as a surprise that there does not exist a coherent asymptotic framework which would justify different approaches depending on the actual size of the data set. In this paper we address this issue by presenting an approximate Bayesian approach based on the Laplace approximation and introducing a general method for building model selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL). Our general framework encompasses a variety of existing approaches based on probabilistic models, such as the Bayesian Information Criterion for Probabilistic PCA (PPCA), and allows for the construction of new criteria, depending on the size of the data set at hand. Specifically, we define PESEL when the number of variables substantially exceeds the number of observations. We also report results of extensive simulation studies and real data analysis, which illustrate the good properties of our proposed criteria as compared to state-of-the-art methods and very recent proposals. In particular, these simulations show that PESEL-based criteria can be quite robust against deviations from the probabilistic model assumptions. Selected PESEL-based criteria for estimating the number of principal components are implemented in the R package varclust, which is available on GitHub (https://github.com/psobczyk/varclust).
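The family of criteria the abstract describes can be illustrated with the classical member it encompasses: a BIC-style penalized profile log-likelihood for probabilistic PCA. The sketch below is that standard criterion, not the PESEL derivation itself (which uses a Laplace-approximated semi-integrated likelihood and a regime-dependent penalty); the parameter count is the usual PPCA one.

```python
import numpy as np

def bic_ppca(X, k_max=None):
    """Select the number of principal components by a BIC-penalized PPCA
    profile log-likelihood (a simplified stand-in for PESEL-type criteria)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    k_max = k_max or d - 1
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
    scores = []
    for k in range(1, k_max + 1):
        sigma2 = lam[k:].mean()  # MLE of the noise variance
        # PPCA profile log-likelihood at the MLE
        ll = -0.5 * n * (np.sum(np.log(lam[:k])) + (d - k) * np.log(sigma2)
                         + d * np.log(2 * np.pi) + d)
        # free parameters: loadings (rotation removed) + eigenvalues + noise
        m = d * k - k * (k + 1) / 2 + k + 1
        scores.append(ll - 0.5 * m * np.log(n))
    return int(np.argmax(scores)) + 1
```

PESEL's contribution is to adapt the penalty to which of n and d is large; this fixed-penalty version is only appropriate when observations outnumber variables.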
Dealing with missing data for prognostic purposes
Centrifugal compressors are considered among the most critical components in the oil industry, making the minimization of their downtime and the maximization of their availability a major target. Maintenance is thought to be a key aspect towards achieving this goal, leading to various maintenance schemes being proposed over the years. Condition-based maintenance and prognostics and health management (CBM/PHM), which relies on the concepts of diagnostics and prognostics, has been gaining ground in recent years due to its ability to plan the maintenance schedule in advance. The successful application of this policy depends heavily on the quality of the data used, and a major issue affecting it is that of missing data. The presence of missing data may compromise the information contained within a data set and thus have a significant effect on the conclusions that can be drawn from it, as results may be biased or misleading. Consequently, it is important to address this matter. A number of methodologies to recover the data, called imputation techniques, have been proposed. This paper reviews the most widely used techniques and presents a case study using actual industrial centrifugal compressor data, in order to identify the most suitable ones.
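Two of the most widely used imputation techniques such reviews compare are column-mean imputation and k-nearest-neighbour imputation, and the usual evaluation mirrors the case-study setup: mask entries whose true values are known, impute, and score the error. The sketch below is a minimal NumPy version of both, with my own function names and simplifications (distances over jointly observed columns, no weighting), not the paper's specific implementations.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with its column mean over the observed entries."""
    col = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col, X)

def knn_impute(X, k=5):
    """Replace each NaN with the average of that column over the k nearest
    rows, with distances computed on jointly observed columns."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        diffs = X[:, obs] - X[i, obs]
        valid = ~np.isnan(diffs)
        d2 = np.where(valid, diffs, 0.0) ** 2
        # mean squared distance over columns observed in both rows
        dist = d2.sum(axis=1) / np.maximum(valid.sum(axis=1), 1)
        dist[i] = np.inf  # never use the row as its own donor
        for j in np.where(miss[i])[0]:
            donors = np.where(~miss[:, j])[0]  # rows where column j is observed
            nearest = donors[np.argsort(dist[donors])[:k]]
            out[i, j] = X[nearest, j].mean()
    return out
```

On data with correlated sensors, as in a compressor monitoring set, the neighbour-based estimate typically recovers masked values much better than the column mean, which is the kind of difference the case study is designed to expose.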
Digit Recognition Using Single Layer Neural Network with Principal Component Analysis
This paper presents an approach to digit recognition using a single-layer neural network classifier with Principal Component Analysis (PCA). Handwritten digit recognition is an important area of research, as many applications rely on handwriting recognition and it can also be applied to new applications. Many algorithms have been applied to this computer vision problem, and more are continuously being developed to classify handwritten digits more accurately with less computation. The model proposed in this paper aims to reduce the number of features, and thereby the computational requirements, while successfully classifying each digit into one of 10 categories (0 to 9). The designed system consists of a backpropagation (BP) neural network and is trained and tested on the MNIST dataset of handwritten digits. The proposed system obtained 98.39% accuracy on the MNIST 10,000-image test set. Principal Component Analysis (PCA) is used for feature extraction to curtail the computational and training time while maintaining high accuracy. We observed that the training time is reduced by up to 80%, depending on the number of principal components selected. We consider not only the accuracy, but also the training time, recognition time, and memory requirements of the entire process. Further, we identify the digits that were misclassified by the algorithm. Finally, we generate our own test dataset and predict its labels using this system.
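The PCA-then-single-layer-network pipeline can be sketched compactly: project the inputs onto the leading principal components, then train a softmax layer (a single-layer network) with gradient descent on the cross-entropy loss. This is an illustrative NumPy sketch on synthetic data under my own function names and hyperparameters, not the paper's MNIST system or its exact backpropagation setup.

```python
import numpy as np

def pca_fit(X, n_comp):
    """Return the data mean and the top n_comp principal directions."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_comp]

def train_softmax(Z, y, n_classes, lr=0.5, epochs=300):
    """Single-layer network (softmax regression) trained by full-batch
    gradient descent on the cross-entropy loss."""
    n, d = Z.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / n
        W -= lr * Z.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

With the dimensionality cut from hundreds of pixels to a few dozen components, both the per-epoch cost and the memory footprint of the layer shrink proportionally, which is the training-time saving the paper quantifies.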