
    Practical approaches to principal component analysis in the presence of missing values

    Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study the case where some of the data values are missing, and show that this problem has many features usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we introduce formulas for doing that. In the case of high-dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem.
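
    As a rough illustration of how missing entries can be handled inside PCA, the sketch below alternates between a low-rank reconstruction and re-imputation of the missing cells (an EM-style scheme). This is a minimal Python sketch, not the authors' fast or variational Bayesian algorithm; the function name pca_with_missing, the chosen rank, and the convergence tolerance are assumptions for illustration only.

    import numpy as np

    def pca_with_missing(X, n_components=2, n_iter=100, tol=1e-6):
        """EM-style PCA for a data matrix with NaNs: alternate between
        reconstructing missing entries from the current low-rank fit and
        refitting the principal components (a sketch, not the paper's
        variational Bayesian algorithm)."""
        X = np.asarray(X, dtype=float)
        missing = np.isnan(X)
        # start by filling missing entries with column means
        col_means = np.nanmean(X, axis=0)
        X_filled = np.where(missing, col_means, X)
        prev_err = np.inf
        for _ in range(n_iter):
            mean = X_filled.mean(axis=0)
            Xc = X_filled - mean
            # truncated SVD gives the rank-k PCA reconstruction
            U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
            recon = mean + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
            # impute missing entries from the current reconstruction
            X_filled[missing] = recon[missing]
            err = np.mean((X_filled - recon) ** 2)
            if abs(prev_err - err) < tol:
                break
            prev_err = err
        return X_filled, Vt[:n_components], mean

    # toy usage: a low-rank matrix with roughly 20% of its entries removed
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8))
    X[rng.random(X.shape) < 0.2] = np.nan
    X_hat, components, mean = pca_with_missing(X, n_components=2)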

    Validation of nonlinear PCA

    Linear principal component analysis (PCA) can be extended to a nonlinear PCA by using artificial neural networks. But the benefit of curved components requires careful control of the model complexity. Moreover, standard techniques for model selection, including cross-validation and, more generally, the use of an independent test set, fail when applied to nonlinear PCA because of its inherently unsupervised character. This paper presents a new approach for validating the complexity of nonlinear PCA models by using the error in missing data estimation as a criterion for model selection. It is motivated by the idea that only the model of optimal complexity is able to predict missing values with the highest accuracy. While standard test set validation usually favours over-fitted nonlinear PCA models, the proposed model validation approach correctly selects the optimal model complexity.
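
    The validation idea, hide some observed values, fit models of different complexity, and keep the one that predicts the hidden values best, can be illustrated with a small sketch. Linear PCA rank stands in here for the nonlinear model complexity studied in the paper, and the function name missing_data_validation, the hold-out fraction, and the crude mean fill-in are assumptions of this sketch, not the authors' procedure.

    import numpy as np

    def missing_data_validation(X, ranks, holdout_frac=0.1, seed=0):
        """Select model complexity (here: linear PCA rank, as a stand-in for
        the paper's nonlinear PCA) by the error in predicting artificially
        removed entries -- a sketch of the validation idea, not the original code."""
        rng = np.random.default_rng(seed)
        mask = rng.random(X.shape) < holdout_frac      # entries to hide
        col_means = X.mean(axis=0)
        X_train = np.where(mask, col_means, X)         # crude fill-in of hidden cells
        errors = {}
        for k in ranks:
            mean = X_train.mean(axis=0)
            U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
            recon = mean + (U[:, :k] * s[:k]) @ Vt[:k]
            # score only on the artificially hidden entries
            errors[k] = np.mean((recon[mask] - X[mask]) ** 2)
        return min(errors, key=errors.get), errors

    # toy usage: data with true rank 3 plus noise
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
    best_k, errs = missing_data_validation(X, ranks=range(1, 8))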

    Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood

    We discuss the problem of estimating the number of principal components in Principal Components Analysis (PCA). Despite the importance of the problem and the multitude of solutions proposed in the literature, it comes as a surprise that there does not exist a coherent asymptotic framework which would justify different approaches depending on the actual size of the data set. In this paper we address this issue by presenting an approximate Bayesian approach based on the Laplace approximation and introducing a general method for building model selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL). Our general framework encompasses a variety of existing approaches based on probabilistic models, such as the Bayesian Information Criterion for Probabilistic PCA (PPCA), and allows for the construction of new criteria, depending on the size of the data set at hand. Specifically, we define PESEL when the number of variables substantially exceeds the number of observations. We also report the results of extensive simulation studies and real data analysis, which illustrate the good properties of our proposed criteria as compared to state-of-the-art methods and very recent proposals. In particular, these simulations show that PESEL-based criteria can be quite robust against deviations from the probabilistic model assumptions. Selected PESEL-based criteria for the estimation of the number of principal components are implemented in the R package varclust, which is available on GitHub (https://github.com/psobczyk/varclust).
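
    To make the flavour of such penalized-likelihood criteria concrete, the sketch below implements the classical BIC for probabilistic PCA, which the abstract cites as a special case of the framework. The exact PESEL penalties are defined in the paper and in the varclust package; this code is only the PPCA/BIC baseline under the usual parameter-counting conventions, with the function name bic_ppca and the toy data chosen for illustration.

    import numpy as np

    def bic_ppca(X, k_max=None):
        """Choose the number of principal components with a BIC-type criterion
        for probabilistic PCA (Tipping & Bishop model). This is the classical
        baseline that PESEL generalises, not the PESEL formula itself."""
        n, d = X.shape
        k_max = k_max if k_max is not None else d - 1
        lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # eigenvalues, descending
        scores = {}
        for k in range(1, k_max + 1):
            sigma2 = lam[k:].mean()                      # MLE of the noise variance
            # maximised PPCA log-likelihood (up to the usual constants)
            loglik = -0.5 * n * (np.sum(np.log(lam[:k])) +
                                 (d - k) * np.log(sigma2) +
                                 d * np.log(2 * np.pi) + d)
            # free parameters: mean (d), loadings (d*k - k*(k-1)/2), noise variance (1)
            n_params = d + d * k - k * (k - 1) / 2 + 1
            scores[k] = loglik - 0.5 * n_params * np.log(n)
        return max(scores, key=scores.get), scores

    # toy usage: 5 informative directions in 30 dimensions
    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 30)) + 0.2 * rng.normal(size=(500, 30))
    k_hat, bic = bic_ppca(X, k_max=15)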

    Dealing with missing data for prognostic purposes

    Centrifugal compressors are considered among the most critical components in the oil industry, making the minimization of their downtime and the maximization of their availability a major target. Maintenance is thought to be a key aspect towards achieving this goal, leading to various maintenance schemes being proposed over the years. Condition based maintenance and prognostics and health management (CBM/PHM), which relies on the concepts of diagnostics and prognostics, has been gaining ground in recent years due to its ability to plan the maintenance schedule in advance. The successful application of this policy is heavily dependent on the quality of the data used, and a major issue affecting it is that of missing data. The presence of missing data may compromise the information contained within a data set, and thus have a significant effect on the conclusions that can be drawn from the data, as there might be bias or misleading results. Consequently, it is important to address this matter. A number of methodologies to recover the data, called imputation techniques, have been proposed. This paper reviews the most widely used techniques and presents a case study using actual industrial centrifugal compressor data, in order to identify the most suitable ones.
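
    A comparison of the kind described, hide a known fraction of entries, impute them with several standard techniques, and score the recovery error, might look like the sketch below. It uses scikit-learn's mean, k-NN, and iterative imputers on synthetic data, since the industrial compressor data is not available here; the function name compare_imputers and the parameter choices are illustrative assumptions, not the paper's setup.

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def compare_imputers(X_true, missing_frac=0.2, seed=0):
        """Hide a fraction of the entries, impute them with several standard
        techniques and report the RMSE on the hidden cells. A generic sketch
        of the kind of comparison described in the paper, on synthetic data."""
        rng = np.random.default_rng(seed)
        mask = rng.random(X_true.shape) < missing_frac
        X_missing = X_true.copy()
        X_missing[mask] = np.nan
        imputers = {
            "mean": SimpleImputer(strategy="mean"),
            "knn": KNNImputer(n_neighbors=5),
            "iterative": IterativeImputer(max_iter=20, random_state=seed),
        }
        results = {}
        for name, imp in imputers.items():
            X_hat = imp.fit_transform(X_missing)
            results[name] = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
        return results

    # toy usage: correlated sensor-like channels
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 12))
    print(compare_imputers(X))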

    Digit Recognition Using Single Layer Neural Network with Principal Component Analysis

    This paper presents an approach to digit recognition using a single-layer neural network classifier with Principal Component Analysis (PCA). Handwritten digit recognition is an important area of research, as many existing applications rely on handwriting recognition and it can also be applied to new applications. Many algorithms have been applied to this computer vision problem, and more are continuously being developed to classify digits more accurately with less computation. The model proposed in this paper aims to reduce the number of features, lowering the computational requirements while successfully classifying each digit into one of 10 categories (0 to 9). The system consists of a backpropagation (BP) neural network and is trained and tested on the MNIST dataset of handwritten digits. The proposed system obtained 98.39% accuracy on the MNIST 10,000-image test set. PCA is used for feature extraction to curtail the computational and training time while maintaining high accuracy. We observed that the training time is reduced by up to 80%, depending on the number of principal components selected. We consider not only the accuracy but also the training time, recognition time, and memory requirements of the entire process. Further, we identify the digits that were misclassified by the algorithm. Finally, we generate our own test dataset and predict its labels using this system.
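
    A pipeline of this general shape, PCA for dimensionality reduction followed by a single-layer, gradient-trained classifier, can be sketched as follows. The sketch uses scikit-learn's small 8x8 digits set as a stand-in for MNIST and multinomial logistic regression as the single-layer softmax network; the number of components and other settings are illustrative assumptions, and the 98.39% figure from the abstract refers to the paper's MNIST experiment, not to this sketch.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Load the small 8x8 digits set as a stand-in for MNIST (MNIST itself can
    # be fetched with fetch_openml("mnist_784"), which is slower to download).
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)

    # PCA reduces the input dimension before the classifier; 20 components is
    # an arbitrary illustration, not the paper's choice.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=20),
        # multinomial logistic regression acts as a single-layer softmax network
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))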