
    On methods for prediction based on complex data with missing values and robust principal component analysis

    Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and easier for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than its size; microarray data are a typical example. Very often practitioners are more concerned about the proper collection of the data than about performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. We also address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data.
Chapter \ref{Chapter1} describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm which can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. observations are either regular or outlying as a whole. Chapter \ref{Chapter2} introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets. We adapt our algorithm for the multivariate methods to fit coordinatewise least trimmed squares, so that it too can be computed quickly in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter \ref{Chapter3} extends these three methods to the functional data setting and shows that the extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter \ref{Chapter4} we give some concluding remarks on the robust principal components procedures discussed in Chapters \ref{Chapter1}, \ref{Chapter2} and \ref{Chapter3}. The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions we consider tree-based methods, a popular data mining technique that can handle data of different types and with missing values.
We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data, an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple-imputation ensembles.
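
The multiple-imputation ensemble idea above can be sketched in a few lines: impute the training data m times, fit a base learner to each completed copy, and average the predictions. Below is a minimal NumPy sketch, not the thesis's implementation: a hot-deck draw from observed column values stands in for a proper imputation model, and ordinary least squares stands in for the tree-based learners.

```python
import numpy as np

def impute_once(X, rng):
    """One stochastic imputation: fill each NaN with a draw from the
    observed entries of the same column (hot-deck style)."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X[miss, j] = rng.choice(obs, size=miss.sum())
    return X

def mi_ensemble_predict(X, y, x_new, m=25, seed=0):
    """Multiple-imputation ensemble: average the predictions of a simple
    least-squares fit over m independently imputed copies of the data."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(m):
        Xi = impute_once(X, rng)
        A = np.column_stack([np.ones(len(Xi)), Xi])  # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        preds.append(beta[0] + x_new @ beta[1:])
    return float(np.mean(preds))
```

Averaging over the m completed datasets both propagates the uncertainty about the missing cells and stabilises the fitted predictor, which is the intuition behind the better performance of multiple-imputation ensembles at larger missingness fractions.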

    Analysis of Missing Value Handling with the Robust Least Squares Estimation with Principal Component (RLSP) Method

    ABSTRACT: One way of handling missing values (MV) is imputation, the process of filling in missing values automatically using a specific algorithm. The imputation method used in this final project is Robust Least Squares Estimation with Principal Component (RLSP), which combines nearest-neighbour selection and regression to estimate missing values. The RLSP imputation process consists of three stages: selecting the k-nearest instances, performing Principal Component Analysis (PCA), and estimating the missing values from a median regression on the results. Based on the Normalized Root Mean Squared Error (NRMSE) and on the classification of the imputed data, the method is able to predict missing values close to their actual values. Its performance is influenced by the number of k-nearest instances and Principal Components (PCs). The optimal numbers of k-nearest instances and PCs are achieved when the data set has small variance and the records selected as estimators are highly similar to the record containing the missing value. The PCA step makes the method well suited to data of very high dimension. In addition, even when the data contain up to 10% outliers, the method is still able to predict missing values well. Keywords: missing value, Robust Least Squares Estimation with Principal Component (RLSP), outlier, Normalized Root Mean Squared Error (NRMSE)
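
The NRMSE criterion used above is simple to state: the root mean squared error over the (artificially removed) entries, normalised by the spread of the true values, so 0 means perfect recovery. A minimal sketch, assuming the common convention of normalising by the standard deviation of the actual values:

```python
import numpy as np

def nrmse(actual, imputed):
    """Normalized root mean squared error of imputed vs. actual values:
    RMSE divided by the standard deviation of the actual values."""
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    rmse = np.sqrt(np.mean((imputed - actual) ** 2))
    return float(rmse / np.std(actual))
```

Other conventions normalise by the range or the mean of the actual values instead; comparisons are only meaningful when all methods are scored with the same convention.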

    Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models

    For many least squares decomposition models, efficient algorithms are well known. A more difficult problem arises in decomposition models where each residual is weighted by a nonnegative value. A special case is principal components analysis with missing data. Kiers (1997) discusses an algorithm for minimizing weighted decomposition models by iterative majorization. In this paper, we propose a weighted majorization algorithm for computing a solution. We will show that the algorithm by Kiers is a special case of our algorithm. Here, we apply weighted majorization to weighted principal components analysis, robust Procrustes analysis, and logistic bi-additive models, of which the two-parameter logistic model in item response theory is a special case. Simulation studies show that weighted majorization is generally faster than the method by Kiers, by a factor of one to four, and obtains solutions of the same or better quality. For logistic bi-additive models, we propose a new iterative majorization algorithm called logistic majorization. Keywords: iterative majorization; IRT; logistic bi-additive model; robust Procrustes analysis; weighted principal component analysis; two-parameter logistic model
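
For the missing-data special case mentioned above (binary weights), the majorization step takes a particularly simple form: fill the unweighted cells with the current low-rank reconstruction, take a truncated SVD of the completed matrix, and repeat. A minimal sketch of that idea, for 0/1 weights only (the paper's general nonnegative-weight algorithm uses a different surrogate):

```python
import numpy as np

def weighted_pca_binary(X, W, rank=1, iters=500):
    """Rank-r fit minimizing sum_ij W_ij * (X_ij - Xhat_ij)^2 for 0/1
    weights W: iteratively complete the unweighted cells with the
    current reconstruction, then truncate the SVD."""
    Xhat = np.where(W > 0, X, 0.0)
    for _ in range(iters):
        Z = np.where(W > 0, X, Xhat)           # majorizing surrogate data
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Xhat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return Xhat
```

Each iteration solves an unweighted least squares decomposition (a plain truncated SVD) that majorizes the weighted objective, so the weighted loss decreases monotonically.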

    Robust imputation method for missing values in microarray data

    Abstract. Background: When analyzing microarray gene expression data, missing values are often encountered. Most multivariate statistical methods proposed for microarray data analysis cannot be applied when the data have missing values, and numerous imputation algorithms have been proposed to estimate them. In this study, we develop a robust least squares estimation with principal components (RLSP) method by extending the local least squares imputation (LLSimpute) method. The basic idea of our method is to employ quantile regression to estimate the missing values, using the estimated principal components of a selected set of similar genes. Results: Using the normalized root mean squared error, the performance of the proposed method was evaluated and compared with previously proposed imputation methods. The proposed RLSP method clearly outperformed the weighted k-nearest neighbors imputation (kNNimpute) method and the LLSimpute method, and showed results competitive with the Bayesian principal component analysis (BPCA) method. Conclusion: Adopting the principal components of the selected genes and employing the quantile regression model improved the robustness and accuracy of missing value imputation. Thus, the proposed RLSP method is, according to our empirical studies, more robust and accurate than the widely used kNNimpute and LLSimpute methods.
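
The three-step recipe described above (select similar genes, take their principal components, regress) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function and variable names are ours, and ordinary least squares stands in for the paper's quantile (median) regression step.

```python
import numpy as np

def rlsp_like_impute(X, i, j, k=3, n_pc=1):
    """Illustrative RLSP-style imputation of the missing entry X[i, j]:
    1) pick the k complete rows closest to row i on its observed columns,
    2) take the leading principal components of those neighbours,
    3) regress the neighbours' column j on their component scores,
    then evaluate the fit at the target row's component scores."""
    obs = ~np.isnan(X[i])                            # columns observed in row i
    complete = np.where(np.all(~np.isnan(X), axis=1))[0]
    d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
    nb = complete[np.argsort(d)[:k]]                 # k nearest complete rows
    A = X[nb][:, obs]
    mu = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mu, full_matrices=False)
    scores = (A - mu) @ Vt[:n_pc].T                  # neighbour PC scores
    t = (X[i, obs] - mu) @ Vt[:n_pc].T               # target row's PC scores
    D = np.column_stack([np.ones(k), scores])
    beta, *_ = np.linalg.lstsq(D, X[nb, j], rcond=None)
    return float(beta[0] + t @ beta[1:])
```

Regressing on a few principal components rather than on all observed columns is what keeps the estimate stable when the number of genes far exceeds the number of arrays; swapping the least squares step for median regression adds the robustness to outliers reported in the abstract.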

    Robust Principal Component Analysis on Graphs

    Principal Component Analysis (PCA) is the most widely used tool for linear dimensionality reduction and clustering. Still, it is highly sensitive to outliers and does not scale well with the number of data samples. Robust PCA solves the first issue with a sparse penalty term. The second issue can be handled with the matrix factorization model, which is however non-convex. Besides, PCA-based clustering can also be enhanced by using a graph of data similarity. In this article, we introduce a new model called "Robust PCA on Graphs" which incorporates spectral graph regularization into the Robust PCA framework. Our proposed model benefits from 1) the robustness of principal components to occlusions and missing values, 2) enhanced low-rank recovery, 3) improved clustering due to the graph smoothness assumption on the low-rank matrix, and 4) convexity of the resulting optimization problem. Extensive experiments on 8 benchmark, 3 video and 2 artificial datasets with corruptions clearly show that our model outperforms 10 other state-of-the-art models on clustering and low-rank recovery tasks.
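
The graph-smoothness regularizer that distinguishes this model from plain Robust PCA is a Laplacian quadratic form: writing Phi = D - W for the combinatorial Laplacian of the sample-similarity graph, the penalty tr(X^T Phi X) equals half the weighted sum of squared differences between connected samples, so it is small when similar samples get similar low-rank representations. A minimal sketch (the names are ours, not the paper's):

```python
import numpy as np

def graph_laplacian(W):
    """Combinatorial Laplacian Phi = D - W of a symmetric, nonnegative
    weight matrix W, where D is the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(X, W):
    """Graph regularizer tr(X^T Phi X) = 0.5 * sum_ij W_ij ||x_i - x_j||^2,
    where row x_i of X is the representation of sample i."""
    return float(np.trace(X.T @ graph_laplacian(W) @ X))
```

Because the term is a convex quadratic in X, adding it to the nuclear-norm-plus-sparsity objective of Robust PCA preserves the convexity highlighted in the abstract.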

    Reliable Eigenspectra for New Generation Surveys

    We present a novel technique to overcome the limitations on the applicability of Principal Component Analysis to typical real-life data sets, especially astronomical spectra. Our new approach addresses the issues of outliers, missing information, large numbers of dimensions and the vast amount of data by combining elements of robust statistics with recursive algorithms that improve the eigensystem estimates step by step. We develop a generic mechanism for deriving reliable eigenspectra without manual data censoring, while utilising all the information contained in the observations. We demonstrate the power of the methodology on the collection of VIMOS VLT Deep Survey spectra, which manifest most of today's challenges, and highlight the improvements over previous workarounds, as well as the scalability of our approach to collections the size of the Sloan Digital Sky Survey and beyond. Comment: 7 pages, 3 figures, accepted to MNRAS
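
One simple way to obtain robust eigenvector estimates without manual censoring, in the spirit described above though far simpler than the paper's recursive scheme, is iterative reweighting: downweight points far from a robustly estimated centre, then take the leading eigenvector of the weighted covariance. A small sketch with illustrative choices of weight function and constants (ours, not the authors'):

```python
import numpy as np

def robust_first_pc(X, iters=20, c=2.0):
    """Leading principal component with outliers downweighted:
    Huber-type weights based on distance to a robustly estimated
    centre, refined over a few iterations."""
    mu = np.median(X, axis=0)                      # robust initial centre
    w = np.ones(len(X))
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)
        s = np.median(d)                           # robust scale of distances
        w = np.minimum(1.0, (c * s / d) ** 2)      # cap influence of far points
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
    C = ((w[:, None] * (X - mu)).T @ (X - mu)) / w.sum()
    return np.linalg.eigh(C)[1][:, -1]             # eigenvector, largest eigenvalue
```

On data lying along one axis with a single gross outlier in another direction, classical PCA can latch onto the outlier direction while the reweighted estimate recovers the true one; all observations still contribute, just with bounded influence.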