Robust principal component analysis of electromagnetic arrays with missing data
We describe a new algorithm for robust principal component analysis (PCA) of electromagnetic (EM) array data, extending previously developed multivariate methods to include arrays with large data gaps, and only partial overlap between site occupations. Our approach is based on a criss-cross regression scheme in which polarization parameters and spatial modes are alternately estimated with robust regression procedures. The basic scheme can be viewed as an expectation robust (ER) algorithm, of the sort that has been widely discussed in the statistical literature in the context of robust PCA, but with details of the scheme tailored to the physical specifics of EM array observations. We have tested our algorithm with synthetic and real data, including data denial experiments where we have created artificial gaps, and compared results obtained with full and incomplete data arrays. These tests reveal that for modest amounts of missing data (up to 20 per cent or so) the algorithm performs well, reproducing essentially the same dominant spatial modes that would be obtained from analysis of the complete array. The algorithm thus makes multivariate analysis practical for the first time for large heterogeneous arrays, as we illustrate by application to two different EM arrays.
Keywords: Time series analysis, Geomagnetic induction, Magnetotellurics
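The alternating ("criss-cross") robust regression scheme described above can be sketched in a few lines of numpy. This is a generic ER-style illustration under simplifying assumptions (Huber weights, a few IRLS steps per update, NaN marking missing entries), not the authors' exact algorithm:

```python
import numpy as np

def huber_weights(r, c=1.345):
    # Huber weights: unit weight for small residuals, c/|r| beyond c robust scales
    s = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-8)
    a = np.abs(r) / s
    return np.where(a <= c, 1.0, c / a)

def robust_pca_missing(X, k=2, n_outer=50, n_irls=3):
    """Alternating ("criss-cross") robust regression sketch for rank-k PCA of an
    array X with missing entries (NaN): spatial modes V and coefficients U are
    re-estimated in turn by Huber-weighted regression on the observed entries."""
    n, p = X.shape
    obs = ~np.isnan(X)
    Xf = np.where(obs, X, 0.0)
    rng = np.random.default_rng(0)
    U = np.zeros((n, k))
    V = rng.standard_normal((p, k))
    for _ in range(n_outer):
        for i in range(n):                      # rows: update coefficients U
            m = obs[i]
            if m.sum() < k:
                continue
            w = np.ones(m.sum())
            for _ in range(n_irls):             # weighted normal equations
                A = V[m] * w[:, None]
                U[i] = np.linalg.lstsq(A.T @ V[m], A.T @ Xf[i, m], rcond=None)[0]
                w = huber_weights(Xf[i, m] - V[m] @ U[i])
        for j in range(p):                      # columns: update modes V
            m = obs[:, j]
            if m.sum() < k:
                continue
            w = np.ones(m.sum())
            for _ in range(n_irls):
                A = U[m] * w[:, None]
                V[j] = np.linalg.lstsq(A.T @ U[m], A.T @ Xf[m, j], rcond=None)[0]
                w = huber_weights(Xf[m, j] - U[m] @ V[j])
    return U, V
```

On synthetic low-rank data with modest gap fractions, the recovered factors closely reproduce the complete-data solution, mirroring the data-denial experiments described above.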
On methods for prediction based on complex data with missing values and robust principal component analysis
Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and easier for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than its size; think, for example, of microarray data.
Very often practitioners are more concerned about the proper collection of the data than about actually performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. We also want to address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm which can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. each observation as a whole is either regular or outlying. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets.
We adapted our algorithm for the multivariate methods to fit coordinatewise least trimmed squares so that it, too, can be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that the extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter 4 we give some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2 and 3. The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions we consider tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems.
Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles.
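As a rough illustration of the multiple-imputation-ensemble idea compared above, the sketch below imputes a training set several times, fits a simple learner to each completed copy, and averages the predictions. A plain least-squares regression stands in for the conditional inference trees studied in the thesis, and `mi_ensemble_predict` is an invented name:

```python
import numpy as np

def mi_ensemble_predict(Xtr, ytr, Xte, m=10, seed=0):
    """Multiple-imputation ensemble sketch: impute NaNs m times by sampling
    observed values per column, fit a least-squares model to each completed
    data set, and average the m predictions on the test rows."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(m):
        Xi, Xe = Xtr.copy(), Xte.copy()
        for j in range(Xtr.shape[1]):
            pool = Xtr[~np.isnan(Xtr[:, j]), j]    # observed values of column j
            for M in (Xi, Xe):
                miss = np.isnan(M[:, j])
                M[miss, j] = rng.choice(pool, miss.sum())
        A = np.c_[Xi, np.ones(len(Xi))]            # add an intercept column
        beta = np.linalg.lstsq(A, ytr, rcond=None)[0]
        preds.append(np.c_[Xe, np.ones(len(Xe))] @ beta)
    return np.mean(preds, axis=0)                  # ensemble average
```

Averaging over the m completed copies is what distinguishes this from single imputation: the prediction variance introduced by any one random fill-in is damped by the ensemble.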
Analysis of Missing Value Handling with the Robust Least Squares Estimation with Principal Component (RLSP) Method
ABSTRACT: One way of handling missing values (MV) is imputation, the process of filling in missing values automatically using a specific algorithm. The imputation method used in this final project is Robust Least Squares Estimation with Principal Component (RLSP), which combines nearest-neighbour and regression approaches to estimate missing values.
The RLSP imputation process comprises three stages: selecting the k-nearest instances, performing Principal Component Analysis (PCA), and estimating the missing values from the results of a median regression. Based on the Normalized Root Mean Squared Error (NRMSE) and on classification of the imputed data, the method is able to predict missing values close to their actual values. Its performance is influenced by the number of k-nearest instances and of Principal Components (PC); the optimum is reached when the data set has small variance and the records selected as estimators are highly similar to the record containing the missing value. The use of PCA makes the method well suited to data of very high dimension. Moreover, even when the data contain up to 10% outliers, the method is still able to predict missing values well.
Keywords: missing value, Robust Least Squares Estimation with Principal Component (RLSP), outlier, Normalized Root Mean Squared Error (NRMSE)
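An outline of the three RLSP stages (k-nearest instances, PCA, median-regression estimate) might look as follows in numpy. The helper names and the IRLS approximation to median regression are illustrative assumptions, not the project's exact implementation:

```python
import numpy as np

def l1_regress(A, b, n_iter=30):
    # median (L1) regression approximated by iteratively reweighted least squares
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(b - A @ x), 1e-6)
        Aw = A * w[:, None]
        x = np.linalg.lstsq(Aw.T @ A, Aw.T @ b, rcond=None)[0]
    return x

def rlsp_impute_row(X, i, k=10, n_pc=3):
    """RLSP-style imputation sketch for row i of X (NaN = missing):
    pick the k nearest complete rows, compute their principal components,
    median-regress the observed part of row i on them, and fill the gaps."""
    obs = ~np.isnan(X[i])
    complete = np.where(~np.isnan(X).any(axis=1))[0]
    # stage 1: k-nearest complete rows, measured on the observed coordinates
    d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
    nbrs = X[complete[np.argsort(d)[:k]]]
    # stage 2: principal components (rows of Vt) of the neighbour block
    mean = nbrs.mean(axis=0)
    _, _, Vt = np.linalg.svd(nbrs - mean, full_matrices=False)
    P = Vt[:n_pc]                                  # n_pc x p component basis
    # stage 3: median regression of the observed entries on the components
    coef = l1_regress(P[:, obs].T, X[i, obs] - mean[obs])
    est = mean + coef @ P
    out = X[i].copy()
    out[~obs] = est[~obs]                          # impute only the gaps
    return out
```

On low-rank data the imputed cell essentially reproduces the held-out value, which is the behaviour the NRMSE evaluation above measures.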
Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models
For many least-squares decomposition models, efficient algorithms are well known. A more difficult problem arises in decomposition models where each residual is weighted by a nonnegative value. A special case is principal components analysis with missing data. Kiers (1997) discusses an algorithm for minimizing weighted decomposition models by iterative majorization. In this paper, we propose a weighted majorization algorithm for computing a solution. We show that the algorithm by Kiers is a special case of our algorithm. We apply weighted majorization to weighted principal components analysis, robust Procrustes analysis, and logistic bi-additive models, of which the two-parameter logistic model in item response theory is a special case. Simulation studies show that weighted majorization is generally faster than the method by Kiers by a factor of one to four and obtains solutions of the same or better quality. For logistic bi-additive models, we propose a new iterative majorization algorithm called logistic majorization.
Keywords: iterative majorization; IRT; logistic bi-additive model; robust Procrustes analysis; weighted principal component analysis; two-parameter logistic model
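For intuition, the weighted least-squares decomposition problem and its majorization step can be sketched as below: each iteration "completes" the data with the current fit (weights scaled into [0, 1]) and takes a truncated SVD. This follows the spirit of Kiers' (1997) scheme rather than the paper's improved algorithm:

```python
import numpy as np

def weighted_pca_majorization(X, W, k=2, n_iter=300):
    """Majorization sketch for weighted least-squares PCA: minimise
    sum_ij W_ij (X_ij - F_ij)^2 over rank-k matrices F. Each step replaces X
    by a convex blend of data and current fit, then truncates its SVD."""
    Wn = W / W.max()                         # scale weights into [0, 1]
    U, s, Vt = np.linalg.svd(np.where(Wn > 0, X, 0.0), full_matrices=False)
    fit = (U[:, :k] * s[:k]) @ Vt[:k]        # initial rank-k fit
    for _ in range(n_iter):
        Z = Wn * X + (1 - Wn) * fit          # majorizing "completed" matrix
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        fit = (U[:, :k] * s[:k]) @ Vt[:k]    # truncated SVD of the surrogate
    return fit
```

With 0–1 weights this reduces to the familiar EM-style treatment of PCA with missing data, which is exactly the special case named in the abstract.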
Robust imputation method for missing values in microarray data
Background: When analyzing microarray gene expression data, missing values are often encountered. Most multivariate statistical methods proposed for microarray data analysis cannot be applied when the data have missing values. Numerous imputation algorithms have been proposed to estimate the missing values. In this study, we develop a robust least squares estimation with principal components (RLSP) method by extending the local least squares imputation (LLSimpute) method. The basic idea of our method is to employ quantile regression to estimate the missing values, using the estimated principal components of a selected set of similar genes.
Results: Using the normalized root mean squares error, the performance of the proposed method was evaluated and compared with other previously proposed imputation methods. The proposed RLSP method clearly outperformed the weighted k-nearest neighbors imputation (kNNimpute) method and LLSimpute method, and showed competitive results with the Bayesian principal component analysis (BPCA) method.
Conclusion: Adapting the principal components of the selected genes and employing the quantile regression model improved the robustness and accuracy of missing value imputation. Thus, the proposed RLSP method is, according to our empirical studies, more robust and accurate than the widely used kNNimpute and LLSimpute methods.
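The quantile-regression building block of RLSP can be approximated with a short IRLS routine. This numpy sketch minimises the pinball loss and is an illustrative stand-in, not the authors' implementation:

```python
import numpy as np

def quantile_regress(A, b, tau=0.5, n_iter=50):
    """Quantile regression sketch via iteratively reweighted least squares:
    approximately minimises the pinball loss sum_i rho_tau(b_i - A_i x)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        r = b - A @ x
        # IRLS weights for the pinball loss: tau on positive residuals,
        # (1 - tau) on negative ones, divided by |r| (floored for stability)
        w = np.where(r >= 0, tau, 1 - tau) / np.maximum(np.abs(r), 1e-6)
        Aw = A * w[:, None]
        x = np.linalg.lstsq(Aw.T @ A, Aw.T @ b, rcond=None)[0]
    return x
```

With tau = 0.5 this is median regression, which is what makes the imputation resistant to the outlying expression values mentioned in the abstract: a single gross outlier pulls a mean-based fit but barely moves the median fit.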
Robust Principal Component Analysis on Graphs
Principal Component Analysis (PCA) is the most widely used tool for linear
dimensionality reduction and clustering. Still it is highly sensitive to
outliers and does not scale well with respect to the number of data samples.
Robust PCA solves the first issue with a sparse penalty term. The second issue
can be handled with the matrix factorization model, which is however
non-convex. Besides, PCA based clustering can also be enhanced by using a graph
of data similarity. In this article, we introduce a new model called "Robust
PCA on Graphs" which incorporates spectral graph regularization into the Robust
PCA framework. Our proposed model benefits from 1) the robustness of principal
components to occlusions and missing values, 2) enhanced low-rank recovery, 3)
improved clustering property due to the graph smoothness assumption on the
low-rank matrix, and 4) convexity of the resulting optimization problem.
Extensive experiments on 8 benchmark, 3 video and 2 artificial datasets with
corruptions clearly reveal that our model outperforms 10 other state-of-the-art
models in its clustering and low-rank recovery tasks.
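Setting the graph term aside, the convex Robust PCA core that the model builds on (principal component pursuit) can be sketched with a standard ADMM loop. The parameter defaults follow common conventions and are assumptions, not the paper's settings:

```python
import numpy as np

def svt(M, t):
    # singular value thresholding: proximal operator of the nuclear norm
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def shrink(M, t):
    # soft thresholding: proximal operator of the l1 norm
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def robust_pca(X, lam=None, mu=None, n_iter=300):
    """ADMM sketch of convex Robust PCA (principal component pursuit):
    split X into low-rank L plus sparse S by minimising ||L||_* + lam ||S||_1
    subject to L + S = X. The graph-regularised model in the abstract adds a
    spectral smoothness term on L, omitted here for brevity."""
    m, n = X.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))        # standard PCP weight
    mu = mu or 0.25 * m * n / (np.abs(X).sum() + 1e-12)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                         # scaled dual variable
    for _ in range(n_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)        # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)     # sparse update
        Y += mu * (X - L - S)                    # dual ascent on the constraint
    return L, S
```

The convexity advertised in point 4) of the abstract is what guarantees this kind of splitting scheme converges to the global optimum, unlike the non-convex factorization alternative.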
Reliable Eigenspectra for New Generation Surveys
We present a novel technique to overcome the limitations of the applicability
of Principal Component Analysis to typical real-life data sets, especially
astronomical spectra. Our new approach addresses the issues of outliers,
missing information, large number of dimensions and the vast amount of data by
combining elements of robust statistics and recursive algorithms that provide
improved eigensystem estimates step-by-step. We develop a generic mechanism for
deriving reliable eigenspectra without manual data censoring, while utilising
all the information contained in the observations. We demonstrate the power of
the methodology on the attractive collection of the VIMOS VLT Deep Survey
spectra that manifest most of the challenges today, and highlight the
improvements over previous workarounds, as well as the scalability of our
approach to collections with sizes of the Sloan Digital Sky Survey and beyond.
Comment: 7 pages, 3 figures, accepted to MNRAS
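A generic version of the recursive, robust eigensystem idea: stream the spectra, downweight each one by its residual against the current basis (so outliers and corrupted pixels barely move the estimate), and refine the basis step by step. This is an illustrative Oja-style sketch under invented parameter choices, not the authors' algorithm:

```python
import numpy as np

def robust_recursive_basis(X, k=3, lr=0.05, c=2.0, epochs=5):
    """Recursive robust eigenbasis sketch: for each row of X, measure its
    residual outside the current k-dimensional basis, downweight outlying rows,
    apply an Oja-style gradient update, and re-orthogonalise."""
    d = X.shape[1]
    rng = np.random.default_rng(0)
    B, _ = np.linalg.qr(rng.standard_normal((d, k)))   # current orthonormal basis
    scale = np.median(np.linalg.norm(X, axis=1))       # rough residual scale
    for _ in range(epochs):
        for x in X:
            rn = np.linalg.norm(x - B @ (B.T @ x))     # residual outside subspace
            w = min(1.0, (c * scale / (rn + 1e-12)) ** 2)  # robust downweight
            scale = 0.99 * scale + 0.01 * rn           # track typical residuals
            B += lr * w * np.outer(x, B.T @ x)         # Oja update
            B, _ = np.linalg.qr(B)                     # keep columns orthonormal
    return B
```

Because each spectrum contributes in proportion to how well it is already explained, no manual censoring of bad spectra is needed, which is the property emphasised in the abstract.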