203,177 research outputs found

    Robust imputation method for missing values in microarray data

    Abstract. Background: When analyzing microarray gene expression data, missing values are often encountered. Most multivariate statistical methods proposed for microarray data analysis cannot be applied when the data have missing values, and numerous imputation algorithms have been proposed to estimate them. In this study, we develop a robust least squares estimation with principal components (RLSP) method by extending the local least squares imputation (LLSimpute) method. The basic idea of our method is to employ quantile regression to estimate the missing values, using the estimated principal components of a selected set of similar genes. Results: Using the normalized root mean squared error, the performance of the proposed method was evaluated and compared with previously proposed imputation methods. The proposed RLSP method clearly outperformed the weighted k-nearest neighbors imputation (kNNimpute) and LLSimpute methods, and showed results competitive with the Bayesian principal component analysis (BPCA) method. Conclusion: Adapting the principal components of the selected genes and employing the quantile regression model improved the robustness and accuracy of missing value imputation. Thus, the proposed RLSP method is, according to our empirical studies, more robust and accurate than the widely used kNNimpute and LLSimpute methods.
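
    As a rough illustration of the RLSP idea, the sketch below imputes one gene by (i) selecting the k most similar fully observed genes, (ii) taking principal components of their expression profiles, and (iii) fitting a median (0.5-quantile) regression of the target gene on those component scores. The function name and the settings k and q are illustrative assumptions, not the authors' exact choices.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import QuantileRegressor

    def impute_gene(X, g, k=10, q=3):
        """Estimate the missing entries of row g of a genes-by-samples matrix X."""
        target = X[g]
        miss = np.isnan(target)
        obs = ~miss
        # candidate neighbours: genes observed on every sample, excluding the target gene
        complete = np.where(~np.isnan(X).any(axis=1))[0]
        complete = complete[complete != g]
        # similarity measured by Euclidean distance over the target's observed columns
        d = np.linalg.norm(X[complete][:, obs] - target[obs], axis=1)
        neigh = X[complete[np.argsort(d)[:k]]]                 # k x n_samples
        # principal components of the neighbour profiles (samples play the role of observations)
        n_comp = min(q, neigh.shape[0], int(obs.sum()))
        pca = PCA(n_components=n_comp).fit(neigh[:, obs].T)
        Z_obs = pca.transform(neigh[:, obs].T)                 # scores at observed samples
        Z_mis = pca.transform(neigh[:, miss].T)                # scores at missing samples
        # median (0.5-quantile) regression of the target on the component scores
        qr = QuantileRegressor(quantile=0.5, alpha=0.0).fit(Z_obs, target[obs])
        filled = target.copy()
        filled[miss] = qr.predict(Z_mis)
        return filled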

    Robust Principal Component Analysis on Graphs

    Principal Component Analysis (PCA) is the most widely used tool for linear dimensionality reduction and clustering. Still, it is highly sensitive to outliers and does not scale well with respect to the number of data samples. Robust PCA solves the first issue with a sparse penalty term. The second issue can be handled with the matrix factorization model, which is, however, non-convex. Besides, PCA-based clustering can also be enhanced by using a graph of data similarity. In this article, we introduce a new model called "Robust PCA on Graphs" which incorporates spectral graph regularization into the Robust PCA framework. Our proposed model benefits from 1) the robustness of principal components to occlusions and missing values, 2) enhanced low-rank recovery, 3) improved clustering due to the graph smoothness assumption on the low-rank matrix, and 4) convexity of the resulting optimization problem. Extensive experiments on 8 benchmark, 3 video and 2 artificial datasets with corruptions clearly reveal that our model outperforms 10 other state-of-the-art models in its clustering and low-rank recovery tasks.
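
    One plausible way to write such a convex objective, consistent with the abstract (the paper's exact formulation may differ in details), combines the usual Robust PCA decomposition with a graph-smoothness term on the low-rank part:

    \[
    \min_{L,\,S}\; \|L\|_{*} \;+\; \lambda \|S\|_{1} \;+\; \gamma\, \operatorname{tr}\!\left( L\, \Phi\, L^{\top} \right)
    \quad \text{subject to} \quad X = L + S,
    \]

    where X is the data matrix, L its low-rank approximation, S the sparse outlier part, \|\cdot\|_{*} the nuclear norm, \|\cdot\|_{1} the elementwise l1 norm, and \Phi the Laplacian of the data-similarity graph; the trace term encourages the columns of L to vary smoothly over the graph.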

    On methods for prediction based on complex data with missing values and robust principal component analysis

    Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and more accessible for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few at the expense of lower-quality results. This type of data often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than its size (think, for example, of microarray data). Very often practitioners are more concerned with the proper collection of the data than with performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. We also want to address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm that can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. each observation is either regular or outlying as a whole. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets. We adapted our algorithm for the multivariate methods to fit coordinatewise least trimmed squares so that it can also be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that the extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter 4 we give some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2 and 3. The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions, we consider tree-based methods.
    Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles.
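
    As a concrete illustration of one of the strategies compared above, the minimal sketch below pairs a model-based single imputation with a random forest in scikit-learn. The dataset, imputer and forest settings are illustrative assumptions only; the thesis's experiments additionally use surrogate splits, conditional inference trees and multiple imputation, which are not sketched here.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import load_breast_cancer

    # load a complete dataset and knock out 20% of the feature values at random
    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(0)
    X = X.copy()
    X[rng.random(X.shape) < 0.2] = np.nan

    # impute first, then fit the tree ensemble; the imputer is refit inside each CV fold
    model = make_pipeline(
        IterativeImputer(random_state=0),
        RandomForestClassifier(n_estimators=200, random_state=0),
    )
    print(cross_val_score(model, X, y, cv=5).mean())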

    Machine Learning Developments in Dependency Modelling and Feature Extraction

    Three complementary feature extraction approaches are developed in this thesis to address the challenge of dimensionality reduction in the presence of multivariate heavy-tailed and asymmetric distributions. First, we demonstrate how to improve the robustness of standard Probabilistic Principal Component Analysis by adapting the concept of robust mean and covariance estimation within the standard framework. We then introduce feature extraction methods that extend standard Principal Component Analysis through distribution-based robustification. This is achieved via Probabilistic Principal Component Analysis (PPCA), for which new, statistically robust variants are derived that also treat missing data. We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which (1) accounts for asymmetric distributions of the observation data, (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, giving more flexibility to model groups of marginal tail dependence in the observation data, and (3) separates the tail effect of the error terms and the factors. The new feature extraction methods are derived in an incomplete-data setting to efficiently handle the presence of missing values in the observation vector, and we discuss the statistical properties of their robustness. In the next part of this thesis, we demonstrate the applicability of feature extraction methods to the statistical analysis of multidimensional dynamics. We introduce the class of Hybrid Factor models, which combine classical state-space model formulations with exogenous factors, and show how to use the information obtained from features extracted with the introduced robust PPCA in a meaningful and parsimonious modelling framework. In the first application study, we show the applicability of robust feature extraction methods in the real-data environment of financial markets and combine the obtained results with a stochastic multi-factor panel regression-based state-space model in order to model the dynamics of yield curves whilst incorporating regression factors. We embed the rank-reduced feature extractions into a stochastic representation of state-space models for yield curve dynamics and compare the results to classical multi-factor dynamic Nelson-Siegel state-space models. This leads to important new representations of yield curve models that can be of practical importance for addressing questions of financial stress testing and monetary policy interventions and that can efficiently incorporate financial big data. We illustrate our results on various financial and macroeconomic data sets from the Euro Zone and international markets. In the second study, we develop a multi-factor extension of the family of Lee-Carter stochastic mortality models. We build upon the time, period and cohort stochastic model structure to include exogenous observable demographic features that can be used as additional factors to improve model fit and forecasting accuracy.
    We develop a framework in which (a) we employ projection-based techniques of dimensionality reduction that are amenable to different structures of demographic data; (b) we analyse demographic data sets in terms of their patterns of missingness and the impact of such missingness on feature extraction; (c) we introduce a class of multi-factor stochastic mortality models incorporating time, period, cohort and demographic features, which we develop within a Bayesian state-space estimation framework; and finally (d) we develop an efficient combined Markov chain and filtering framework for sampling the posterior and forecasting. We undertake a detailed case study on the Human Mortality Database demographic data from European countries, and we use the extracted features to better explain the term structure of mortality in the UK over time for male and female populations. This is compared to a pure Lee-Carter stochastic mortality model, demonstrating that our feature extraction framework and the consequent multi-factor mortality model improve both in-sample fit and, importantly, out-of-sample mortality forecasts by a non-trivial gain in performance.
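
    A minimal sketch of the first idea above, i.e. robustifying a principal component computation by swapping in a robust mean and covariance estimate. Here the Minimum Covariance Determinant estimator from scikit-learn stands in for whichever robust estimator the thesis actually uses, and the data are synthetic; this illustrates the general recipe, not the thesis's robust PPCA.

    import numpy as np
    from sklearn.covariance import MinCovDet

    def robust_pca(X, n_components=2):
        """PCA with a robust centre/covariance (MCD) in place of the sample moments."""
        mcd = MinCovDet(random_state=0).fit(X)
        eigval, eigvec = np.linalg.eigh(mcd.covariance_)
        order = np.argsort(eigval)[::-1][:n_components]
        components = eigvec[:, order]                  # robust principal directions
        scores = (X - mcd.location_) @ components      # robust principal scores
        return components, scores

    # synthetic data with a handful of gross outliers that would distort classical PCA
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:10] += 15
    components, scores = robust_pca(X)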

    Robust regularized singular value decomposition with application to mortality data

    We develop a robust regularized singular value decomposition (RobRSVD) method for analyzing two-way functional data. The research is motivated by the application of modeling human mortality as a smooth two-way function of age group and year. The RobRSVD is formulated as a penalized loss minimization problem where a robust loss function is used to measure the reconstruction error of a low-rank matrix approximation of the data, and an appropriately defined two-way roughness penalty function is used to ensure smoothness along each of the two functional domains. By viewing the minimization problem as two conditional regularized robust regressions, we develop a fast iterative reweighted least squares algorithm to implement the method. Our implementation naturally incorporates missing values. Furthermore, our formulation allows rigorous derivation of leave-one-row/column-out cross-validation and generalized cross-validation criteria, which enable computationally efficient data-driven penalty parameter selection. The advantages of the new robust method over nonrobust ones are shown via extensive simulation studies and the mortality rate application. Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/13-AOAS649, http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
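
    To make the iterative reweighted least squares idea concrete, the hedged sketch below computes a robust rank-one approximation: Huber-type weights downweight cells with large residuals, missing cells get weight zero, and u and v are updated by alternating weighted least squares. The two-way roughness penalties and the exact robust loss of RobRSVD are omitted; this only illustrates the IRLS mechanics.

    import numpy as np

    def robust_rank1(X, c=1.345, n_iter=50):
        """Robust rank-one approximation of X (NaN = missing) via alternating IRLS."""
        W_obs = (~np.isnan(X)).astype(float)           # zero weight on missing cells
        Xf = np.nan_to_num(X)
        u = np.ones(X.shape[0])
        v = Xf.mean(axis=0)
        for _ in range(n_iter):
            R = Xf - np.outer(u, v)                    # residuals of the current fit
            s = np.median(np.abs(R[W_obs > 0])) / 0.6745 + 1e-12        # robust scale (MAD)
            W = W_obs * np.minimum(1.0, c * s / (np.abs(R) + 1e-12))    # Huber-type weights
            # weighted least squares update of u for fixed v, then of v for fixed u
            u = (W * Xf) @ v / ((W * np.square(v)).sum(axis=1) + 1e-12)
            v = (W * Xf).T @ u / ((W.T * np.square(u)).sum(axis=1) + 1e-12)
        return u, v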