
    Cellwise Robust M Regression

    The cellwise robust M (CRM) regression estimator is introduced as the first estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model and a vector of regression coefficients that is robust against vertical outliers and leverage points. As a by-product, the method yields a weighted and imputed data set containing estimates of what the values in outlying cells would have to be in order to fit the model. The method is shown to be as robust as its casewise counterpart, MM regression, while discarding less information than any casewise robust estimator; its predictive power can therefore be expected to be at least as good as that of casewise alternatives. These results are corroborated in a simulation study. While the simulations show that predictive performance is at least on par with casewise methods, if not better, an application to a data set of Swiss nutrient compositions shows that in individual cases CRM can achieve significantly higher predictive accuracy than MM regression.
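
    The following is a minimal conceptual sketch of the cellwise idea only, not the published CRM algorithm: alternate a rowwise-robust M fit with flagging and imputing of suspicious cells. The cell-flagging rule below is a crude stand-in for CRM's residual-attribution step, and all cutoffs are illustrative.

        # Conceptual sketch of cellwise-robust regression (not the published CRM method).
        import numpy as np
        from sklearn.linear_model import HuberRegressor

        def cellwise_robust_fit(X, y, n_iter=5, cutoff=3.0):
            col_med = np.median(X, axis=0)                              # robust column centers
            col_mad = 1.4826 * np.median(np.abs(X - col_med), axis=0)   # robust column scales
            col_mad[col_mad == 0] = 1.0
            X_imp, flagged = X.copy(), np.zeros(X.shape, dtype=bool)
            for _ in range(n_iter):
                model = HuberRegressor().fit(X_imp, y)                  # rowwise-robust M step
                resid = y - model.predict(X_imp)
                scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
                bad_rows = np.abs(resid) > cutoff * scale               # rows the fit cannot explain
                # crude stand-in for CRM's cell attribution: within a bad row,
                # flag cells far from their column's robust center
                z = np.abs(X - col_med) / col_mad
                flagged = bad_rows[:, None] & (z > cutoff)
                X_imp = np.where(flagged, col_med, X)                   # impute flagged cells
            return model, flagged, X_imp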

    Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

    Correct classification of breast cancer sub-types is of high importance, as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC), which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail in the presence of these outliers, prompting the need for robust statistics. Using robust sparse logistic regression, we obtain 36 relevant genes, of which about 60% have been previously reported as biologically relevant to TNBC, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for TNBC. Of these, JAM3, SFT2D2 and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between TNBC and non-TNBC data. The individual role of FOXA1 in TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC, stand out. Not only will our results contribute to the understanding of breast cancer, and TNBC in particular, and ultimately to its management; they also show that robust regression and outlier detection constitute key strategies for coping with high-dimensional clinical data such as omics data.
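
    As a hedged illustration of the trimming idea behind robust sparse logistic regression (a simplification in the spirit of trimmed estimators, not the paper's exact method), the sketch below fits an L1-penalized logistic model, drops the observations with the largest deviance contributions, and refits. The dropped cases play the role of candidate outliers such as possibly misdiagnosed patients, and the nonzero coefficients play the role of selected genes; the trimming fraction and penalty strength are illustrative.

        # Illustrative trimmed sparse logistic regression (not the authors' exact estimator).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def trimmed_sparse_logistic(X, y, keep_frac=0.85, n_iter=3, C=0.1):
            n = X.shape[0]
            idx = np.arange(n)                                          # start from all cases
            for _ in range(n_iter):
                clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
                clf.fit(X[idx], y[idx])
                p = np.clip(clf.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
                dev = -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))    # deviance contributions
                idx = np.argsort(dev)[: int(keep_frac * n)]             # keep best-fitting cases
            outliers = np.setdiff1d(np.arange(n), idx)                  # candidate outlying samples
            genes = np.flatnonzero(clf.coef_.ravel() != 0)              # selected features
            return clf, genes, outliers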

    Robust regression with compositional covariates including cellwise outliers

    We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The procedure is designed to be robust against both individual outlying cells in the data matrix (cellwise outliers) and entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates; afterwards, rowwise-robust compositional regression is performed to obtain the model coefficient estimates. Simulations show that the procedure generally outperforms a traditional rowwise-only robust regression method (the MM-estimator). Moreover, our procedure yields results better than or comparable to those of recently proposed cellwise robust regression methods (the shooting S-estimator and 3-step regression), while being preferable for interpretation through the use of appropriate coordinate systems for compositional data. An application to bio-environmental data reveals that, compared to other regression methods, the proposed procedure leads to conclusions that are best aligned with established scientific knowledge.
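
    To make the role of coordinate systems concrete, here is a hedged sketch that maps strictly positive compositions to pivot (ilr) coordinates and then applies an off-the-shelf rowwise-robust M fit; the cellwise filtering and imputation steps of the proposed procedure are omitted.

        # Sketch: regression on compositional covariates via pivot (ilr) coordinates,
        # followed by a rowwise-robust fit (cellwise filtering/imputation omitted).
        import numpy as np
        from sklearn.linear_model import HuberRegressor

        def pivot_coordinates(X):
            """Map positive compositions (n x D) to pivot coordinates (n x (D-1)):
            z_j = sqrt((D-j)/(D-j+1)) * ln(x_j / gmean(x_{j+1}, ..., x_D))."""
            n, D = X.shape
            L = np.log(X)
            Z = np.empty((n, D - 1))
            for i in range(D - 1):                        # i = j - 1 (0-based)
                tail_gmean_log = L[:, i + 1:].mean(axis=1)
                Z[:, i] = np.sqrt((D - i - 1) / (D - i)) * (L[:, i] - tail_gmean_log)
            return Z

        # usage: X_comp holds compositions (positive parts), y is the response
        # model = HuberRegressor().fit(pivot_coordinates(X_comp), y)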

    Robust high-dimensional precision matrix estimation

    The dependency structure of multivariate data can be analyzed using the covariance matrix Σ. In many fields the precision matrix Σ⁻¹ is even more informative. As the sample covariance estimator is singular in high dimensions, it cannot be used to obtain a precision matrix estimator. A popular high-dimensional estimator is the graphical lasso, but it lacks robustness. We consider the high-dimensional independent contamination model, in which even a small percentage of contaminated cells in the data matrix may lead to a high percentage of contaminated rows. Downweighting entire observations, as traditional robust procedures do, would then result in a loss of information. In this paper, we formally prove that replacing the sample covariance matrix in the graphical lasso with an elementwise robust covariance matrix leads to an elementwise robust, sparse precision matrix estimator that is computable in high dimensions. Examples of such elementwise robust covariance estimators are given. The final precision matrix estimator is positive definite, has a high breakdown point under elementwise contamination, and can be computed quickly.
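
    A minimal sketch of this plug-in recipe, assuming one particular elementwise robust covariance (Gaussian-rank correlations rescaled by MADs; the paper discusses several admissible choices), with an eigenvalue clip to make the plug-in matrix positive definite before the graphical lasso step:

        # Sketch: cellwise-robust sparse precision matrix via
        # "elementwise robust covariance + graphical lasso".
        import numpy as np
        from scipy.stats import norm, rankdata, median_abs_deviation
        from sklearn.covariance import graphical_lasso

        def robust_precision(X, alpha=0.1):
            n, p = X.shape
            # Gaussian-rank (normal-scores) correlation, insensitive to single bad cells
            scores = norm.ppf(rankdata(X, axis=0) / (n + 1))
            R = np.corrcoef(scores, rowvar=False)
            # rescale to a covariance using robust (MAD-based) marginal scales
            s = median_abs_deviation(X, axis=0, scale="normal")
            S = R * np.outer(s, s)
            # clip eigenvalues so the plug-in matrix is positive definite
            w, V = np.linalg.eigh(S)
            S_pd = (V * np.maximum(w, 1e-6)) @ V.T
            cov, prec = graphical_lasso(S_pd, alpha=alpha)   # sparse precision estimate
            return prec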

    On methods for prediction based on complex data with missing values and robust principal component analysis

    Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and more accessible for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values, and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. Such data often have a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than the sample size.

    Very often practitioners are more concerned about the proper collection of the data than about performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. Throughout, we address common complexities of real applications such as high-dimensional data, atypical data and missing values.

    More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter 1 describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm which can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers: observations are either regular or outlying. Chapter 2 introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets; a loose sketch of the cell-trimming idea is given below. We adapt our algorithm for the multivariate methods to fit coordinatewise least trimmed squares so that it can also be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter 3 extends these three methods to the functional data setting and shows that these extensions preserve the robust characteristics of the methods in the multivariate setting. Chapter 4 gives some concluding remarks on the robust principal components procedures discussed in Chapters 1, 2 and 3.
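
    As a loose illustration of the cell-trimming idea (a strong simplification, not the thesis's coordinatewise least trimmed squares algorithm), the sketch below fits a rank-1 principal component by alternating least squares while ignoring the worst-reconstructed cells, so that individual deviating cells cannot dominate the fit. The trimming fraction is illustrative.

        # Loose sketch of a cell-trimmed rank-1 PCA fit via alternating least squares.
        import numpy as np

        def trimmed_rank1_pca(X, keep_frac=0.8, n_iter=50):
            n, p = X.shape
            Xc = X - np.median(X, axis=0)                  # robust centering
            u, v = np.ones(n), np.ones(p)
            W = np.ones((n, p))                            # 1 = cell participates in the fit
            for _ in range(n_iter):
                # weighted least squares updates of scores u and loadings v
                u = (W * Xc) @ v / np.maximum((W * v**2).sum(axis=1), 1e-12)
                v = (W * Xc).T @ u / np.maximum((W.T * u**2).sum(axis=1), 1e-12)
                E2 = (Xc - np.outer(u, v)) ** 2            # squared cell residuals
                W = (E2 <= np.quantile(E2, keep_frac)).astype(float)  # trim worst cells
            return u, v / np.linalg.norm(v)
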
    The last chapter of the thesis covers prediction with missing values using tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data, an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple-imputation ensembles.
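
    A hedged sketch of the multiple-imputation ensemble idea, using scikit-learn stand-ins (IterativeImputer and random forests) rather than the conditional inference trees and surrogate splits studied in the thesis: impute the training data several times, fit one forest per completed data set, and average the predictions.

        # Sketch: multiple-imputation ensemble for prediction with missing values.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.ensemble import RandomForestRegressor

        def mi_forest_predict(X_train, y_train, X_test, n_imputations=5):
            preds = []
            for m in range(n_imputations):
                imp = IterativeImputer(sample_posterior=True, random_state=m)
                Xtr = imp.fit_transform(X_train)     # one completed training set
                Xte = imp.transform(X_test)          # impute test data consistently
                rf = RandomForestRegressor(n_estimators=200, random_state=m)
                preds.append(rf.fit(Xtr, y_train).predict(Xte))
            return np.mean(preds, axis=0)            # average over imputations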