Cellwise Robust M Regression
The cellwise robust M regression (CRM) estimator is introduced as the first
estimator of its kind that intrinsically yields both a map of cellwise outliers
consistent with the linear model and a vector of regression coefficients that
is robust against vertical outliers and leverage points. As a by-product, the
method yields a weighted and imputed data set containing estimates of what the
values in the cellwise outliers would need to be for the observations to fit
the model. The method is shown to be as robust as its casewise counterpart, MM
regression. Because the cellwise regression method discards less information
than any casewise robust estimator, its predictive power can be expected to be
at least as good as that of casewise alternatives. These results are
corroborated in a simulation study. Moreover, while the simulations show that
predictive performance is at least on par with casewise methods, if not
better, an application to a data set of Swiss nutrient compositions shows that
in individual cases CRM can achieve significantly higher predictive accuracy
than MM regression.
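The imputation by-product can be illustrated with a toy sketch (not the actual CRM algorithm): for a cell flagged as a cellwise outlier, the imputed value is the one that would make the observation fit the fitted linear model exactly. The function name and the example coefficients below are hypothetical.

```python
# Toy sketch of the CRM imputation by-product (not the CRM estimator
# itself): for a flagged predictor cell x[j], solve for the value that
# makes the observation consistent with the fitted linear model
# y = b0 + b1*x1 + ... + bp*xp.

def impute_flagged_cell(y, x, coefs, j):
    """Value for cell x[j] so that the observation fits the model exactly.

    coefs = (b0, b1, ..., bp); x is the predictor row; j indexes the
    flagged predictor (0-based).
    """
    b0, slopes = coefs[0], coefs[1:]
    # Contribution of the intercept and all non-flagged cells.
    partial = b0 + sum(b * v
                       for k, (b, v) in enumerate(zip(slopes, x)) if k != j)
    return (y - partial) / slopes[j]

# Example: model y = 1 + 2*x1 + 3*x2; the observed row (x1=10, x2=1)
# with y = 8 suggests x1 = 10 is a cellwise outlier.
print(impute_flagged_cell(8, [10, 1], (1, 2, 3), 0))  # 2.0
```

With the flagged cell replaced by 2.0, the row satisfies 1 + 2*2 + 3*1 = 8, i.e. it is consistent with the model.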
Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data
Correct classification of breast cancer subtypes is of high importance, as it
directly affects the therapeutic options. We focus on triple-negative breast
cancer (TNBC), which has the worst prognosis among breast cancer types. Using
cutting-edge methods from the field of robust statistics, we analyze Breast
Invasive Carcinoma (BRCA) transcriptomic data publicly available from The
Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical
outliers that may correspond to misdiagnosed patients. Furthermore, it is
illustrated that classical statistical methods may fail in the presence of
these outliers, prompting the need for robust statistics. Using robust sparse
logistic regression, we obtain 36 relevant genes, of which about 60% have been
previously reported as biologically relevant to TNBC, reinforcing the validity
of the method. The remaining 14 genes identified are new potential biomarkers
for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated with
breast tumors or other types of cancer. The relevance of these genes is
confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A
comparison of gene networks on the selected genes showed significant
differences between TNBC and non-TNBC data. The individual role of FOXA1 in
TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC, stand out.
Not only will our results contribute to the understanding of breast cancer and
TNBC, and ultimately to their management; they also show that robust
regression and outlier detection are key strategies for coping with
high-dimensional clinical data such as omics data.
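The failure mode of classical methods alluded to above can be seen in a deliberately minimal example (illustrative only, not the TCGA analysis): a single corrupted expression value distorts the sample mean, while a robust summary such as the median is barely affected.

```python
# Minimal illustration (not the TCGA analysis): one outlying value
# dominates the mean, while the median stays close to the bulk of the
# data. The numbers are made up for illustration.
from statistics import mean, median

clean = [2.1, 2.0, 1.9, 2.2, 2.0]
contaminated = clean[:-1] + [200.0]   # one corrupted measurement

print(round(mean(clean), 2), median(clean))                # 2.04 2.0
print(round(mean(contaminated), 2), median(contaminated))  # 41.64 2.1
```

Robust regression and outlier detection generalize this resistance from simple location summaries to high-dimensional models.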
Robust regression with compositional covariates including cellwise outliers
We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise robust compositional regression is performed to obtain model coefficient estimates. Simulations show that the procedure generally outperforms a traditional rowwise-only robust regression method (MM-estimator). Moreover, our procedure yields better or comparable results to recently proposed cellwise robust regression methods (shooting S-estimator, 3-step regression), while it is preferable for interpretation through the use of appropriate coordinate systems for compositional data. An application to bio-environmental data reveals that the proposed procedure, compared to other regression methods, leads to conclusions that are best aligned with established scientific knowledge.
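The "appropriate coordinate systems" mentioned above are log-ratio coordinates. A minimal sketch of one standard choice, pivot (isometric log-ratio, ilr) coordinates, assuming strictly positive parts; the function below is an illustration, not the paper's implementation:

```python
# Sketch of pivot (isometric log-ratio, ilr) coordinates for a
# composition of D strictly positive parts. Compositional regression
# is carried out on coordinates of this kind rather than on the raw
# parts, which carry only relative information.
from math import exp, log, sqrt

def ilr(x):
    """Map a D-part composition to D-1 real-valued pivot coordinates."""
    D = len(x)
    z = []
    for j in range(1, D):
        # Geometric mean of the first j parts, contrasted with part j+1.
        gm = exp(sum(log(v) for v in x[:j]) / j)
        z.append(sqrt(j / (j + 1)) * log(gm / x[j]))
    return z

# The coordinates are scale-invariant: rescaling the composition
# (e.g. raw amounts vs. percentages) leaves them unchanged.
a = ilr([0.2, 0.3, 0.5])
b = ilr([20.0, 30.0, 50.0])
print(a)
print(b)
```

Scale invariance is what makes conclusions independent of whether the composition is recorded in raw amounts, proportions, or percentages.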
Robust high-dimensional precision matrix estimation
The dependency structure of multivariate data can be analyzed using the
covariance matrix. In many fields its inverse, the precision matrix, is even
more informative. As the sample covariance estimator is singular in high
dimensions, it cannot be used to obtain a precision matrix estimator. A
popular high-dimensional estimator is the graphical lasso, but it lacks
robustness. We consider the high-dimensional independent contamination model.
Here, even a small percentage of contaminated cells in the data matrix may
lead to a high percentage of contaminated rows. Downweighting entire
observations, as traditional robust procedures do, would then result in a loss
of information. In this paper, we formally prove that replacing the sample
covariance matrix in the graphical lasso with an elementwise robust covariance
matrix leads to an elementwise robust, sparse precision matrix estimator
computable in high dimensions. Examples of such elementwise robust covariance
estimators are given. The final precision matrix estimator is positive
definite, has a high breakdown point under elementwise contamination, and can
be computed fast.
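One classical elementwise robust covariance of the kind referred to above is the Gnanadesikan-Kettenring pairwise estimator, which builds each covariance entry from robust scales of sums and differences of two columns. A minimal sketch with a MAD-based scale (the plug-in step into the graphical lasso, and the positive-definiteness repair, are only described in comments):

```python
# Sketch of an elementwise (pairwise) robust covariance in the spirit
# of Gnanadesikan-Kettenring: cov(X, Y) is recovered from robust
# scales of X+Y and X-Y, computed column pair by column pair, so a
# contaminated cell only affects entries involving its own column.
from statistics import median

def mad_scale(v):
    """MAD-based robust scale, rescaled for consistency at the normal."""
    m = median(v)
    return 1.4826 * median(abs(x - m) for x in v)

def gk_cov(x, y):
    """Pairwise robust covariance via the identity
    cov = (scale(x+y)^2 - scale(x-y)^2) / 4."""
    s_plus = mad_scale([a + b for a, b in zip(x, y)])
    s_minus = mad_scale([a - b for a, b in zip(x, y)])
    return (s_plus ** 2 - s_minus ** 2) / 4.0

# Positively related columns give a positive entry, and gk_cov(x, x)
# reduces to the squared robust scale of x.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
print(gk_cov(x, y) > 0.0)  # True

# The matrix assembled from such pairwise entries need not be positive
# definite; in practice it is repaired (e.g. by eigenvalue truncation)
# before being plugged into the graphical lasso in place of the sample
# covariance matrix.
```

Because each entry depends only on two columns, a contaminated cell cannot spoil the whole matrix, which is exactly the behavior needed under the independent contamination model.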
On methods for prediction based on complex data with missing values and robust principal component analysis
Massive volumes of data are currently being generated, and at astonishing speed. Technological advances are making it cheaper and more accessible for companies and institutions to obtain or generate large flows of data. These data can contain different types of complexities, such as unobserved values, illogical values and extreme observations, among many others. On the other hand, researchers sometimes face limitations in obtaining samples. For instance, it can be costly to grow an organism in a lab, so a researcher may prefer to grow just a few of them at the expense of lower-quality results. This type of data, microarray data for example, often has a large number of features measured on only a small number of observations, so that the dimension of the data is much larger than its size.
Very often practitioners are more concerned about the proper collection of the data than about actually performing a correct data analysis. In this work we discuss methods for two relevant steps in data analysis. We first look at methods for the exploratory step, where the practitioner wants to dig through the big flow of information to start understanding its structure and features. Next, we discuss methods for the statistical data analysis and focus on one of the most important tasks in this step: predicting an outcome. We also want to address common complexities of real applications such as high-dimensional data, atypical data and missing values. More specifically, this thesis starts by discussing methods for principal component analysis, one of the most popular exploratory tools. These methods are extensions of the classical principal components approach that are resistant to atypical data. Chapter \ref{Chapter1} describes the Multivariate S- and the Multivariate least trimmed squares estimators for principal components and proposes an algorithm which can yield more robust results and be computationally faster for high-dimensional problems than existing algorithms for these and other robust methods. We show that the corresponding functionals are Fisher-consistent at elliptical distributions. Moreover, we study the robustness properties of the Multivariate S-estimator by deriving its influence function. The Multivariate S- and the Multivariate least trimmed squares estimators, however, only target casewise outliers, i.e. observations are either regular or outlying. Chapter \ref{Chapter2} introduces a new method for principal components that is shown to be more powerful against outliers: the coordinatewise least trimmed squares estimator. In particular, our proposal can handle cellwise outliers, which are very common in modern high-dimensional datasets.
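The least trimmed squares idea underlying these estimators can be illustrated in its simplest, univariate location form (a sketch only, not the multivariate or coordinatewise algorithms of the thesis):

```python
# Univariate least trimmed squares (LTS) location: minimise the sum of
# the h smallest squared residuals, so up to n - h outlying values are
# ignored entirely. For the location problem the optimal h-subset is
# always a contiguous block of the sorted sample, so a scan suffices.
from statistics import mean

def lts_location(v, h):
    """Mean of the contiguous sorted h-subset with the smallest
    within-subset sum of squares."""
    s = sorted(v)
    best, best_loss = None, float("inf")
    for i in range(len(s) - h + 1):
        window = s[i:i + h]
        c = mean(window)
        loss = sum((x - c) ** 2 for x in window)
        if loss < best_loss:
            best, best_loss = c, loss
    return best

data = [1.0, 1.2, 0.9, 1.1, 1.0, 50.0, 55.0]   # two gross outliers
print(lts_location(data, 5))  # close to 1.0; the outliers are trimmed
```

The coordinatewise estimator of the thesis applies this trimming per coordinate, which is what lets it discard individual outlying cells rather than whole observations.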
We adapted our algorithm for the multivariate methods to fit coordinatewise least trimmed squares, so that it too can be computed faster in higher dimensions. In addition, we introduce the functional of the estimator, which can be shown to be Fisher-consistent at elliptical distributions. Chapter \ref{Chapter3} extends these three methods to the functional data setting and shows that these extensions preserve the robust characteristics of the methods in the multivariate setting. In Chapter \ref{Chapter4} we give some concluding remarks on the robust principal components procedures discussed in Chapters \ref{Chapter1}, \ref{Chapter2} and \ref{Chapter3}.
The last chapter of the thesis covers the topic of prediction with missing data values. To make predictions we consider tree-based methods. Trees are a popular data mining technique that allows one to make predictions on data of different types and with missing values. We compare the prediction performance of tree-based techniques when the available training data contain features with missing values. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Both classification and regression problems are considered. Overall, our results show that for smaller fractions of missing data, an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potential for better prediction performance of multiple imputation ensembles.
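The single-imputation route compared above can be sketched in its simplest form: fill each missing cell with its column median and then train any tree-based learner as usual. This is an illustration of the idea only, not the conditional-inference or multiple-imputation machinery evaluated in the thesis.

```python
# Sketch of single imputation for a feature matrix with missing
# entries (represented as None): each missing cell is replaced by the
# median of the observed values in its column, after which a
# tree-based learner can be trained on the completed matrix.
from statistics import median

def impute_median(rows):
    """Replace None cells by the median of the observed values in the
    same column. rows: list of equal-length lists."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

X = [[1.0, None], [2.0, 5.0], [None, 7.0], [4.0, 6.0]]
print(impute_median(X))  # the column medians 2.0 and 6.0 fill the gaps
```

Surrogate splits, by contrast, keep the missing values and route such observations down the tree using substitute splitting variables; the thesis compares both strategies across missingness fractions.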