
    Robust high-dimensional precision matrix estimation

    The dependency structure of multivariate data can be analyzed using the covariance matrix Σ. In many fields the precision matrix Σ^{-1} is even more informative. As the sample covariance estimator is singular in high dimensions, it cannot be inverted to obtain a precision matrix estimator. A popular high-dimensional estimator is the graphical lasso, but it lacks robustness. We consider the high-dimensional independent contamination model, in which even a small percentage of contaminated cells in the data matrix may lead to a high percentage of contaminated rows. Downweighting entire observations, as traditional robust procedures do, would then result in a loss of information. In this paper, we formally prove that replacing the sample covariance matrix in the graphical lasso with an elementwise robust covariance matrix yields an elementwise robust, sparse precision matrix estimator that is computable in high dimensions. Examples of such elementwise robust covariance estimators are given. The final precision matrix estimator is positive definite, has a high breakdown point under elementwise contamination, and can be computed fast.
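The plug-in idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact estimator: it uses the Spearman-based transform 2·sin(πr/6) as the elementwise robust correlation, projects it to the nearest positive semidefinite matrix, and feeds it to scikit-learn's graphical lasso. The function name and the regularization value `alpha` are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.covariance import graphical_lasso

def robust_glasso(X, alpha=0.2, eps=1e-6):
    # Elementwise robust correlation: Spearman's rho, transformed so it
    # is consistent for the Pearson correlation under Gaussianity.
    r_s, _ = spearmanr(X)
    R = 2.0 * np.sin(np.pi * r_s / 6.0)
    np.fill_diagonal(R, 1.0)
    # Pairwise estimates need not be PSD: clip negative eigenvalues.
    w, V = np.linalg.eigh((R + R.T) / 2.0)
    R_psd = V @ np.diag(np.clip(w, eps, None)) @ V.T
    # Sparse precision matrix via the graphical lasso.
    _, prec = graphical_lasso(R_psd, alpha=alpha)
    return prec

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=300)
X[:10, 0] = 50.0          # a few contaminated cells in one column
Omega = robust_glasso(X)
```

Because the robust correlation matrix is computed cell by cell, the contaminated cells in one column do not spoil the rows they sit in.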

    Robust and sparse estimation of high-dimensional precision matrices via bivariate outlier detection

    Robust estimation of Gaussian graphical models in the high-dimensional setting is becoming increasingly important, since large real-world datasets may contain outlying observations. These outliers can lead to drastically wrong inference on the intrinsic graph structure. Several procedures apply univariate transformations to make the data Gaussian distributed; however, these transformations do not work well in the presence of structural bivariate outliers. We propose a precision matrix estimator under the cellwise contamination mechanism that is robust against structural bivariate outliers. This estimator exploits robust pairwise weighted correlation coefficient estimates, where the weights are computed by the Mahalanobis distance with respect to an affine equivariant robust correlation coefficient estimator. We show that the convergence rate of the proposed estimator is the same as that of the correlation coefficient used to compute the Mahalanobis distance. We conduct numerical simulations under different contamination settings to compare the graph recovery performance of different robust estimators. Finally, the proposed method is applied to the classification of tumors using gene expression data. We show that our procedure can effectively recover the true graph under cellwise data contamination. Acknowledgements: the authors acknowledge financial support from the Spanish Ministry of Education and Science, research project MTM2013-44902-P.
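The core mechanism of the abstract (weights from a robust bivariate fit) can be sketched as follows. This is a hedged illustration, not the authors' exact estimator: it fits a bivariate Minimum Covariance Determinant, downweights points with a large robust Mahalanobis distance, and computes a weighted correlation. The cutoff quantile and the function name are illustrative.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def weighted_pairwise_corr(x, y, quantile=0.99):
    Z = np.column_stack([x, y])
    # Affine equivariant robust bivariate fit (MCD as a stand-in).
    mcd = MinCovDet(random_state=0).fit(Z)
    d2 = mcd.mahalanobis(Z)                  # squared robust distances
    # Hard weights: keep points inside the chi-squared(2) cutoff.
    w = (d2 <= chi2.ppf(quantile, df=2)).astype(float)
    m = np.average(Z, axis=0, weights=w)
    C = (w[:, None] * (Z - m)).T @ (Z - m) / w.sum()  # weighted covariance
    return C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

rng = np.random.default_rng(1)
x = rng.standard_normal(300)
y = 0.8 * x + 0.6 * rng.standard_normal(300)
x[:15], y[:15] = 8.0, -8.0                   # structural bivariate outliers
r_robust = weighted_pairwise_corr(x, y)
r_plain = np.corrcoef(x, y)[0, 1]
```

Structural bivariate outliers of this kind are invisible to univariate transformations, since each coordinate value is plausible on its own; only the joint (bivariate) fit flags them.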

    Robust and sparse estimation of large precision matrices

    The thesis considers the estimation of sparse precision matrices in the high-dimensional setting. First, we introduce an integrated approach to estimate undirected graphs and to perform model selection in high-dimensional Gaussian Graphical Models (GGMs). The approach is based on a parametrization of the inverse covariance matrix in terms of the prediction errors of the best linear predictor of each node in the graph. We exploit the relationship between partial correlation coefficients and the distribution of the prediction errors to propose a novel forward-backward algorithm for detecting pairs of variables having nonzero partial correlations among a large number of random variables, based on i.i.d. samples. We then establish asymptotic properties under mild conditions. Finally, numerical studies through simulation and real data examples provide evidence of the practical advantage of the procedure: the proposed approach outperforms state-of-the-art methods such as the graphical lasso and CLIME under different settings. Furthermore, we study the problem of robust estimation of GGMs in the high-dimensional setting when the data may contain outlying observations. We propose a precision matrix estimator under the cellwise contamination mechanism that is robust against structural bivariate outliers. This framework exploits robust pairwise weighted correlation coefficient estimates, where the weights are computed by the Mahalanobis distance with respect to an affine equivariant robust correlation coefficient estimator. We show that the convergence rate of the proposed estimator is the same as that of the correlation coefficient used to compute the Mahalanobis distance. We conduct numerical simulations under different contamination settings to compare the graph recovery performance of different robust estimators. The proposed method is then applied to the classification of tumors using gene expression data. We show that our procedure can effectively recover the true graph under cellwise data contamination. Programa Oficial de Doctorado en Economía de la Empresa y Métodos Cuantitativos. Committee: Chair, José Manuel Mira Mcwilliams; Secretary, Andrés Modesto Alonso Fernández; Member, José Ramón Berrendero Día
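The link between the precision matrix and partial correlations that the first part of the thesis builds on is the standard identity ρ_{ij·rest} = −Ω_{ij} / √(Ω_{ii} Ω_{jj}). A generic illustration of that identity (not the thesis's forward-backward algorithm):

```python
import numpy as np

def partial_correlations(Omega):
    # Standard identity: rho_{ij|rest} = -Omega_ij / sqrt(Omega_ii * Omega_jj)
    d = np.sqrt(np.diag(Omega))
    P = -Omega / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P

# Tridiagonal precision: only neighbouring pairs are partially correlated,
# so the implied graph is the chain 0 - 1 - 2.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
P = partial_correlations(Omega)
```

A zero off-diagonal entry of Ω is exactly a zero partial correlation, which is why estimating the support of the precision matrix is the same problem as selecting the graph.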

    The Cellwise Minimum Covariance Determinant Estimator

    The usual Minimum Covariance Determinant (MCD) estimator of a covariance matrix is robust against casewise outliers. These are cases (that is, rows of the data matrix) that behave differently from the majority of cases, raising suspicion that they might belong to a different population. Cellwise outliers, on the other hand, are individual outlying cells in the data matrix. When a row contains one or more outlying cells, the other cells in the same row still contain useful information that we wish to preserve. We propose a cellwise robust version of the MCD method, called cellMCD. Its main building blocks are an observed likelihood and a sparsity penalty on the number of flagged cellwise outliers. It possesses good breakdown properties. We construct a fast algorithm for cellMCD based on concentration steps (C-steps) that always lower the objective. The method performs well in simulations with cellwise outliers and has high finite-sample efficiency on clean data. It is illustrated on real data with visualizations of the results.
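A hedged sketch of a concentration step (C-step) for the classical, casewise MCD, the building block that cellMCD generalizes: each step refits on the current subset and then keeps the h cases closest under that fit, and this provably never increases the covariance determinant. This is not the cellMCD algorithm itself.

```python
import numpy as np

def c_step(X, subset, h):
    # Fit mean/covariance on the current h-subset, then keep the h cases
    # with the smallest Mahalanobis distance under that fit.
    mu = X[subset].mean(axis=0)
    S = np.cov(X[subset], rowvar=False)
    Sinv = np.linalg.inv(S)
    d2 = np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu)
    new_subset = np.argsort(d2)[:h]
    return new_subset, np.linalg.det(np.cov(X[new_subset], rowvar=False))

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
X[:10] += 6.0                                  # casewise outliers
h = 75
subset = rng.choice(100, size=h, replace=False)
dets = []
for _ in range(10):
    subset, det = c_step(X, subset, h)
    dets.append(det)                           # non-increasing sequence
```

In cellMCD the same concentration idea is applied at the level of cells rather than whole rows, so a row with one bad cell is not discarded wholesale.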

    Robust estimation and variable selection for cellwise contaminated data

    Outliers are widespread in real-world datasets, and recognizing them and running robust analyses remains a challenging topic. Recently, increased attention has been paid to cellwise outliers. In contrast to traditional rowwise (observation-wise) outliers, cellwise outliers affect individual cells within the observations of a dataset, so that only specific cells within each row may be contaminated. Several challenges need to be addressed in this field of research, such as outlier detection, robust covariance matrix estimation, and robust (sparse) regression. We introduce a Gaussian rank-based Lasso estimator, which uses the Gaussian rank correlation to obtain an initial empirical covariance matrix among the response and potential active predictors. We re-parameterise the design matrix and the response vector of the classical linear regression model to take advantage of these robustly estimated components before applying the adaptive Lasso to obtain consistent variable selection results. We also introduce the cellwise regularized Lasso, a regularized regression method that addresses cellwise outliers through a cellwise shrinkage procedure that shrinks outlying cells based on the magnitude of regression residuals and cell deviations.
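The Gaussian rank correlation used as the initial ingredient above is the ordinary correlation of the normal scores Φ⁻¹(rank/(n+1)). A minimal sketch of that standard estimator (the function name is illustrative, and this is only the first ingredient, not the full Lasso pipeline of the abstract):

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_rank_corr(x, y):
    # Correlation of the normal scores of the ranks; bounded influence,
    # so a few wild cells cannot dominate the estimate.
    n = len(x)
    zx = norm.ppf(rankdata(x) / (n + 1))
    zy = norm.ppf(rankdata(y) / (n + 1))
    return np.corrcoef(zx, zy)[0, 1]

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = 0.7 * x + np.sqrt(1 - 0.49) * rng.standard_normal(500)
x_c = x.copy()
x_c[:5] = 100.0                    # a few outlying cells in x only
r = gaussian_rank_corr(x_c, y)
```

Since ranks cap the leverage of any single cell, the estimate stays close to the true correlation despite the contaminated cells, which is what makes it a sensible initial covariance ingredient before the adaptive Lasso step.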

    Topics In Multivariate Statistics

    Multivariate statistics concerns the study of dependence relations among multiple variables of interest. Distinct from widely studied regression problems where one of the variables is singled out as a response, in multivariate analysis all variables are treated symmetrically and the dependency structures are examined, either for interest in their own right or for further analyses such as regressions. This thesis includes the study of three independent research problems in multivariate statistics. The first part of the thesis studies additive principal components (APCs for short), a nonlinear method useful for exploring additive relationships among a set of variables. We propose a shrinkage regularization approach for estimating APC transformations by casting the problem in the framework of reproducing kernel Hilbert spaces. To formulate the kernel APC problem, we introduce the Null Comparison Principle, a principle that ties the constraint in a multivariate problem to its criterion in a way that makes the goal of the multivariate method under study transparent. In addition to providing a detailed formulation and exposition of the kernel APC problem, we study the asymptotic theory of kernel APCs. Our theory also motivates an iterative algorithm for computing kernel APCs. The second part of the thesis investigates the estimation of precision matrices in high dimensions when the data is corrupted in a cellwise manner and the uncontaminated data follows a multivariate normal distribution. It is known that in the setting of Gaussian graphical models, the conditional independence relations among variables are captured by the precision matrix of a multivariate normal distribution, and estimating the support of the precision matrix is equivalent to graphical model selection. In this work, we analyze the theoretical properties of robust estimators for precision matrices in high dimensions.
The estimators we analyze are formed by plugging appropriately chosen robust covariance matrix estimators into the graphical Lasso and CLIME, two existing methods for high-dimensional precision matrix estimation. We establish error bounds for the precision matrix estimators that reveal the interplay between the dimensionality of the problem and the degree of contamination permitted in the observed distribution, and we also analyze the breakdown point of both estimators. We further discuss implications of our work for Gaussian graphical model estimation in the presence of cellwise contamination. The third part of the thesis studies the problem of optimal estimation of a quadratic functional under the Gaussian two-sequence model. Quadratic functional estimation has been well studied under the Gaussian sequence model, and close connections between the problem of quadratic functional estimation and that of signal detection have been noted. Focusing on the estimation problem in the Gaussian two-sequence model, we propose optimal estimators of the quadratic functional for different regimes and establish the minimax rates of convergence over a family of parameter spaces. The optimal rates exhibit an interesting phase transition within this family. We also discuss the implications of our estimation results for the associated simultaneous signal detection problem.
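For the quadratic functional Q(θ) = Σᵢ θᵢ², observed through yᵢ = θᵢ + zᵢ with zᵢ ~ N(0,1), the elementary baseline is the bias-corrected estimator Σᵢ (yᵢ² − 1). A hedged numerical sketch of that classical baseline (not the minimax-optimal procedure of the thesis):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
theta = np.zeros(n)
theta[:100] = 0.7                  # sparse signal, Q(theta) = 100 * 0.49 = 49
y = theta + rng.standard_normal(n)

q_naive = np.sum(y**2)             # E[sum y^2] = Q(theta) + n: biased upward
q_unbiased = np.sum(y**2 - 1.0)    # subtract the known noise variance
q_true = np.sum(theta**2)
```

The naive estimator overshoots by roughly n, while the bias-corrected one is unbiased with standard deviation of order √(2n + 4Q(θ)); the thesis's contribution is sharper, regime-dependent estimators and the matching minimax rates.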

    Convex Parameter Estimation of Perturbed Multivariate Generalized Gaussian Distributions

    The multivariate generalized Gaussian distribution (MGGD), also known as the multivariate exponential power (MEP) distribution, is widely used in signal and image processing. However, estimating MGGD parameters, which is required in practical applications, still faces specific theoretical challenges. In particular, establishing convergence properties for the standard fixed-point approach when both the distribution mean and the scatter (or the precision) matrix are unknown is still an open problem. In robust estimation, imposing classical constraints on the precision matrix, such as sparsity, has been limited by the non-convexity of the resulting cost function. This paper tackles these issues from an optimization viewpoint by proposing a convex formulation with well-established convergence properties. We embed our analysis in a noisy scenario where robustness is induced by modelling multiplicative perturbations. The resulting framework is flexible, as it combines a variety of regularizations for the precision matrix, the mean, and the model perturbations. This paper presents proofs of the desired theoretical properties, specifies the conditions preserving these properties for different regularization choices, and designs a general proximal primal-dual optimization strategy. The experiments show more accurate precision and covariance matrix estimation, with similar performance for the mean vector parameter, compared to Tyler's M-estimator. In a high-dimensional setting, the proposed method outperforms the classical GLASSO, one of its robust extensions, and the regularized Tyler's estimator.
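Tyler's M-estimator, the baseline the abstract compares against, is defined by a simple fixed-point iteration for the shape matrix of centered data. A hedged sketch of that plain iteration (not the paper's convex proximal primal-dual scheme; names and the trace normalization are illustrative):

```python
import numpy as np

def tyler_shape(X, n_iter=100, tol=1e-8):
    # Fixed point: S = (p/n) * sum_i x_i x_i^T / (x_i^T S^{-1} x_i),
    # for zero-mean data X, with the scale fixed by trace(S) = p.
    n, p = X.shape
    S = np.eye(p)
    for _ in range(n_iter):
        Sinv = np.linalg.inv(S)
        d2 = np.einsum('ij,jk,ik->i', X, Sinv, X)   # x_i' S^{-1} x_i
        S_new = (p / n) * (X.T * (1.0 / d2)) @ X
        S_new *= p / np.trace(S_new)                # fix the scale
        if np.linalg.norm(S_new - S) < tol:
            return S_new
        S = S_new
    return S

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=1000)
X *= rng.gamma(2.0, size=(1000, 1))    # heavy-tailed multiplicative scaling
S_hat = tyler_shape(X)
```

Because the weights 1/(xᵢᵀS⁻¹xᵢ) make the estimator invariant to per-sample scaling, the shape of Σ is recovered despite the multiplicative perturbations; its non-convex cost is precisely what motivates the paper's convex reformulation.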