Robust high-dimensional precision matrix estimation
The dependency structure of multivariate data can be analyzed using the
covariance matrix. In many fields the precision matrix, the inverse of the
covariance matrix, is even more informative. As the sample covariance estimator
is singular in high dimensions, it cannot be inverted to obtain a precision
matrix estimator. A
popular high-dimensional estimator is the graphical lasso, but it lacks
robustness. We consider the high-dimensional independent contamination model.
Here, even a small percentage of contaminated cells in the data matrix may lead
to a high percentage of contaminated rows. Downweighting entire observations,
which is done by traditional robust procedures, would then result in a loss of
information. In this paper, we formally prove that replacing the sample
covariance matrix in the graphical lasso with an elementwise robust covariance
matrix leads to an elementwise robust, sparse precision matrix estimator
computable in high dimensions. Examples of such elementwise robust covariance
estimators are given. The final precision matrix estimator is positive
definite, has a high breakdown point under elementwise contamination, and can
be computed quickly.
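The recipe the abstract proves correct is simple to state: estimate each covariance entry pairwise with a robust method, then hand the resulting matrix to the graphical lasso. Below is a minimal sketch of that pipeline, written by us rather than taken from the paper; it assumes scikit-learn's graphical_lasso and uses one plausible member of the elementwise robust family (sin-transformed Kendall's tau correlations with MAD scales).

```python
import numpy as np
from scipy.stats import kendalltau, median_abs_deviation
from sklearn.covariance import graphical_lasso

def elementwise_robust_cov(X):
    """One elementwise robust covariance: correlations from the sin-transform
    of Kendall's tau, scales from the (normal-consistent) MAD. Each entry
    depends only on a pair of columns, so a few contaminated cells cannot
    spoil the whole matrix."""
    n, p = X.shape
    scale = median_abs_deviation(X, axis=0, scale="normal")
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            R[i, j] = R[j, i] = np.sin(0.5 * np.pi * tau)
    return np.outer(scale, scale) * R

# Plug the robust covariance into the graphical lasso; per the paper, the
# resulting precision matrix estimator is positive definite even though the
# pairwise input need not be.
X = np.random.default_rng(0).normal(size=(200, 10))
cov_, precision_ = graphical_lasso(elementwise_robust_cov(X), alpha=0.1)
```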
Robust Orthogonal Complement Principal Component Analysis
Recently, the robustification of principal component analysis has attracted
considerable attention from statisticians, engineers, and computer scientists. In
this work we study a type of outlier that is not necessarily apparent in
the original observation space but can seriously affect the principal subspace
estimation. Based on a mathematical formulation of such transformed outliers, a
novel robust orthogonal complement principal component analysis (ROC-PCA) is
proposed. The framework combines the popular sparsity-enforcing and low rank
regularization techniques to deal with row-wise outliers as well as
element-wise outliers. A non-asymptotic oracle inequality guarantees the
accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle
the computational challenges, an efficient algorithm is developed on the basis
of Stiefel manifold optimization and iterative thresholding. Furthermore, a
batch variant is proposed to significantly reduce the computational cost in
ultra-high dimensions. The paper also points out a pitfall of the common
practice of SVD reduction in robust PCA. Experiments demonstrate the
effectiveness and efficiency of ROC-PCA on both synthetic and real data.
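To make the two computational ingredients concrete, here is a deliberately simplified alternating scheme we sketched in the spirit of the method: a soft-thresholding step for element-wise outliers and a gradient step pulled back onto the Stiefel manifold by a QR retraction. It is illustrative only; the actual ROC-PCA algorithm, with its low-rank regularization and row-wise outlier handling, is more elaborate.

```python
import numpy as np

def roc_pca_sketch(X, k, lam=0.1, eta=1e-3, n_iter=200):
    """Toy alternation for min ||X V - S||_F^2 + lam ||S||_1 s.t. V'V = I,
    where V spans the orthogonal complement of a k-dim principal subspace
    and S collects the sparse outlier effects in complement coordinates."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    V, _ = np.linalg.qr(rng.normal(size=(p, p - k)))   # random Stiefel start
    for _ in range(n_iter):
        T = X @ V
        # iterative thresholding: closed-form sparse update of S
        S = np.sign(T) * np.maximum(np.abs(T) - 0.5 * lam, 0.0)
        # Stiefel step: Euclidean gradient in V, then QR retraction
        G = 2.0 * X.T @ (X @ V - S)
        V, _ = np.linalg.qr(V - eta * G)
    return V, S   # principal subspace = orthogonal complement of span(V)
```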
Principal component analysis for big data
Big data is transforming our world, revolutionizing operations and analytics
everywhere, from financial engineering to biomedical sciences. The complexity
of big data often makes dimension reduction techniques necessary before
conducting statistical inference. Principal component analysis, commonly
referred to as PCA, has become an essential tool for multivariate data analysis
and unsupervised dimension reduction, the goal of which is to find a lower
dimensional subspace that captures most of the variation in the dataset. This
article provides an overview of methodological and theoretical developments of
PCA over the last decade, with focus on its applications to big data analytics.
We first review the mathematical formulation of PCA and its theoretical
development from the viewpoint of perturbation analysis. We then briefly
discuss the relationship between PCA and factor analysis, as well as its
applications to large covariance estimation and multiple testing. PCA also
finds important applications in many modern machine learning problems, and we
focus on community detection, ranking, mixture models, and manifold learning in
this paper.
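For reference, the formulation the review starts from is the familiar SVD computation: center the data and read the principal directions off the right singular vectors. A minimal version:

```python
import numpy as np

def pca(X, k):
    """Minimal PCA via SVD of the centered data matrix: returns the top-k
    principal directions, the scores, and the variance fractions captured."""
    Xc = X - X.mean(axis=0)                    # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k]                        # top-k principal directions
    scores = U[:, :k] * s[:k]                  # coordinates in the subspace
    explained = s[:k] ** 2 / np.sum(s ** 2)    # variation captured per direction
    return directions, scores, explained
```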
Robust Inference Under Heteroskedasticity via the Hadamard Estimator
Drawing statistical inferences from large datasets in a model-robust way is
an important problem in statistics and data science. In this paper, we propose
methods that are robust to large and unequal noise in different observational
units (i.e., heteroskedasticity) for statistical inference in linear
regression. We leverage the Hadamard estimator, which is unbiased for the
variances of ordinary least-squares regression. This is in contrast to the
popular White's sandwich estimator, which can be substantially biased in high
dimensions. We propose to estimate the signal strength, noise level,
signal-to-noise ratio, and mean squared error via the Hadamard estimator. We
develop a new degrees of freedom adjustment that gives more accurate confidence
intervals than variants of White's sandwich estimator. Moreover, we provide
conditions ensuring the estimator is well-defined, by studying a new random
matrix ensemble in which the entries of a random orthogonal projection matrix
are squared. We also show approximate normality, using the second-order
Poincaré inequality. Our work provides improved statistical theory and methods
for linear regression in high dimensions.
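The unbiasedness behind the method can be spelled out: with A = I - H the annihilator matrix, the OLS residuals satisfy r = Aε, so E[r_i^2] = Σ_j A_ij^2 σ_j^2, and solving a linear system with the Hadamard (elementwise) square of A recovers unbiased variance estimates, provided that square is invertible (the well-definedness condition mentioned above). Below is a minimal sketch under our reading of the abstract; all names are ours.

```python
import numpy as np

def hadamard_variance_estimates(X, y):
    """Sketch of the Hadamard estimator for per-observation noise variances
    in OLS: solve (A ∘ A) s = r^2 with A = I - H the annihilator matrix,
    then form a sandwich covariance for the coefficient estimates."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat (projection) matrix
    A = np.eye(n) - H                      # annihilator: r = A y = A eps
    r = A @ y                              # OLS residuals
    s = np.linalg.solve(A * A, r ** 2)     # requires A ∘ A to be invertible
    XtX_inv = np.linalg.inv(X.T @ X)
    cov_beta = XtX_inv @ X.T @ (s[:, None] * X) @ XtX_inv
    return s, cov_beta
```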