Robust high-dimensional precision matrix estimation
The dependency structure of multivariate data can be analyzed using the
covariance matrix. In many fields the precision matrix, the inverse of the
covariance matrix, is even more informative. As the sample covariance estimator
is singular in high dimensions, it cannot be inverted to obtain a precision
matrix estimator. A
popular high-dimensional estimator is the graphical lasso, but it lacks
robustness. We consider the high-dimensional independent contamination model.
Here, even a small percentage of contaminated cells in the data matrix may lead
to a high percentage of contaminated rows. Downweighting entire observations,
which is done by traditional robust procedures, would then result in a loss of
information. In this paper, we formally prove that replacing the sample
covariance matrix in the graphical lasso with an elementwise robust covariance
matrix leads to an elementwise robust, sparse precision matrix estimator
computable in high dimensions. Examples of such elementwise robust covariance
estimators are given. The final precision matrix estimator is positive
definite, has a high breakdown point under elementwise contamination, and can
be computed quickly.
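The recipe the abstract proves correct is simple to state: estimate each covariance entry pairwise with a robust method, then hand the resulting matrix to the graphical lasso. Below is a minimal sketch of that pipeline, written by us rather than taken from the paper; it assumes scikit-learn's graphical_lasso and uses one plausible member of the elementwise robust family (sin-transformed Kendall's tau correlations with MAD scales).

```python
import numpy as np
from scipy.stats import kendalltau, median_abs_deviation
from sklearn.covariance import graphical_lasso

def elementwise_robust_cov(X):
    """One elementwise robust covariance: correlations from the sin-transform
    of Kendall's tau, scales from the (normal-consistent) MAD. Each entry
    depends only on a pair of columns, so a few contaminated cells cannot
    spoil the whole matrix."""
    n, p = X.shape
    scale = median_abs_deviation(X, axis=0, scale="normal")
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            R[i, j] = R[j, i] = np.sin(0.5 * np.pi * tau)
    return np.outer(scale, scale) * R

# Plug the robust covariance into the graphical lasso; per the paper, the
# resulting precision matrix estimator is positive definite even though the
# pairwise input need not be.
X = np.random.default_rng(0).normal(size=(200, 10))
cov_, precision_ = graphical_lasso(elementwise_robust_cov(X), alpha=0.1)
```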
Robust Orthogonal Complement Principal Component Analysis
Recently, the robustification of principal component analysis has attracted
considerable attention from statisticians, engineers, and computer scientists. In
this work we study a type of outlier that is not necessarily apparent in
the original observation space but can seriously affect the principal subspace
estimation. Based on a mathematical formulation of such transformed outliers, a
novel robust orthogonal complement principal component analysis (ROC-PCA) is
proposed. The framework combines the popular sparsity-enforcing and low rank
regularization techniques to deal with row-wise outliers as well as
element-wise outliers. A non-asymptotic oracle inequality guarantees the
accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle
the computational challenges, an efficient algorithm is developed on the basis
of Stiefel manifold optimization and iterative thresholding. Furthermore, a
batch variant is proposed to significantly reduce the computational cost in
ultra-high dimensions. The paper also points out a pitfall of the common
practice of SVD reduction in robust PCA. Experiments demonstrate the
effectiveness and efficiency of ROC-PCA on both synthetic and real data.
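To make the two computational ingredients concrete, here is a deliberately simplified alternating scheme we sketched in the spirit of the method: a soft-thresholding step for element-wise outliers and a gradient step pulled back onto the Stiefel manifold by a QR retraction. It is illustrative only; the actual ROC-PCA algorithm, with its low-rank regularization and row-wise outlier handling, is more elaborate.

```python
import numpy as np

def roc_pca_sketch(X, k, lam=0.1, eta=1e-3, n_iter=200):
    """Toy alternation for min ||X V - S||_F^2 + lam ||S||_1 s.t. V'V = I,
    where V spans the orthogonal complement of a k-dim principal subspace
    and S collects the sparse outlier effects in complement coordinates."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    V, _ = np.linalg.qr(rng.normal(size=(p, p - k)))   # random Stiefel start
    for _ in range(n_iter):
        T = X @ V
        # iterative thresholding: closed-form sparse update of S
        S = np.sign(T) * np.maximum(np.abs(T) - 0.5 * lam, 0.0)
        # Stiefel step: Euclidean gradient in V, then QR retraction
        G = 2.0 * X.T @ (X @ V - S)
        V, _ = np.linalg.qr(V - eta * G)
    return V, S   # principal subspace = orthogonal complement of span(V)
```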
Principal component analysis for big data
Big data is transforming our world, revolutionizing operations and analytics
everywhere, from financial engineering to biomedical sciences. The complexity
of big data often makes dimension reduction techniques necessary before
conducting statistical inference. Principal component analysis, commonly
referred to as PCA, has become an essential tool for multivariate data analysis
and unsupervised dimension reduction, the goal of which is to find a lower
dimensional subspace that captures most of the variation in the dataset. This
article provides an overview of methodological and theoretical developments of
PCA over the last decade, with focus on its applications to big data analytics.
We first review the mathematical formulation of PCA and its theoretical
development from the viewpoint of perturbation analysis. We then briefly
discuss the relationship between PCA and factor analysis, as well as its
applications to large covariance estimation and multiple testing. PCA also
finds important applications in many modern machine learning problems, and we
focus on community detection, ranking, mixture models, and manifold learning in
this paper.
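For reference, the formulation the review starts from is the familiar SVD computation: center the data and read the principal directions off the right singular vectors. A minimal version:

```python
import numpy as np

def pca(X, k):
    """Minimal PCA via SVD of the centered data matrix: returns the top-k
    principal directions, the scores, and the variance fractions captured."""
    Xc = X - X.mean(axis=0)                    # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k]                        # top-k principal directions
    scores = U[:, :k] * s[:k]                  # coordinates in the subspace
    explained = s[:k] ** 2 / np.sum(s ** 2)    # variation captured per direction
    return directions, scores, explained
```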
Robust Inference Under Heteroskedasticity via the Hadamard Estimator
Drawing statistical inferences from large datasets in a model-robust way is
an important problem in statistics and data science. In this paper, we propose
methods that are robust to large and unequal noise in different observational
units (i.e., heteroskedasticity) for statistical inference in linear
regression. We leverage the Hadamard estimator, which is unbiased for the
variances of ordinary least-squares regression. This is in contrast to the
popular White's sandwich estimator, which can be substantially biased in high
dimensions. We propose to estimate the signal strength, noise level,
signal-to-noise ratio, and mean squared error via the Hadamard estimator. We
develop a new degrees of freedom adjustment that gives more accurate confidence
intervals than variants of White's sandwich estimator. Moreover, we provide
conditions ensuring the estimator is well-defined, by studying a new random
matrix ensemble in which the entries of a random orthogonal projection matrix
are squared. We also show approximate normality, using the second-order
Poincaré inequality. Our work provides improved statistical theory and methods
for linear regression in high dimensions.
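The unbiasedness behind the method can be spelled out: with A = I - H the annihilator matrix, the OLS residuals satisfy r = Aε, so E[r_i^2] = Σ_j A_ij^2 σ_j^2, and solving a linear system with the Hadamard (elementwise) square of A recovers unbiased variance estimates, provided that square is invertible (the well-definedness condition mentioned above). Below is a minimal sketch under our reading of the abstract; all names are ours.

```python
import numpy as np

def hadamard_variance_estimates(X, y):
    """Sketch of the Hadamard estimator for per-observation noise variances
    in OLS: solve (A ∘ A) s = r^2 with A = I - H the annihilator matrix,
    then form a sandwich covariance for the coefficient estimates."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat (projection) matrix
    A = np.eye(n) - H                      # annihilator: r = A y = A eps
    r = A @ y                              # OLS residuals
    s = np.linalg.solve(A * A, r ** 2)     # requires A ∘ A to be invertible
    XtX_inv = np.linalg.inv(X.T @ X)
    cov_beta = XtX_inv @ X.T @ (s[:, None] * X) @ XtX_inv
    return s, cov_beta
```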