Optimally Weighted PCA for High-Dimensional Heteroscedastic Data
Modern applications increasingly involve high-dimensional and heterogeneous
data, e.g., datasets formed by combining numerous measurements from myriad
sources. Principal Component Analysis (PCA) is a classical method for reducing
dimensionality by projecting such data onto a low-dimensional subspace
capturing most of their variation, but PCA does not robustly recover underlying
subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers
from treating all data samples as if they are equally informative. This paper
analyzes a weighted variant of PCA that accounts for heteroscedasticity by
giving samples with larger noise variance less influence. The analysis provides
expressions for the asymptotic recovery of underlying low-dimensional
components from samples with heteroscedastic noise in the high-dimensional
regime, i.e., for sample dimension on the order of the number of samples.
Surprisingly, it turns out that whitening the noise by using inverse noise
variance weights is suboptimal. We derive optimal weights, characterize the
performance of weighted PCA, and consider the problem of optimally collecting
samples under budget constraints.
Comment: 52 pages, 13 figures
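Concretely, the estimator analyzed here replaces the usual sample covariance
with a weighted one and takes its leading eigenvectors. Below is a minimal
Python sketch under that reading; the function name weighted_pca is mine, and
the inverse-variance weights in the example are only the familiar baseline that
the paper proves suboptimal (the optimal weights derived in the paper are not
reproduced here).

```python
import numpy as np

def weighted_pca(Y, w, k):
    """Top-k eigenvectors of the weighted sample covariance of the
    columns of Y (d x n); sample i is weighted by w[i]."""
    W = w / w.sum()                       # normalized per-sample weights
    C = (Y * W) @ Y.T                     # weighted covariance, d x d
    _, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    return evecs[:, -k:][:, ::-1]         # top-k principal directions

# Toy heteroscedastic example: two groups with different noise levels.
rng = np.random.default_rng(0)
d, n, k = 50, 400, 3
U = np.linalg.qr(rng.standard_normal((d, k)))[0]    # planted subspace
sigma = np.where(np.arange(n) < n // 2, 0.5, 2.0)   # per-sample noise std
Y = U @ rng.standard_normal((k, n)) + sigma * rng.standard_normal((d, n))
U_hat = weighted_pca(Y, 1.0 / sigma**2, k)          # inverse-variance weights
# Cosines of the principal angles between span(U) and span(U_hat); near 1 is good.
print(np.linalg.svd(U.T @ U_hat, compute_uv=False))
```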
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On the one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that cannot be detected with
small-scale data. On the other hand, the massive sample size and high
dimensionality of Big Data introduce unique computational and statistical
challenges, including scalability and storage bottlenecks, noise accumulation,
spurious correlation, incidental endogeneity, and measurement errors. These
challenges are distinctive and require new computational and statistical
paradigms. This article gives an overview of the salient features of Big Data
and of how these features drive a paradigm shift in statistical and
computational methods as well as in computing architectures. We also provide
various new perspectives on Big Data analysis and computation. In particular,
we emphasize the viability of the sparsest solution in a high-confidence set
and point out that the exogeneity assumptions in most statistical methods for
Big Data cannot be validated due to incidental endogeneity; they can lead to
wrong statistical inferences and, consequently, wrong scientific conclusions.
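The spurious-correlation point admits a quick illustration (a toy simulation,
not from the article): even when a response is independent of every feature,
the largest empirical correlation among p independent features grows with p,
so at Big-Data-scale dimensionality some variables look strongly associated
purely by chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                     # sample size
y = rng.standard_normal(n)                  # response, independent of everything
yc = y - y.mean()
for p in (10, 1_000, 100_000):              # number of irrelevant features
    X = rng.standard_normal((n, p))
    Xc = X - X.mean(axis=0)
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    print(f"p = {p:>6}: max |corr| = {np.abs(corr).max():.2f}")  # grows with p
```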
Robust computation of linear models by convex relaxation
Consider a dataset of vector-valued observations that consists of noisy
inliers, which are explained well by a low-dimensional subspace, along with
some number of outliers. This work describes a convex optimization problem,
called REAPER, that can reliably fit a low-dimensional model to this type of
data. This approach parameterizes linear subspaces using orthogonal projectors,
and it uses a relaxation of the set of orthogonal projectors to reach the
convex formulation. The paper provides an efficient algorithm for solving the
REAPER problem, and it documents numerical experiments which confirm that
REAPER can dependably find linear structure in synthetic and natural data. In
addition, when the inliers lie near a low-dimensional subspace, there is a
rigorous theory that describes when REAPER can approximate this subspace.
Comment: Formerly titled "Robust computation of linear models, or How to find
a needle in a haystack"
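The description above pins down the convex program: parameterize a
d-dimensional subspace by its orthogonal projector P and relax the non-convex
set of rank-d projectors to the symmetric matrices with eigenvalues in [0, 1]
and trace d. A compact sketch with the generic solver cvxpy follows; the
function name is mine, and this uses an off-the-shelf SDP solver rather than
the efficient algorithm the paper provides, so it is only practical for small
ambient dimension.

```python
import cvxpy as cp
import numpy as np

def reaper_subspace(X, d):
    """Fit a d-dimensional subspace to the columns of X (D x n) via the
    convex relaxation {P symmetric : 0 <= P <= I, tr P = d} of the set
    of rank-d orthogonal projectors."""
    D = X.shape[0]
    P = cp.Variable((D, D), symmetric=True)
    cost = cp.sum(cp.norm(X - P @ X, 2, axis=0))   # sum of ||x_i - P x_i||
    constraints = [P >> 0, np.eye(D) - P >> 0, cp.trace(P) == d]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    # Round the relaxed solution back to a genuine projector:
    # take the span of the top-d eigenvectors of the optimal P.
    _, evecs = np.linalg.eigh(P.value)
    return evecs[:, -d:]
```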
Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood
We discuss the problem of estimating the number of principal components in
Principal Components Analysis (PCA). Despite the importance of the problem and
the multitude of solutions proposed in the literature, it comes as a surprise
that there is no coherent asymptotic framework that would justify different
approaches depending on the actual size of the data set. In this paper we
address this issue by presenting an approximate Bayesian approach based on the
Laplace approximation and by introducing a general method for building model
selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL). Our
general framework encompasses a variety of existing approaches based on
probabilistic models, such as the Bayesian Information Criterion for
Probabilistic PCA (PPCA), and allows for the construction of new criteria,
depending on the size of the data set at hand. Specifically, we define PESEL
when the number of variables substantially exceeds the number of observations.
We also report results of extensive simulation studies and real data analyses,
which illustrate the good properties of our proposed criteria compared to
state-of-the-art methods and very recent proposals. In particular, these
simulations show that PESEL-based criteria can be quite robust against
deviations from the probabilistic model assumptions. Selected PESEL-based
criteria for estimating the number of principal components are implemented in
the R package varclust, which is available on GitHub
(https://github.com/psobczyk/varclust).
Comment: 31 pages, 7 figures
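To make the "BIC for Probabilistic PCA" special case mentioned above concrete,
here is a sketch of that criterion built on the Tipping-Bishop
maximum-likelihood solution of PPCA. This is one member of the family that
PESEL encompasses, not the PESEL formula itself; it assumes more observations
than variables, precisely the regime PESEL is designed to go beyond, and the
parameter count shown is one common convention.

```python
import numpy as np

def bic_ppca(X):
    """BIC-type choice of the number of principal components under
    probabilistic PCA (Tipping & Bishop ML solution); assumes n > d."""
    n, d = X.shape
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
    bic = []
    for k in range(1, d):
        sigma2 = lam[k:].mean()                    # ML estimate of noise variance
        loglik = -0.5 * n * (np.log(lam[:k]).sum() + (d - k) * np.log(sigma2)
                             + d * np.log(2 * np.pi) + d)
        m = d * k - k * (k - 1) // 2 + 1 + d       # subspace + noise var + mean
        bic.append(-2 * loglik + m * np.log(n))
    return int(np.argmin(bic)) + 1

# Example: 3 strong components in 20 dimensions.
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 3)) @ (5 * rng.standard_normal((3, 20))) \
    + rng.standard_normal((500, 20))
print(bic_ppca(X))    # typically prints 3
```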
OptShrink: An algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage
The truncated singular value decomposition (SVD) of the measurement matrix is
the optimal solution to the representation problem of how to best approximate
a noisy measurement matrix using a low-rank matrix. Here, we consider the
(unobservable) denoising problem of how to best approximate a low-rank signal
matrix buried in noise by optimal (re)weighting of the singular vectors of the
measurement matrix. We exploit recent results from random matrix theory to
exactly characterize the large-matrix limit of the optimal weighting
coefficients and show that they can be computed directly from data for a large
class of noise models that includes the i.i.d. Gaussian noise case.
Our analysis brings into sharp focus the shrinkage-and-thresholding form of
the optimal weights and the non-convex nature of the associated shrinkage
function (on the singular values), and it explains why matrix regularization
via singular value thresholding with convex penalty functions (such as the
nuclear norm) will always be suboptimal. We validate our theoretical
predictions with numerical simulations, develop an implementable algorithm
(OptShrink) that realizes the predicted performance gains, and show how our
methods can be used to improve estimation in the setting where the measured
matrix has missing entries.
Comment: Published version. The algorithm can be downloaded from
http://www.eecs.umich.edu/~rajnrao/optshrin
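The denoising problem above has a clean oracle answer that clarifies what
OptShrink estimates: holding the noisy singular vectors u_i, v_i fixed, the
Frobenius-optimal weights are w_i = u_i' S v_i, and they come out strictly
smaller than the observed singular values, hence shrinkage rather than
truncation. The sketch below (an illustration, not a reimplementation of
OptShrink's data-driven weight computation) compares these oracle weights with
plain truncation on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 200, 400, 5
# Rank-5 signal buried in i.i.d. Gaussian noise.
S = (rng.standard_normal((m, r)) * [20, 15, 10, 6, 4]) \
    @ rng.standard_normal((r, n)) / np.sqrt(n)
X = S + rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Ur, Vr = U[:, :r], Vt[:r]

# Oracle weights w_i = u_i' S v_i: Frobenius-optimal reweighting of the
# noisy singular vectors. Note w < s, i.e., shrinkage, not truncation.
w = np.einsum('ik,ij,jk->k', Ur, S, Vr.T)
print("observed:", np.round(s[:r], 1), " oracle:", np.round(w, 1))

tsvd = (Ur * s[:r]) @ Vr       # truncated SVD (representation-optimal)
orac = (Ur * w) @ Vr           # oracle reweighting (denoising-optimal)
print("error, truncated SVD :", np.linalg.norm(tsvd - S))
print("error, oracle weights:", np.linalg.norm(orac - S))
```

Per the abstract, OptShrink's contribution is that consistent estimates of
these oracle weights can be computed from the measurement matrix alone.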