The Masked Sample Covariance Estimator: An Analysis via Matrix Concentration Inequalities
Covariance estimation becomes challenging in the regime where the number p of
variables outstrips the number n of samples available to construct the
estimate. One way to circumvent this problem is to assume that the covariance
matrix is nearly sparse and to focus on estimating only the significant
entries. To analyze this approach, Levina and Vershynin (2011) introduce a
formalism called masked covariance estimation, where each entry of the sample
covariance estimator is reweighted to reflect an a priori assessment of its
importance. This paper provides a short analysis of the masked sample
covariance estimator by means of a matrix concentration inequality. The main
result applies to general distributions with at least four moments. Specialized
to the case of a Gaussian distribution, the theory offers qualitative
improvements over earlier work. For example, the new results show that n = O(B
log^2 p) samples suffice to estimate a banded covariance matrix with bandwidth
B up to a relative spectral-norm error, in contrast to the sample complexity n
= O(B log^5 p) obtained by Levina and Vershynin.
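To make the masked-estimation formalism concrete, here is a minimal NumPy sketch of a masked sample covariance estimator with a banded 0/1 mask; the function names and the specific choice of mask are illustrative, not taken from the paper.

import numpy as np

def masked_sample_covariance(X, mask):
    """Entrywise-reweighted (masked) sample covariance.

    X    : (n, p) array of n samples of a p-dimensional vector.
    mask : (p, p) symmetric weight matrix; entry (i, j) encodes the
           a priori importance assigned to covariance entry (i, j).
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # center the samples
    S = Xc.T @ Xc / n                  # ordinary sample covariance
    return mask * S                    # Hadamard (entrywise) reweighting

def banded_mask(p, B):
    """0/1 mask keeping entries within bandwidth B of the diagonal."""
    idx = np.arange(p)
    return (np.abs(idx[:, None] - idx[None, :]) <= B).astype(float)

# Example: estimate a bandwidth-2 covariance from n = 500 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
Sigma_hat = masked_sample_covariance(X, banded_mask(50, 2))

With the banded mask above, only entries within bandwidth B of the diagonal are retained, which is the setting of the sample-complexity bound quoted in the abstract.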
Four lectures on probabilistic methods for data science
Methods of high-dimensional probability play a central role in applications
in statistics, signal processing, theoretical computer science, and related
fields. These lectures present a sample of particularly useful tools of
high-dimensional probability, focusing on the classical and matrix Bernstein
inequalities and the uniform matrix deviation inequality. We illustrate these
tools with applications for dimension reduction, network analysis, covariance
estimation, matrix completion and sparse signal recovery. The lectures are
geared towards beginning graduate students who have taken a rigorous course in
probability but may not have any experience in data science applications.
Comment: Lectures given at the 2016 PCMI Graduate Summer School in Mathematics of Data. Some typos and inaccuracies fixed.
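For reference, the matrix Bernstein inequality featured in these lectures is usually stated as follows (symmetric, bounded case; the notation here follows the standard statement and may differ slightly from the notes). Let X_1, ..., X_N be independent, mean-zero, d x d symmetric random matrices with ||X_i|| <= L almost surely. Then for all t >= 0,

\[
  \mathbb{P}\Bigl\{\Bigl\|\sum_{i=1}^{N} X_i\Bigr\| \ge t\Bigr\}
  \le 2d \exp\!\Bigl(\frac{-t^2/2}{\sigma^2 + Lt/3}\Bigr),
  \qquad
  \sigma^2 = \Bigl\|\sum_{i=1}^{N} \mathbb{E}\,X_i^2\Bigr\|.
\]

The dimensional factor 2d is what distinguishes the matrix version from the scalar Bernstein inequality.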
Sparsity and adaptivity for the blind separation of partially correlated sources
Blind source separation (BSS) is a very popular technique to analyze
multichannel data. In this context, the data are modeled as the linear
combination of sources to be retrieved. For that purpose, standard BSS methods
all rely on some discrimination principle, whether it is statistical
independence or morphological diversity, to distinguish between the sources.
However, dealing with real-world data reveals that such assumptions are rarely
valid in practice: the signals of interest are more likely partially
correlated, which generally hampers the performance of standard BSS methods.
In this article, we introduce a novel sparsity-enforcing BSS method coined
Adaptive Morphological Component Analysis (AMCA), which is designed to retrieve
sparse and partially correlated sources. More precisely, it exploits an
adaptive reweighting scheme to favor or penalize samples based on their level
of correlation. Extensive numerical experiments show
that the proposed method is robust to the partial correlation of sources while
standard BSS techniques fail. The AMCA algorithm is evaluated in the field of
astrophysics for the separation of physical components from microwave data.
Comment: Submitted to IEEE Transactions on Signal Processing.
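The abstract does not spell out the reweighting rule, but the idea can be sketched as follows: samples whose energy is spread across several estimated sources (a proxy for partial correlation) are down-weighted in the mixing-matrix update. Everything below, including the weight formula, the parameter q, and the function names, is an illustrative assumption rather than the AMCA algorithm as published.

import numpy as np

def adaptive_weights(S, q=0.5, eps=1e-8):
    # Per-sample weights; S is (k, t), one row per estimated source.
    # A sample active in a single source has spread near 1 (weight near 1);
    # a sample shared across many sources has large spread (small weight).
    energy = np.abs(S)
    spread = energy.sum(axis=0) / (energy.max(axis=0) + eps)
    return 1.0 / (spread**q + eps)

def weighted_mixing_update(X, S, w):
    # Weighted least-squares update A = X W S^T (S W S^T)^{-1}, W = diag(w),
    # so down-weighted (correlated) samples barely influence A.
    SW = S * w                       # scale each column of S by its weight
    return (X @ SW.T) @ np.linalg.inv(S @ SW.T)

Iterating such a weighted update with a sparsity-enforcing source update is one way to realize the favor/penalize principle the abstract describes.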
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
We analyze a compression scheme for large data sets that randomly keeps a
small percentage of the components of each data sample. The benefit is that the
output is a sparse matrix and therefore subsequent processing, such as PCA or
K-means, is significantly faster, especially in a distributed-data setting.
Furthermore, the sampling is single-pass and applicable to streaming data. The
sampling mechanism is a variant of previous methods proposed in the literature
combined with a randomized preconditioning to smooth the data. We provide
guarantees for PCA in terms of the covariance matrix, and guarantees for
K-means in terms of the error in the center estimators at a given step. We
present numerical evidence to show both that our bounds are nearly tight and
that our algorithms provide a real benefit when applied to standard test data
sets, as well as advantages over related sampling approaches.
Comment: 28 pages, 10 figures.
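As a rough illustration of the pipeline, randomized preconditioning to smooth the data followed by entrywise subsampling with rescaling so the sparse output is unbiased, here is a short sketch; the DCT-based mixing and the uniform Bernoulli sampling are stand-ins chosen for concreteness and may differ from the scheme analyzed in the paper.

import numpy as np
from scipy.fft import dct   # fast orthonormal transform used for mixing

def precondition_and_sparsify(X, gamma, seed=0):
    # X: (n, p) data, one sample per row; gamma: keep probability.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=X.shape[1])
    # Random sign flips plus an orthonormal DCT spread each sample's
    # energy across coordinates, so no single entry carries too much mass.
    Y = dct(X * signs, axis=1, norm='ortho')
    # Keep each entry independently with probability gamma and rescale
    # by 1/gamma: the sparse matrix is an unbiased estimate of Y.
    keep = rng.random(Y.shape) < gamma
    return np.where(keep, Y / gamma, 0.0)

Because the mixing step is orthogonal, covariance information about the original data can be recovered by undoing the sign flips and the transform, which is what makes downstream PCA on the sparse matrix meaningful.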