1,326 research outputs found
Algorithms and Hardness for Robust Subspace Recovery
We consider a fundamental problem in unsupervised learning called
\emph{subspace recovery}: given a collection of $m$ points in $\mathbb{R}^n$,
if many but not necessarily all of these points are contained in a
$d$-dimensional subspace $T$, can we find it? The points contained in $T$ are
called {\em inliers} and the remaining points are {\em outliers}. This problem
has received considerable attention in computer science and in statistics. Yet
efficient algorithms from computer science are not robust to {\em adversarial}
outliers, and the estimators from robust statistics are hard to compute in high
dimensions.
Are there algorithms for subspace recovery that are both robust to outliers
and efficient? We give an algorithm that finds $T$ when it contains more than a
$\frac{d}{n}$ fraction of the points. Hence, for say $d = n/2$ this estimator
is both easy to compute and well-behaved when there are a constant fraction of
outliers. We prove that it is Small Set Expansion hard to find $T$ when the
fraction of errors is any larger, thus giving evidence that our estimator is an
{\em optimal} compromise between efficiency and robustness.
As it turns out, this basic problem has a surprising number of connections to
other areas including small set expansion, matroid theory and functional
analysis that we make use of here.
Comment: Appeared in Proceedings of COLT 2013
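As an illustration of the recovery model only (not the paper's algorithm), the following Python sketch generates a synthetic inlier/outlier instance and runs a naive RANSAC-style baseline; the dimensions and the inlier fraction are arbitrary choices. A RANSAC baseline needs roughly $(1/\alpha)^d$ trials to hit an all-inlier sample when the inlier fraction is $\alpha$, which is the exponential behavior a polynomial-time estimator must avoid.

```python
# Synthetic robust subspace recovery instance plus a naive RANSAC-style
# baseline. Illustration of the model only -- NOT the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 400           # ambient dimension, subspace dimension, points
alpha = 0.4                    # inlier fraction; note alpha > d/n = 0.25

T = np.linalg.qr(rng.standard_normal((n, d)))[0]   # orthonormal basis of T
inliers = rng.standard_normal((int(alpha * m), d)) @ T.T
outliers = rng.standard_normal((m - int(alpha * m), n))
X = np.vstack([inliers, outliers])

def ransac_subspace(X, d, trials=2000):
    """Fit a d-dimensional subspace through d random points; keep the one
    that contains (up to tolerance) the most points."""
    best, best_count = None, -1
    for _ in range(trials):
        S = X[rng.choice(len(X), size=d, replace=False)]
        Q = np.linalg.qr(S.T)[0]                 # orthonormal basis of span(S)
        resid = np.linalg.norm(X - X @ Q @ Q.T, axis=1)
        count = int(np.sum(resid < 1e-8))
        if count > best_count:
            best, best_count = Q, count
    return best, best_count

_, count = ransac_subspace(X, d)
print(f"points on the recovered subspace: {count} of {m}")
```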
Fourier PCA and Robust Tensor Decomposition
Fourier PCA is Principal Component Analysis of a matrix obtained from higher
order derivatives of the logarithm of the Fourier transform of a distribution.
We make this method algorithmic by developing a tensor decomposition method
for a pair of tensors sharing the same vectors in rank-$1$ decompositions. Our
main application is the first provably polynomial-time algorithm for
underdetermined ICA, i.e., learning an $n \times m$ matrix $A$ from
observations $y = Ax$ where $x$ is drawn from an unknown product distribution
with arbitrary non-Gaussian components. The number of component distributions
$m$ can be arbitrarily higher than the dimension $n$, and the columns of $A$
only need to satisfy a natural and efficiently verifiable nondegeneracy
condition. As a second application, we give an alternative algorithm for
learning mixtures of spherical Gaussians with linearly independent means.
These results also hold in the presence of Gaussian noise.
Comment: Extensively revised; details added; minor errors corrected;
exposition improved
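The matrix in the first sentence can be written down explicitly: the Hessian of $\log \mathbb{E}[e^{i\langle u, x\rangle}]$ is minus a Fourier-reweighted covariance matrix. Below is a minimal numpy sketch of its empirical version; it implements only this definitional object, not the paper's tensor pairing or decomposition method, and the sanity check uses the fact that for standard Gaussian data the Hessian is $-I$ at every $u$.

```python
# Empirical second-derivative matrix of the log characteristic function,
# the object Fourier PCA performs PCA on (definition only, not the paper's
# full algorithm).
import numpy as np

def fourier_pca_matrix(X, u):
    """Hessian of log E[exp(i <u, x>)] estimated from samples X (N x n):
    -(E[x x^T w] / E[w] - mu mu^T), where w = exp(i <u, x>) are complex
    weights and mu = E[x w] / E[w] is the reweighted mean."""
    w = np.exp(1j * X @ u)
    Ew = w.mean()
    mu = (X * w[:, None]).mean(axis=0) / Ew
    S = np.einsum('ti,tj,t->ij', X, X, w) / len(X) / Ew
    return -(S - np.outer(mu, mu))

# Sanity check: for x ~ N(0, I), log E[exp(i<u,x>)] = -|u|^2/2, Hessian = -I.
rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 4))
u = rng.standard_normal(4)
print(np.round(fourier_pca_matrix(X, u).real, 2))
```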
Non-Gaussian Component Analysis using Entropy Methods
Non-Gaussian component analysis (NGCA) is a problem in multidimensional data
analysis which, since its formulation in 2006, has attracted considerable
attention in statistics and machine learning. In this problem, we have a random
variable $X$ in $n$-dimensional Euclidean space. There is an unknown subspace
$\Gamma$ of the $n$-dimensional Euclidean space such that the orthogonal
projection of $X$ onto $\Gamma$ is standard multidimensional Gaussian and the
orthogonal projection of $X$ onto $\Gamma^\perp$, the orthogonal complement
of $\Gamma$, is non-Gaussian, in the sense that all its one-dimensional
marginals are different from the Gaussian in a certain metric defined in terms
of moments. The NGCA problem is to approximate the non-Gaussian subspace
$\Gamma^\perp$ given samples of $X$.
Vectors in $\Gamma^\perp$ correspond to `interesting' directions, whereas
vectors in $\Gamma$ correspond to the directions where the data is very noisy.
The most interesting applications of the NGCA model are for the case when the
magnitude of the noise is comparable to that of the true signal, a setting in
which traditional noise reduction techniques such as PCA don't apply directly.
NGCA is also related to dimension reduction and to other data analysis problems
such as ICA. NGCA-like problems have been studied in statistics for a long time
using techniques such as projection pursuit.
We give an algorithm that takes polynomial time in the dimension $n$ and has
an inverse polynomial dependence on the error parameter measuring the angle
distance between the non-Gaussian subspace and the subspace output by the
algorithm. Our algorithm is based on relative entropy as the contrast function
and fits under the projection pursuit framework. The techniques we develop for
analyzing our algorithm may be of use for other related problems.
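To make the model concrete, here is a small sketch that samples from an NGCA instance: standard Gaussian along a random $\Gamma$ and a non-Gaussian component (uniform, an arbitrary illustrative choice) along $\Gamma^\perp$, then verifies non-Gaussianity of a one-dimensional marginal via its excess kurtosis. The relative-entropy contrast function of the algorithm itself is not reproduced here.

```python
# Sampling from the NGCA model: Gaussian on Gamma, non-Gaussian on Gamma_perp.
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3                                   # ambient dim, dim(Gamma_perp)
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
Gamma_perp, Gamma = Q[:, :k], Q[:, k:]         # orthogonal complements

def sample(N):
    g = rng.standard_normal((N, n - k))                 # Gaussian part
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), (N, k))    # unit-variance uniform
    return g @ Gamma.T + s @ Gamma_perp.T

X = sample(100_000)
p = X @ Gamma_perp[:, 0]          # marginal along an 'interesting' direction
print("excess kurtosis along Gamma_perp:",
      np.mean(p**4) / np.mean(p**2)**2 - 3)    # ~ -1.2; ~0 along Gamma
```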
Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity
We present a simple, general technique for reducing the sample complexity of
matrix and tensor decomposition algorithms applied to distributions. We use the
technique to give a polynomial-time algorithm for standard ICA with sample
complexity nearly linear in the dimension, thereby improving substantially on
previous bounds. The analysis is based on properties of random polynomials,
namely the spacings of an ensemble of polynomials. Our technique also applies
to other applications of tensor decompositions, including spherical Gaussian
mixture models.
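For context on the objects whose estimation drives such sample-complexity bounds: ICA tensor methods operate on moment or cumulant tensors of the data; for $x = As$ with independent zero-mean sources, the fourth-order cumulant tensor decomposes as $\sum_j \kappa_4(s_j)\, a_j^{\otimes 4}$. The sketch below estimates that tensor from samples. This is standard ICA machinery, not the paper's max/min spacings technique.

```python
# Empirical fourth-order cumulant tensor -- the standard input to tensor
# decomposition methods for ICA (not this paper's specific technique).
import numpy as np

def fourth_cumulant_tensor(X):
    """X: (N, n) zero-mean samples. Returns kappa_{ijkl} = E[xi xj xk xl]
    - E[xi xj]E[xk xl] - E[xi xk]E[xj xl] - E[xi xl]E[xj xk]."""
    N = len(X)
    C = X.T @ X / N                                    # covariance
    M4 = np.einsum('ti,tj,tk,tl->ijkl', X, X, X, X, optimize=True) / N
    return (M4 - np.einsum('ij,kl->ijkl', C, C)
               - np.einsum('ik,jl->ijkl', C, C)
               - np.einsum('il,jk->ijkl', C, C))

# Example: uniform sources have kappa4 = -1.2, so the (0,0,0,0) entry should
# be close to -1.2 * sum_r A[0, r]^4.
rng = np.random.default_rng(3)
n, N = 3, 200_000
A = rng.standard_normal((n, n))
S = rng.uniform(-np.sqrt(3), np.sqrt(3), (N, n))
K = fourth_cumulant_tensor(S @ A.T)
print(round(K[0, 0, 0, 0], 3), round(float(-1.2 * np.sum(A[0] ** 4)), 3))
```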