Principal component analysis (PCA) is one of the most fundamental procedures
in exploratory data analysis and is the basic step in applications ranging from
quantitative finance and bioinformatics to image analysis and neuroscience.
However, it is well documented that the applicability of PCA in many real
scenarios can be limited by its sensitivity to outliers such as corrupted
observations. We consider the following algorithmic question about
PCA with outliers: for a set of n points in R^d, how can one identify a
subset of points, say 1% of the total number of points, such that the
remaining points are best fit by some unknown r-dimensional
subspace? We provide a rigorous algorithmic analysis of the problem. We show
that the problem is solvable in time n^{O(d^2)}. In particular, for constant
dimension the problem is solvable in polynomial time. We complement the
algorithmic result with a lower bound: unless the Exponential Time
Hypothesis fails, in time f(d)·n^{o(d)}, for any function f of d, it is
impossible not only to solve the problem exactly but even to approximate it
within a constant factor.
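
To make the objective concrete, the following minimal Python sketch (not the paper's algorithm) spells out the problem: given n points in R^d, a target dimension r, and an outlier budget k, choose k points to discard so that the total squared distance of the remaining points to their best-fit r-dimensional subspace is minimized. The helper names, the brute-force search over outlier sets, and the choice of a linear subspace through the origin are illustrative assumptions.

# Minimal illustrative sketch of the PCA-with-outliers objective; this is NOT
# the paper's n^{O(d^2)} algorithm. The helper names and the use of a linear
# subspace through the origin are assumptions made for illustration only.
from itertools import combinations

import numpy as np


def fit_cost(points, r):
    """Sum of squared distances from `points` to their best-fit r-dimensional
    linear subspace; by the Eckart-Young theorem this equals the sum of the
    squared singular values beyond the top r."""
    s = np.linalg.svd(points, compute_uv=False)
    return float(np.sum(s[r:] ** 2))


def pca_with_outliers_bruteforce(points, k, r):
    """Try every subset of k candidate outliers and keep the one whose removal
    gives the cheapest r-dimensional fit. Exponential in k; only meant to make
    the objective concrete on tiny instances."""
    n = points.shape[0]
    best_cost, best_outliers = np.inf, None
    for outliers in combinations(range(n), k):
        kept = np.delete(points, outliers, axis=0)
        cost = fit_cost(kept, r)
        if cost < best_cost:
            best_cost, best_outliers = cost, set(outliers)
    return best_cost, best_outliers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 50 points lying on a line through the origin in R^3, plus 2 corrupted points.
    inliers = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 3))
    corrupted = rng.normal(scale=10.0, size=(2, 3))
    data = np.vstack([inliers, corrupted])
    cost, removed = pca_with_outliers_bruteforce(data, k=2, r=1)
    print(cost, removed)  # near-zero cost; the two corrupted points are identified

The cost of the best rank-r fit follows from the singular values of the retained points (Eckart-Young); the exhaustive search over size-k subsets is exponential and serves only to state the objective, which is exactly what the n^{O(d^2)} algorithm and the ETH-based lower bound above address.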