The Noisy Power Method: A Meta Algorithm with Applications
We provide a new robust convergence analysis of the well-known power method
for computing the dominant singular vectors of a matrix, in a variant that we
call the noisy power method. Our result characterizes the convergence behavior
of the algorithm when a significant amount of noise is introduced after each
matrix-vector multiplication. The noisy power method can be seen as a
meta-algorithm that has recently found a number of important applications in a
broad range of machine learning problems including alternating minimization for
matrix completion, streaming principal component analysis (PCA), and
privacy-preserving spectral analysis. Our general analysis subsumes several
existing ad-hoc convergence bounds and resolves a number of open problems in
multiple applications including streaming PCA and privacy-preserving singular
vector computation.
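As a rough sketch of the template (my illustration, not code from the paper), the noisy power method alternates a block matrix multiplication, the injection of a noise term, and re-orthonormalization:

```python
import numpy as np

def noisy_power_method(A, k, iters, noise_scale, rng):
    """Illustrative noisy power method for the top-k right singular vectors of A.

    Each iteration multiplies by A^T A, adds a noise matrix G_t (Gaussian here,
    standing in for whatever noise the application introduces, e.g. privacy or
    streaming noise), and re-orthonormalizes with a QR factorization.
    """
    X = np.linalg.qr(rng.normal(size=(A.shape[1], k)))[0]  # random orthonormal start
    for _ in range(iters):
        Y = A.T @ (A @ X)                            # power step
        G = noise_scale * rng.normal(size=Y.shape)   # noise after the multiplication
        X = np.linalg.qr(Y + G)[0]                   # back to an orthonormal frame
    return X

# Toy usage: a matrix with two dominant singular directions.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50)) @ np.diag([10.0, 5.0] + [0.1] * 48)
V = noisy_power_method(A, k=2, iters=30, noise_scale=1e-3, rng=rng)
```

The convergence question the paper answers is how large the noise terms G_t may be, per iteration, while the iterates still converge to the true dominant subspace.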
Preventing False Discovery in Interactive Data Analysis is Hard
We show that, under a standard hardness assumption, there is no
computationally efficient algorithm that, given $n$ samples from an unknown
distribution, can give valid answers to $n^{3+o(1)}$ adaptively chosen
statistical queries. A statistical query asks for the expectation of a
predicate over the underlying distribution, and an answer to a statistical
query is valid if it is "close" to the correct expectation over the
distribution.
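As a toy illustration (mine, not the paper's), a statistical query and the naive empirical answer to it take only a few lines of Python; the difficulty the paper studies arises when the predicate of a later query is chosen based on the empirical answers to earlier ones:

```python
import numpy as np

def empirical_answer(sample, predicate):
    """Naive answer to a statistical query: the empirical mean of the predicate."""
    return float(np.mean([predicate(x) for x in sample]))

# Hypothetical setup: the unknown distribution is N(0, 1) and the query asks for
# the expectation of the predicate 1{x > 0}, whose true value is 0.5.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(empirical_answer(sample, lambda x: x > 0))  # valid if close to 0.5
```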
Our result stands in stark contrast to the well known fact that exponentially
many statistical queries can be answered validly and efficiently if the queries
are chosen non-adaptively (no query may depend on the answers to previous
queries). Moreover, a recent work by Dwork et al. shows how to accurately
answer exponentially many adaptively chosen statistical queries via a
computationally inefficient algorithm; and how to answer a quadratic number of
adaptive queries via a computationally efficient algorithm. The latter result
implies that our result is tight up to a linear factor in $n$.
Conceptually, our result demonstrates that achieving statistical validity
alone can be a source of computational intractability in adaptive settings. For
example, in the modern large collaborative research environment, data analysts
typically choose a particular approach based on previous findings. False
discovery occurs if a research finding is supported by the data but not by the
underlying distribution. While the study of preventing false discovery in
statistics is decades old, to the best of our knowledge our result is the first
to demonstrate a computational barrier. In particular, our result suggests that
the perceived difficulty of preventing false discovery in today's collaborative
research environment may be inherent.
Algorithms and Hardness for Robust Subspace Recovery
We consider a fundamental problem in unsupervised learning called
\emph{subspace recovery}: given a collection of points in $\mathbb{R}^n$,
if many but not necessarily all of these points are contained in a
$d$-dimensional subspace $T$, can we find it? The points contained in $T$ are
called {\em inliers} and the remaining points are {\em outliers}. This problem
has received considerable attention in computer science and in statistics. Yet
efficient algorithms from computer science are not robust to {\em adversarial}
outliers, and the estimators from robust statistics are hard to compute in high
dimensions.
Are there algorithms for subspace recovery that are both robust to outliers
and efficient? We give an algorithm that finds $T$ when it contains more than
a $d/n$ fraction of the points. Hence, for say $d = n/2$, this estimator
is both easy to compute and well-behaved when there are a constant fraction of
outliers. We prove that it is Small Set Expansion hard to find $T$ when the
fraction of errors is any larger, thus giving evidence that our estimator is an
{\em optimal} compromise between efficiency and robustness.
As it turns out, this basic problem has a surprising number of connections to
other areas including small set expansion, matroid theory and functional
analysis that we make use of here.
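For intuition about the setup (a toy instance of my own, not the paper's algorithm), the code below plants a d-dimensional subspace T, mixes inliers on T with random outliers, and runs the naive SVD estimator; that estimator happens to work against benign random outliers like these, but it is exactly the kind of non-robust estimator that adversarial outliers defeat:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 20, 5, 400      # ambient dimension, subspace dimension, point count

# Plant an unknown d-dimensional subspace T (as an orthonormal basis in R^n).
T = np.linalg.qr(rng.normal(size=(n, d)))[0]

inliers = rng.normal(size=(m // 2, d)) @ T.T   # points lying exactly on T
outliers = rng.normal(size=(m // 2, n))        # points scattered off T
points = np.vstack([inliers, outliers])

# Naive estimate of T: the top-d right singular vectors of the point matrix.
_, _, Vt = np.linalg.svd(points, full_matrices=False)
T_hat = Vt[:d].T

# Recovery check: singular values of T^T T_hat near 1 mean T_hat is close to T.
print(np.linalg.svd(T.T @ T_hat, compute_uv=False))
```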
Beating Randomized Response on Incoherent Matrices
Computing accurate low rank approximations of large matrices is a fundamental
data mining task. In many applications, however, the matrix contains sensitive
information about individuals. In such cases we would like to release a low rank
approximation that satisfies a strong privacy guarantee such as differential
privacy. Unfortunately, to date the best known algorithm for this task that
satisfies differential privacy is based on naive input perturbation or
randomized response: each entry of the matrix is perturbed independently by a
sufficiently large random noise variable, and a low rank approximation is then
computed on the resulting matrix.
We give (the first) significant improvements in accuracy over randomized
response under the natural and necessary assumption that the matrix has low
coherence. Our algorithm is also very efficient and finds a constant rank
approximation of an m x n matrix in time O(mn). Note that even generating the
noise matrix required for randomized response already requires time O(mn)
- …
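For concreteness, here is a minimal sketch of that randomized-response baseline (an illustration under assumed parameters, not the paper's improved algorithm): perturb every entry with Gaussian noise calibrated to the privacy budget, then take the best rank-k approximation of the noisy matrix.

```python
import numpy as np

def randomized_response_lowrank(A, k, eps, delta, rng):
    """Baseline: input perturbation (randomized response) + truncated SVD.

    Assumes neighboring matrices differ by at most 1 in a single entry, so the
    standard Gaussian-mechanism noise scale applies entrywise.
    """
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / eps   # Gaussian mechanism scale
    noisy = A + sigma * rng.normal(size=A.shape)        # already takes O(mn) time
    U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]                  # best rank-k approximation

rng = np.random.default_rng(0)
A = rng.random(size=(100, 80))                          # entries in [0, 1]
A_k = randomized_response_lowrank(A, k=5, eps=1.0, delta=1e-6, rng=rng)
```

The entrywise noise swamps the signal unless the matrix is large or structured, which is the accuracy gap the paper's algorithm narrows under the low-coherence assumption.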