120 research outputs found
Algorithms and Hardness for Robust Subspace Recovery
We consider a fundamental problem in unsupervised learning called
\emph{subspace recovery}: given a collection of points in ,
if many but not necessarily all of these points are contained in a
-dimensional subspace can we find it? The points contained in are
called {\em inliers} and the remaining points are {\em outliers}. This problem
has received considerable attention in computer science and in statistics. Yet
efficient algorithms from computer science are not robust to {\em adversarial}
outliers, and the estimators from robust statistics are hard to compute in high
dimensions.
Are there algorithms for subspace recovery that are both robust to outliers
and efficient? We give an algorithm that finds when it contains more than a
fraction of the points. Hence, for say this estimator
is both easy to compute and well-behaved when there are a constant fraction of
outliers. We prove that it is Small Set Expansion hard to find when the
fraction of errors is any larger, thus giving evidence that our estimator is an
{\em optimal} compromise between efficiency and robustness.
As it turns out, this basic problem has a surprising number of connections to
other areas including small set expansion, matroid theory and functional
analysis that we make use of here.Comment: Appeared in Proceedings of COLT 201
A Polynomial Time Algorithm for Lossy Population Recovery
We give a polynomial time algorithm for the lossy population recovery
problem. In this problem, the goal is to approximately learn an unknown
distribution on binary strings of length from lossy samples: for some
parameter each coordinate of the sample is preserved with probability
and otherwise is replaced by a `?'. The running time and number of
samples needed for our algorithm is polynomial in and for
each fixed . This improves on algorithm of Wigderson and Yehudayoff that
runs in quasi-polynomial time for any and the polynomial time
algorithm of Dvir et al which was shown to work for by
Batman et al. In fact, our algorithm also works in the more general framework
of Batman et al. in which there is no a priori bound on the size of the support
of the distribution. The algorithm we analyze is implicit in previous work; our
main contribution is to analyze the algorithm by showing (via linear
programming duality and connections to complex analysis) that a certain matrix
associated with the problem has a robust local inverse even though its
condition number is exponentially small. A corollary of our result is the first
polynomial time algorithm for learning DNFs in the restriction access model of
Dvir et al
A solution to the Papadimitriou-Ratajczak conjecture
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 32-33).Geographic Routing is a family of routing algorithms that uses geographic point locations as addresses for the purposes of routing. Such routing algorithms have proven to be both simple to implement and heuristically effective when applied to wireless sensor networks. Greedy Routing is a natural abstraction of this model in which nodes are assigned virtual coordinates in a metric space, aid these coordinates are used to perform point-to-point routing. Here we resolve a conjecture of Papadimitrion and Ratajczak that every 3-connected planar graph admits a greedy embedding into the Eulidean plane. This immediately implies that all 3-connected graphs that exclude K₃,₃ as a minor admit a greedy embedding into the Euclidean plane. Additionally, we provide the first non-trivial examples of graphs that admit no such embedding. These structural results provide efficiently verifiable certificates that a graph admits a greedy embedding or that a graph admits no greedy embedding into the Euclideau plane.by Ankur Moitra.S.M
An Almost Optimal Algorithm for Computing Nonnegative Rank
Here, we give an algorithm for deciding if the nonnegative rank of a matrix M of dimension m \times n$ is at most r which runs in time (nm)[superscript O(r2)]. This is the first exact algorithm that runs in time singly exponential in r. This algorithm (and earlier algorithms) are built on methods for finding a solution to a system of polynomial inequalities (if one exists). Notably, the best algorithms for this task run in time exponential in the number of variables but polynomial in all of the other parameters (the number of inequalities and the maximum degree). Hence, these algorithms motivate natural algebraic questions whose solution have immediate algorithmic implications: How many variables do we need to represent the decision problem, and does M have nonnegative rank at most r? A naive formulation uses nr + mr variables and yields an algorithm that is exponential in n and m even for constant r. Arora et al. [Proceedings of STOC, 2012, pp. 145--162] recently reduced the number of variables to 2r[superscript 2] 2[superscript r], and here we exponentially reduce the number of variables to 2r[superscript 2] and this yields our main algorithm. In fact, the algorithm that we obtain is nearly optimal (under the exponential time hypothesis) since an algorithm that runs in time (nm)[superscript o(r)] would yield a subexponential algorithm for 3-SAT [Proceedings of STOC, 2012, pp. 145--162]. Our main result is based on establishing a normal form for nonnegative matrix factorization---which in turn allows us to exploit algebraic dependence among a large collection of linear transformations with variable entries. Additionally, we also demonstrate that nonnegative rank cannot be certified by even a very large submatrix of M, and this property also follows from the intuition gained from viewing nonnegative rank through the lens of systems of polynomial inequalities.National Science Foundation (U.S.) (Computing and Innovation Fellowship)National Science Foundation (U.S.) (grant DMS-0835373
Learning Topic Models - Going beyond SVD
Topic Modeling is an approach used for automatic comprehension and
classification of data in a variety of settings, and perhaps the canonical
application is in uncovering thematic structure in a corpus of documents. A
number of foundational works both in machine learning and in theory have
suggested a probabilistic model for documents, whereby documents arise as a
convex combination of (i.e. distribution on) a small number of topic vectors,
each topic vector being a distribution on words (i.e. a vector of
word-frequencies). Similar models have since been used in a variety of
application areas; the Latent Dirichlet Allocation or LDA model of Blei et al.
is especially popular.
Theoretical studies of topic modeling focus on learning the model's
parameters assuming the data is actually generated from it. Existing approaches
for the most part rely on Singular Value Decomposition(SVD), and consequently
have one of two limitations: these works need to either assume that each
document contains only one topic, or else can only recover the span of the
topic vectors instead of the topic vectors themselves.
This paper formally justifies Nonnegative Matrix Factorization(NMF) as a main
tool in this context, which is an analog of SVD where all vectors are
nonnegative. Using this tool we give the first polynomial-time algorithm for
learning topic models without the above two limitations. The algorithm uses a
fairly mild assumption about the underlying topic matrix called separability,
which is usually found to hold in real-life data. A compelling feature of our
algorithm is that it generalizes to models that incorporate topic-topic
correlations, such as the Correlated Topic Model and the Pachinko Allocation
Model.
We hope that this paper will motivate further theoretical results that use
NMF as a replacement for SVD - just as NMF has come to replace SVD in many
applications
Tensor Completion Made Practical
Tensor completion is a natural higher-order generalization of matrix
completion where the goal is to recover a low-rank tensor from sparse
observations of its entries. Existing algorithms are either heuristic without
provable guarantees, based on solving large semidefinite programs which are
impractical to run, or make strong assumptions such as requiring the factors to
be nearly orthogonal. In this paper we introduce a new variant of alternating
minimization, which in turn is inspired by understanding how the progress
measures that guide convergence of alternating minimization in the matrix
setting need to be adapted to the tensor setting. We show strong provable
guarantees, including showing that our algorithm converges linearly to the true
tensors even when the factors are highly correlated and can be implemented in
nearly linear time. Moreover our algorithm is also highly practical and we show
that we can complete third order tensors with a thousand dimensions from
observing a tiny fraction of its entries. In contrast, and somewhat
surprisingly, we show that the standard version of alternating minimization,
without our new twist, can converge at a drastically slower rate in practice
- …