Robust Near-Separable Nonnegative Matrix Factorization Using Linear Optimization
Nonnegative matrix factorization (NMF) has been shown recently to be
tractable under the separability assumption, under which all the columns of the
input data matrix belong to the convex cone generated by only a few of these
columns. Bittorf, Recht, Ré and Tropp (`Factoring nonnegative matrices with
linear programs', NIPS 2012) proposed a linear programming (LP) model, referred
to as Hottopixx, which is robust under any small perturbation of the input
matrix. However, Hottopixx has two important drawbacks: (i) the input matrix
has to be normalized, and (ii) the factorization rank has to be known in
advance. In this paper, we generalize Hottopixx in order to resolve these two
drawbacks, that is, we propose a new LP model which does not require
normalization and detects the factorization rank automatically. Moreover, the
new LP model is more flexible, significantly more tolerant to noise, and can
easily be adapted to handle outliers and other noise models. Finally, we show
on several synthetic datasets that it outperforms Hottopixx while competing
favorably with two state-of-the-art methods.
Comment: 27 pages; 4 figures. New example; new experiment on the Swimmer data set.
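As context for the separability assumption used throughout these abstracts, the sketch below builds a small near-separable matrix: r hidden "anchor" columns generate a cone that approximately contains every other column. This is a toy illustration only; the construction, the Gaussian perturbation, and the function name are assumptions, not code from the paper.

```python
import numpy as np

def make_near_separable(m, n, r, noise_level=0.01, rng=None):
    """Build a near-separable matrix X = W @ [I_r, H'] @ Pi + N.

    Every column of the noiseless part lies in the convex cone generated by
    the r anchor columns W; `noise_level` controls the perturbation N.
    Returns X and the (shuffled) positions of the anchor columns.
    """
    rng = np.random.default_rng(rng)
    W = rng.random((m, r))                        # anchor columns (cone generators)
    Hp = rng.random((r, n - r))
    Hp /= Hp.sum(axis=0, keepdims=True)           # remaining columns: convex combinations
    H = np.hstack([np.eye(r), Hp])
    perm = rng.permutation(n)                     # hide the anchors among the other columns
    X = (W @ H)[:, perm]
    X += noise_level * rng.standard_normal(X.shape)
    anchors = np.where(perm < r)[0]               # where the r identity columns ended up
    return np.maximum(X, 0), anchors
```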
Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization
In this paper, we study the nonnegative matrix factorization problem under
the separability assumption (that is, there exists a cone spanned by a small
subset of the columns of the input nonnegative data matrix containing all
columns), which is equivalent to the hyperspectral unmixing problem under the
linear mixing model and the pure-pixel assumption. We present a family of fast
recursive algorithms, and prove they are robust under any small perturbations
of the input data matrix. This family generalizes several existing
hyperspectral unmixing algorithms and hence provides for the first time a
theoretical justification of their better practical performance.
Comment: 30 pages, 2 figures, 7 tables. Main change: improvement of the bound
of the main theorem (Th. 3), replacing r with sqrt(r).
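One well-known member of this family of recursive algorithms is the successive projection algorithm (SPA), which repeatedly selects the column of largest norm and projects the remaining data onto the orthogonal complement of that column. The minimal sketch below shows plain SPA only; it is not the paper's general family or its robust variants.

```python
import numpy as np

def spa(X, r):
    """Successive projection algorithm (SPA): greedily pick r columns of X whose
    conic hull approximately contains all columns, assuming near-separability."""
    R = X.astype(float).copy()
    selected = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # column with the largest residual norm
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                          # project out the chosen direction
        selected.append(j)
    return selected
```

On a toy near-separable matrix such as the one constructed in the earlier sketch, `spa(X, r)` should recover the planted anchor columns when the perturbation is small.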
Factoring nonnegative matrices with linear programs
This paper describes a new approach, based on linear programming, for
computing nonnegative matrix factorizations (NMFs). The key idea is a
data-driven model for the factorization where the most salient features in the
data are used to express the remaining features. More precisely, given a data
matrix X, the algorithm identifies a matrix C, subject to some linear constraints,
such that X approximately equals CX. The constraints are chosen to ensure that the
matrix C selects features; these features can then be used to find a low-rank
NMF of X. A theoretical analysis demonstrates that this approach has guarantees
similar to those of the recent NMF algorithm of Arora et al. (2012). In
contrast with this earlier work, the proposed method extends to more general
noise models and leads to efficient, scalable algorithms. Experiments with
synthetic and real datasets provide evidence that the new approach is also
superior in practice. An optimized C++ implementation can factor a
multigigabyte matrix in a matter of minutes.
Comment: 17 pages, 10 figures. Modified theorem statement for robust recovery
conditions. Revised proof techniques to make arguments more elementary.
Results on robustness when rows are duplicated have been superseded by
arxiv.org/1211.668
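To make the shape of such an LP concrete, here is a small sketch written with cvxpy. It follows the structure suggested by this and the related abstracts (X approximately equals CX, a random linear cost on the diagonal of C, trace equal to the target rank r, off-diagonal entries bounded by the corresponding diagonal entry, and a fit tolerance), but the exact norm, constraints and post-processing of the published model differ in details, so treat this as an assumed, illustrative formulation rather than the authors' algorithm.

```python
import numpy as np
import cvxpy as cp

def lp_factorization_sketch(X, r, eps, rng=None):
    """Illustrative Hottopixx-style LP: find C >= 0 with X ~= C @ X, where the
    rows of X with large diagonal entries C[j, j] serve as the selected features."""
    n = X.shape[0]
    rng = np.random.default_rng(rng)
    p = rng.random(n)                                  # random costs break ties between duplicate rows
    C = cp.Variable((n, n), nonneg=True)
    constraints = [
        cp.sum(cp.abs(X - C @ X)) <= 2 * eps,          # assumed elementwise l1 fit tolerance
        cp.trace(C) == r,                              # select r features in total
        cp.diag(C) <= 1,
    ]
    # an off-diagonal entry in column j may be nonzero only if row j is selected
    constraints += [C[i, j] <= C[j, j] for i in range(n) for j in range(n) if i != j]
    problem = cp.Problem(cp.Minimize(p @ cp.diag(C)), constraints)
    problem.solve()
    selected = np.argsort(-np.diag(C.value))[:r]       # rows with the largest diagonal weight
    return selected, C.value
```

The n-squared scalar constraints keep the sketch readable; a practical implementation would vectorize them and hand the resulting LP to a dedicated solver.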
Learning Topic Models - Going beyond SVD
Topic Modeling is an approach used for automatic comprehension and
classification of data in a variety of settings, and perhaps the canonical
application is in uncovering thematic structure in a corpus of documents. A
number of foundational works both in machine learning and in theory have
suggested a probabilistic model for documents, whereby documents arise as a
convex combination of (i.e. distribution on) a small number of topic vectors,
each topic vector being a distribution on words (i.e. a vector of
word-frequencies). Similar models have since been used in a variety of
application areas; the Latent Dirichlet Allocation or LDA model of Blei et al.
is especially popular.
Theoretical studies of topic modeling focus on learning the model's
parameters assuming the data is actually generated from it. Existing approaches
for the most part rely on Singular Value Decomposition (SVD), and consequently
have one of two limitations: these works need to either assume that each
document contains only one topic, or else can only recover the span of the
topic vectors instead of the topic vectors themselves.
This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main
tool in this context, which is an analog of SVD where all vectors are
nonnegative. Using this tool we give the first polynomial-time algorithm for
learning topic models without the above two limitations. The algorithm uses a
fairly mild assumption about the underlying topic matrix called separability,
which is usually found to hold in real-life data. A compelling feature of our
algorithm is that it generalizes to models that incorporate topic-topic
correlations, such as the Correlated Topic Model and the Pachinko Allocation
Model.
We hope that this paper will motivate further theoretical results that use
NMF as a replacement for SVD - just as NMF has come to replace SVD in many
applications.
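The generative picture sketched above is easy to simulate. The toy generator below, an assumed illustration rather than code from any of these papers, draws each document's word distribution as a convex combination of topic vectors with an LDA-style Dirichlet prior on the topic proportions; the function and parameter names are made up for the example.

```python
import numpy as np

def sample_corpus(topics, alpha, n_docs, doc_len, rng=None):
    """Toy admixture generator: each document's word distribution is a convex
    combination (Dirichlet-weighted) of topic vectors.

    `topics` is a k x V matrix whose rows are distributions over the vocabulary.
    Returns an n_docs x V matrix of word counts.
    """
    rng = np.random.default_rng(rng)
    k, V = topics.shape
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha * np.ones(k))   # topic proportions (LDA-style prior)
        word_dist = theta @ topics                  # convex combination of topic vectors
        docs.append(rng.multinomial(doc_len, word_dist))
    return np.array(docs)
```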
Robustness Analysis of Hottopixx, a Linear Programming Model for Factoring Nonnegative Matrices
Although nonnegative matrix factorization (NMF) is NP-hard in general, it has
been shown very recently that it is tractable under the assumption that the
input nonnegative data matrix is close to being separable (separability
requires that all columns of the input matrix belong to the cone spanned by a
small subset of these columns). Since then, several algorithms have been
designed to handle this subclass of NMF problems. In particular, Bittorf,
Recht, Ré and Tropp (`Factoring nonnegative matrices with linear programs',
NIPS 2012) proposed a linear programming model, referred to as Hottopixx. In
this paper, we provide a new and more general robustness analysis of their
method. In particular, we design a provably more robust variant using a
post-processing strategy which allows us to deal with duplicates and near
duplicates in the dataset.
Comment: 23 pages; new numerical results; comparison with Arora et al.;
accepted in SIAM J. Matrix Anal. Appl.
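The abstract does not describe the post-processing strategy in detail, so the sketch below is only an assumed illustration of the general idea: before keeping the r columns with the largest diagonal entries of C, near-duplicate columns of the data are grouped and their diagonal weight is aggregated, so that a duplicated anchor neither splits its weight nor gets selected twice. All names and the grouping rule are hypothetical.

```python
import numpy as np

def select_with_duplicate_merging(X, diagC, r, tol=1e-2):
    """Illustrative post-processing: group near-duplicate columns of X, add up the
    diagonal weight diagC (from a Hottopixx-type LP) within each group, and return
    one representative per group for the r groups with the largest total weight."""
    n = X.shape[1]
    Xn = X / np.maximum(np.linalg.norm(X, axis=0), 1e-12)   # normalize columns for comparison
    groups, assigned = [], np.full(n, -1)
    for j in range(n):
        if assigned[j] >= 0:
            continue
        members = [k for k in range(n)
                   if assigned[k] < 0 and np.linalg.norm(Xn[:, j] - Xn[:, k]) <= tol]
        for k in members:
            assigned[k] = len(groups)
        groups.append(members)
    weights = np.array([diagC[g].sum() for g in groups])     # aggregated weight per group
    best = np.argsort(-weights)[:r]
    # within each selected group, keep the member with the largest individual weight
    return [groups[g][int(np.argmax(diagC[groups[g]]))] for g in best]
```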
A Topic Modeling Approach to Ranking
We propose a topic modeling approach to the prediction of preferences in
pairwise comparisons. We develop a new generative model for pairwise
comparisons that accounts for multiple shared latent rankings that are
prevalent in a population of users. This new model also captures inconsistent
user behavior in a natural way. We show how the estimation of latent rankings
in the new generative model can be formally reduced to the estimation of topics
in a statistically equivalent topic modeling problem. We leverage recent
advances in the topic modeling literature to develop an algorithm that can
learn shared latent rankings with provable consistency as well as sample and
computational complexity guarantees. We demonstrate that the new approach is
empirically competitive with the current state-of-the-art approaches in
predicting preferences on some semi-synthetic and real-world datasets.