A Practical Algorithm for Topic Modeling with Provable Guarantees
Topic models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to topic model
inference have been based on a maximum likelihood objective. Efficient
algorithms exist that approximate this objective, but they have no provable
guarantees. Recently, algorithms have been introduced that provide provable
bounds, but these algorithms are not practical because they are inefficient and
not robust to violations of model assumptions. In this paper we present an
algorithm for topic model inference that is both provable and practical. The
algorithm produces results comparable to the best MCMC implementations while
running orders of magnitude faster. Comment: 26 pages
Local search: A guide for the information retrieval practitioner
There are a number of combinatorial optimisation problems in information retrieval in which the use of local search methods is worthwhile. The purpose of this paper is to show how local search can be used to solve some well-known tasks in information retrieval (IR), how previous research in the field is piecemeal, bereft of a structure and methodologically flawed, and to suggest more rigorous ways of applying local search methods to solve IR problems. We provide a query-based taxonomy for analysing the use of local search in IR tasks and an overview of issues such as fitness functions, statistical significance and test collections when conducting experiments on combinatorial optimisation problems. The paper gives a guide on the pitfalls and problems for IR practitioners who wish to use local search to solve their research issues, and gives practical advice on the use of such methods. The query-based taxonomy is a novel structure which can be used by the IR practitioner in order to examine the use of local search in IR.
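As a rough illustration of the kind of local search the abstract refers to, here is a minimal steepest-ascent hill climber over bit vectors. It is only a sketch, not the paper's method: the one-flip neighbourhood, the `TERM_WEIGHTS` values, and the `toy_fitness` function (term weight minus a length penalty) are all hypothetical stand-ins for a real IR fitness function measured on a test collection.

```python
from typing import Callable, Tuple

Mask = Tuple[int, ...]


def hill_climb(start: Mask,
               fitness: Callable[[Mask], float],
               max_steps: int = 1000) -> Tuple[Mask, float]:
    """Steepest-ascent hill climbing with a one-bit-flip neighbourhood."""
    current, current_fit = start, fitness(start)
    for _ in range(max_steps):
        # Every neighbour differs from the current solution in exactly one bit.
        neighbours = [current[:i] + (1 - current[i],) + current[i + 1:]
                      for i in range(len(current))]
        if not neighbours:
            break
        best = max(neighbours, key=fitness)
        best_fit = fitness(best)
        if best_fit <= current_fit:
            break  # no neighbour improves: local optimum reached
        current, current_fit = best, best_fit
    return current, current_fit


# Hypothetical toy problem: choose which query terms to keep.
TERM_WEIGHTS = [0.9, 0.1, 0.6, 0.05]


def toy_fitness(mask: Mask) -> float:
    """Sum of selected term weights minus a penalty of 0.2 per selected term."""
    return sum(w for w, m in zip(TERM_WEIGHTS, mask) if m) - 0.2 * sum(mask)


best_mask, best_score = hill_climb((0, 0, 0, 0), toy_fitness)
print(best_mask, round(best_score, 2))
```

On this toy problem the climber keeps the two heavily weighted terms and stops. In a real experiment the result is only a local optimum, which is one reason the abstract stresses statistical significance and careful experimental design.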
Maximally selected chi-square statistics and binary splits of nominal variables
We address the problem of maximally selected chi-square statistics in the case of a binary Y variable and a nominal X variable with several categories. The distribution of the maximally selected chi-square statistic has already been derived when the best cutpoint is chosen from a continuous or an ordinal X, but not when the best split is chosen from a nominal X. In this paper, we derive the exact distribution of the maximally selected chi-square statistic in this case using a combinatorial approach. Applications of the derived distribution to variable selection and hypothesis testing are discussed based on simulations. As an illustration, our method is applied to a pregnancy and birth data set.
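To make the statistic concrete, the sketch below computes a maximally selected chi-square by brute force: it enumerates every binary split of the nominal categories and keeps the largest Pearson chi-square of the resulting 2x2 table. It only computes the statistic, not the exact distribution the paper derives. The function names and the input layout (category mapped to counts of Y=0 and Y=1) are assumptions made for illustration, and at least two categories are assumed.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def max_selected_chi_square(
        counts: Dict[str, Tuple[int, int]]) -> Tuple[float, FrozenSet[str]]:
    """counts maps category -> (count with Y=0, count with Y=1).

    Returns the largest chi-square over all binary splits and the
    group of categories on one side of the best split.
    """
    cats = sorted(counts)
    tot0 = sum(v[0] for v in counts.values())
    tot1 = sum(v[1] for v in counts.values())
    best_stat, best_left = 0.0, frozenset()
    # Pin the first category to the left group so each split is visited once;
    # this yields 2**(K-1) - 1 splits for K categories.
    first, rest = cats[0], cats[1:]
    for r in range(len(rest)):  # r < len(rest) keeps the right group non-empty
        for extra in combinations(rest, r):
            left = frozenset((first,) + extra)
            a = sum(counts[c][0] for c in left)
            b = sum(counts[c][1] for c in left)
            stat = chi_square_2x2(a, b, tot0 - a, tot1 - b)
            if stat > best_stat:
                best_stat, best_left = stat, left
    return best_stat, best_left


# Made-up counts for three categories.
example = {"A": (10, 0), "B": (0, 10), "C": (10, 0)}
stat, left_group = max_selected_chi_square(example)
print(stat, sorted(left_group))
```

Because the maximum is taken over many splits, comparing it to an ordinary chi-square reference distribution would overstate significance, which is the problem the derived exact distribution addresses.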
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma-
and beta- negative binomial processes, especially their ability to model
over-dispersed, heavy-tailed count data, makes these well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta- negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms. Comment: To appear in Journal of the American Statistical Association (Theory
and Methods). 31 pages + 11 page supplement, 5 figures
Creating Capsule Wardrobes from Fashion Images
We propose to automatically create capsule wardrobes. Given an inventory of
candidate garments and accessories, the algorithm must assemble a minimal set
of items that provides maximal mix-and-match outfits. We pose the task as a
subset selection problem. To permit efficient subset selection over the space
of all outfit combinations, we develop submodular objective functions capturing
the key ingredients of visual compatibility, versatility, and user-specific
preference. Since adding garments to a capsule only expands its possible
outfits, we devise an iterative approach to allow near-optimal submodular
function maximization. Finally, we present an unsupervised approach to learn
visual compatibility from "in the wild" full body outfit photos; the
compatibility metric translates well to cleaner catalog photos and improves
over existing methods. Our results on thousands of pieces from popular fashion
websites show that automatic capsule creation has potential to mimic skilled
fashionistas in assembling flexible wardrobes, while being significantly more
scalable. Comment: Accepted to CVPR 201
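For readers unfamiliar with submodular subset selection, the following sketch shows the classic greedy algorithm for a monotone submodular objective under a size budget. It is not the iterative approach described in the abstract. The objective here is a toy coverage function (number of distinct style tags covered), and the garment names and tags are invented for illustration.

```python
from typing import Dict, List, Set


def greedy_max_coverage(items: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` items, each time adding the one with the largest marginal gain."""
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(budget):
        best_item, best_gain = None, 0
        for name in sorted(items):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = len(items[name] - covered)  # tags not covered yet
            if gain > best_gain:
                best_item, best_gain = name, gain
        if best_item is None:
            break  # no remaining item adds anything new
        chosen.append(best_item)
        covered |= items[best_item]
    return chosen


# Hypothetical inventory: garment -> style tags it can serve.
garments = {
    "blazer": {"formal", "work"},
    "jeans": {"casual", "weekend", "work"},
    "sneakers": {"casual", "sport"},
    "tee": {"casual"},
}
capsule = greedy_max_coverage(garments, budget=2)
print(capsule)
```

For monotone submodular objectives with a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is why casting a selection task in submodular form makes near-optimal solutions tractable.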
Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling
The beta-negative binomial process (BNBP), an integer-valued stochastic
process, is employed to partition a count vector into a latent random count
matrix. As the marginal probability distribution of the BNBP that governs the
exchangeable random partitions of grouped data has not yet been developed,
current inference for the BNBP has to truncate the number of atoms of the beta
process. This paper introduces an exchangeable partition probability function
to explicitly describe how the BNBP clusters the data points of each group into
a random number of exchangeable partitions, which are shared across all the
groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a
novel nonparametric Bayesian topic model that is distinct from existing ones,
with simple implementation, fast convergence, good mixing, and state-of-the-art
predictive performance. Comment: in Neural Information Processing Systems (NIPS) 2014. 9 pages + 3
page appendix
Influence Maximization with Bandits
We consider the problem of influence maximization, the problem of
maximizing the number of people that become aware of a product by finding the
'best' set of 'seed' users to expose the product to. Most prior work on this
topic assumes that we know the probability of each user influencing each other
user, or we have data that lets us estimate these influences. However, this
information is typically not initially available or is difficult to obtain. To
avoid this assumption, we adopt a combinatorial multi-armed bandit paradigm
that estimates the influence probabilities as we sequentially try different
seed sets. We establish bounds on the performance of this procedure under the
existing edge-level feedback as well as a novel and more realistic node-level
feedback. Beyond our theoretical results, we describe a practical
implementation and experimentally demonstrate its efficiency and effectiveness
on four real datasets. Comment: 12 pages
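The idea of estimating influence probabilities from edge-level feedback can be sketched as a small bookkeeping class. This is an illustrative assumption rather than the paper's algorithm: the class name `EdgeStats`, the upper-confidence bonus, and its constant are all hypothetical choices in the style of common combinatorial bandit methods.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]


class EdgeStats:
    """Tracks, per edge, how often an activation attempt was observed and succeeded."""

    def __init__(self) -> None:
        self.trials: Dict[Edge, int] = defaultdict(int)
        self.successes: Dict[Edge, int] = defaultdict(int)

    def update(self, edge: Edge, activated: bool) -> None:
        """Record one observed attempt along `edge` (edge-level feedback)."""
        self.trials[edge] += 1
        if activated:
            self.successes[edge] += 1

    def mean(self, edge: Edge) -> float:
        """Empirical influence probability; 0.0 if the edge was never observed."""
        n = self.trials[edge]
        return self.successes[edge] / n if n else 0.0

    def ucb(self, edge: Edge, round_no: int) -> float:
        """Optimistic estimate: mean plus an exploration bonus, clipped to 1.

        The bonus constant 1.5 is an assumed choice, not taken from the paper.
        """
        n = self.trials[edge]
        if n == 0:
            return 1.0  # unexplored edges are treated as maximally promising
        bonus = math.sqrt(1.5 * math.log(max(round_no, 2)) / n)
        return min(1.0, self.mean(edge) + bonus)


stats = EdgeStats()
for outcome in (True, True, False, False):
    stats.update(("u", "v"), outcome)
print(stats.mean(("u", "v")), stats.ucb(("x", "y"), round_no=5))
```

In each round, a seed set would be chosen using the optimistic estimates, the resulting cascade observed, and the statistics updated. Node-level feedback is harder because only activated nodes are seen, not which edge caused each activation.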