141 research outputs found
Learning mixed membership models with a separable latent structure: theory, provably efficient algorithms, and applications
In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and machine learning tasks such as topic modeling, the observed data is high-dimensional and can be modeled as arising from a data-specific probabilistic mixture of a small collection of latent factors. Being able to successfully learn the latent factors from the observed data is important for efficient data representation, inference, and prediction. Popular approaches such as variational Bayesian and MCMC methods exhibit good empirical performance on some real-world datasets, but make heavy use of approximations and heuristics for dealing with the highly non-convex and computationally intractable optimization objectives that accompany them. As a consequence, consistency or efficiency guarantees for these algorithms are rather weak.
This thesis develops a suite of algorithms with provable polynomial statistical and computational efficiency guarantees for learning a wide class of high-dimensional Mixed Membership Latent Variable Models (MMLVMs). Our approach is based on a natural separability property of the shared latent factors that is known to be either exactly or approximately satisfied by the estimates produced by variational Bayesian and MCMC methods. Latent factors are called separable when each factor contains a novel part that is predominantly unique to that factor. For a broad class of problems, we establish that separability is not only an algorithmically convenient structural condition, but is in fact an inevitable consequence of a having a relatively small number of latent factors in a high-dimensional observation space. The key insight underlying our algorithms is the identification of novel parts of each latent factor as extreme points of certain convex polytopes in a suitable representation space. We show that this can be done efficiently through appropriately defined random projections in the representation space. We establish statistical and computational efficiency bounds that are both polynomial in all the model parameters. Furthermore, the proposed random-projections-based algorithm turns out to be naturally amenable to a low-communication-cost distributed implementation which is attractive for modern web-scale distributed data mining applications.
We explore in detail two distinct classes of MMLVMs in this thesis: learning topic models for text documents based on their empirical word frequencies and learning mixed membership ranking models based on pairwise comparison data. For each problem, we demonstrate that separability is inevitable when the data dimension scales up and then establish consistency and efficiency guarantees for identifying all novel parts and estimating the latent factors. As a by-product of this analysis, we obtain the first asymptotic consistency and polynomial sample and computational complexity results for learning permutation-mixture and Mallows-mixture models for rankings based on pairwise comparison data. We demonstrate empirically that the performance of our approach is competitive with the current state-of-the-art on a number of real-world datasets
Modelling rankings in R: the PlackettLuce package
This paper presents the R package PlackettLuce, which implements a
generalization of the Plackett-Luce model for rankings data. The generalization
accommodates both ties (of arbitrary order) and partial rankings (complete
rankings of subsets of items). By default, the implementation adds a set of
pseudo-comparisons with a hypothetical item, ensuring that the underlying
network of wins and losses between items is always strongly connected. In this
way, the worth of each item always has a finite maximum likelihood estimate,
with finite standard error. The use of pseudo-comparisons also has a
regularization effect, shrinking the estimated parameters towards equal item
worth. In addition to standard methods for model summary, PlackettLuce provides
a method to compute quasi standard errors for the item parameters. This
provides the basis for comparison intervals that do not change with the choice
of identifiability constraint placed on the item parameters. Finally, the
package provides a method for model-based partitioning using covariates whose
values vary between rankings, enabling the identification of subgroups of
judges or settings that have different item worths. The features of the package
are demonstrated through application to classic and novel data sets.Comment: In v2: review of software implementing alternative models to
Plackett-Luce; comparison of algorithms provided by the PlackettLuce package;
further examples of rankings where the underlying win-loss network is not
strongly connected. In addition, general editing to improve organisation and
clarity. In v3: corrected headings Table 4, minor edit
A Topic Modeling Approach to Ranking
We propose a topic modeling approach to the prediction of preferences in
pairwise comparisons. We develop a new generative model for pairwise
comparisons that accounts for multiple shared latent rankings that are
prevalent in a population of users. This new model also captures inconsistent
user behavior in a natural way. We show how the estimation of latent rankings
in the new generative model can be formally reduced to the estimation of topics
in a statistically equivalent topic modeling problem. We leverage recent
advances in the topic modeling literature to develop an algorithm that can
learn shared latent rankings with provable consistency as well as sample and
computational complexity guarantees. We demonstrate that the new approach is
empirically competitive with the current state-of-the-art approaches in
predicting preferences on some semi-synthetic and real world datasets
A BAYESIAN MALLOWS APPROACH TO NONTRANSITIVE PAIR COMPARISON DATA : HOW HUMAN ARE SOUNDS?
We are interested in learning how listeners perceive sounds as having human origins. An experiment was performed with a series of electronically synthesized sounds, and listeners were asked to compare them in pairs. We propose a Bayesian probabilistic method to learn individual preferences from nontransitive pairwise comparison data, as happens when one (or more) individual preferences in the data contradicts what is implied by the others. We build a Bayesian Mallows model in order to handle nontransitive data, with a latent layer of uncertainty which captures the generation of preference misreporting. We then develop a mixture extension of the Mallows model, able to learn individual preferences in a heterogeneous population. The results of our analysis of the musicology experiment are of interest to electroacoustic composers and sound designers, and to the audio industry in general, whose aim is to understand how computer generated sounds can be produced in order to sound more human.Peer reviewe
Robust Plackett–Luce model for k-ary crowdsourced preferences
© 2017, The Author(s). The aggregation of k-ary preferences is an emerging ranking problem, which plays an important role in several aspects of our daily life, such as ordinal peer grading and online product recommendation. At the same time, crowdsourcing has become a trendy way to provide a plethora of k-ary preferences for this ranking problem, due to convenient platforms and low costs. However, k-ary preferences from crowdsourced workers are often noisy, which inevitably degenerates the performance of traditional aggregation models. To address this challenge, in this paper, we present a RObust PlAckett–Luce (ROPAL) model. Specifically, to ensure the robustness, ROPAL integrates the Plackett–Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects k-ary crowdsourced preferences with a certain probability. In addition, we propose an online Bayesian inference to make ROPAL scalable to large-scale preferences. We conduct comprehensive experiments on simulated and real-world datasets. Empirical results on “massive synthetic” and “real-world” datasets show that ROPAL with online Bayesian inference achieves substantial improvements in robustness and noisy worker detection over current approaches
- …