    Learning mixed membership models with a separable latent structure: theory, provably efficient algorithms, and applications

    In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and machine learning tasks such as topic modeling, the observed data is high-dimensional and can be modeled as arising from a data-specific probabilistic mixture of a small collection of latent factors. Being able to successfully learn the latent factors from the observed data is important for efficient data representation, inference, and prediction. Popular approaches such as variational Bayesian and MCMC methods exhibit good empirical performance on some real-world datasets, but make heavy use of approximations and heuristics for dealing with the highly non-convex and computationally intractable optimization objectives that accompany them. As a consequence, consistency or efficiency guarantees for these algorithms are rather weak. This thesis develops a suite of algorithms with provable polynomial statistical and computational efficiency guarantees for learning a wide class of high-dimensional Mixed Membership Latent Variable Models (MMLVMs). Our approach is based on a natural separability property of the shared latent factors that is known to be either exactly or approximately satisfied by the estimates produced by variational Bayesian and MCMC methods. Latent factors are called separable when each factor contains a novel part that is predominantly unique to that factor. For a broad class of problems, we establish that separability is not only an algorithmically convenient structural condition, but is in fact an inevitable consequence of having a relatively small number of latent factors in a high-dimensional observation space. The key insight underlying our algorithms is the identification of the novel parts of each latent factor as extreme points of certain convex polytopes in a suitable representation space. We show that this can be done efficiently through appropriately defined random projections in the representation space. We establish statistical and computational efficiency bounds that are both polynomial in all the model parameters. Furthermore, the proposed random-projections-based algorithm turns out to be naturally amenable to a low-communication-cost distributed implementation, which is attractive for modern web-scale distributed data mining applications. We explore in detail two distinct classes of MMLVMs in this thesis: learning topic models for text documents based on their empirical word frequencies, and learning mixed membership ranking models based on pairwise comparison data. For each problem, we demonstrate that separability is inevitable as the data dimension scales up and then establish consistency and efficiency guarantees for identifying all novel parts and estimating the latent factors. As a by-product of this analysis, we obtain the first asymptotic consistency and polynomial sample and computational complexity results for learning permutation-mixture and Mallows-mixture models for rankings based on pairwise comparison data. We demonstrate empirically that the performance of our approach is competitive with the current state-of-the-art on a number of real-world datasets.
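    To make the geometric insight concrete, the following is a minimal sketch of extreme-point detection by random projections, assuming the rows of a matrix X are the points of the representation space whose convex hull vertices (the candidate novel parts) are sought; the function name, the two-sided max/min rule, and the voting scheme are illustrative choices, not the thesis's exact algorithm.

```python
import numpy as np

def candidate_extreme_rows(X, n_projections=200, seed=0):
    """Vote for rows of X that are extreme points of the rows' convex hull.

    Each random direction defines a linear functional; the row that maximizes
    (or minimizes) it is a vertex of the convex hull of the rows.  Rows that
    win many such votes are candidate "novel parts" in a separable latent
    factor model.
    """
    rng = np.random.default_rng(seed)
    n_rows, dim = X.shape
    votes = np.zeros(n_rows, dtype=int)
    for _ in range(n_projections):
        direction = rng.standard_normal(dim)
        scores = X @ direction
        votes[np.argmax(scores)] += 1   # extreme in the +direction
        votes[np.argmin(scores)] += 1   # extreme in the -direction
    return np.argsort(votes)[::-1], votes
```

    In a topic-model instance of this setting, X might hold suitably normalized word co-occurrence statistics, and the top-voted rows would be screened further before the latent factors are estimated from them.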

    Modelling rankings in R: the PlackettLuce package

    This paper presents the R package PlackettLuce, which implements a generalization of the Plackett-Luce model for rankings data. The generalization accommodates both ties (of arbitrary order) and partial rankings (complete rankings of subsets of items). By default, the implementation adds a set of pseudo-comparisons with a hypothetical item, ensuring that the underlying network of wins and losses between items is always strongly connected. In this way, the worth of each item always has a finite maximum likelihood estimate with a finite standard error. The use of pseudo-comparisons also has a regularization effect, shrinking the estimated parameters towards equal item worth. In addition to standard methods for model summary, PlackettLuce provides a method to compute quasi standard errors for the item parameters. These provide the basis for comparison intervals that do not change with the choice of identifiability constraint placed on the item parameters. Finally, the package provides a method for model-based partitioning using covariates whose values vary between rankings, enabling the identification of subgroups of judges or settings with different item worths. The features of the package are demonstrated through application to classic and novel data sets.
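    The package itself is in R; purely as an illustration of the basic model it generalizes, here is a minimal Python sketch of the standard Plackett-Luce log-likelihood for (possibly partial) rankings without ties. The function and argument names are hypothetical, and none of this reflects the package's fitting algorithm or its pseudo-comparison regularization.

```python
import numpy as np

def plackett_luce_loglik(log_worth, rankings):
    """Standard Plackett-Luce log-likelihood (no ties).

    log_worth : array of shape (n_items,) with log item worths.
    rankings  : iterable of rankings; each ranking lists item indices from
                best to worst and may cover only a subset of the items.
    """
    ll = 0.0
    for ranking in rankings:
        w = log_worth[np.asarray(ranking)]
        for i in range(len(ranking) - 1):   # the last choice is forced
            # log P(item ranking[i] is chosen first among ranking[i:])
            ll += w[i] - np.log(np.sum(np.exp(w[i:])))
    return ll
```

    The pseudo-comparisons described in the abstract can be thought of as adding, for every real item, wins and losses against an extra hypothetical item before a likelihood of this form is maximized.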

    A Topic Modeling Approach to Ranking

    We propose a topic modeling approach to the prediction of preferences in pairwise comparisons. We develop a new generative model for pairwise comparisons that accounts for the multiple shared latent rankings that are prevalent in a population of users. This new model also captures inconsistent user behavior in a natural way. We show how the estimation of latent rankings in the new generative model can be formally reduced to the estimation of topics in a statistically equivalent topic modeling problem. We leverage recent advances in the topic modeling literature to develop an algorithm that can learn shared latent rankings with provable consistency as well as sample and computational complexity guarantees. We demonstrate that the new approach is empirically competitive with current state-of-the-art approaches in predicting preferences on semi-synthetic and real-world datasets.
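    One plausible way to picture the reduction described above is to encode each user's pairwise comparisons as a bag-of-words document whose vocabulary consists of ordered item pairs, so that shared latent rankings play the role of topics. The encoding below is an illustrative sketch under that assumption, not the paper's exact construction.

```python
import numpy as np

def comparisons_to_documents(comparisons, n_items):
    """Encode pairwise comparisons as a users-by-"words" count matrix.

    comparisons : list with one entry per user; each entry is a list of
                  (winner, loser) item-index pairs observed for that user.
    Returns (X, pair_index), where column pair_index[(i, j)] of X counts how
    often a user preferred item i over item j.
    """
    pair_index = {}
    for i in range(n_items):
        for j in range(n_items):
            if i != j:
                pair_index[(i, j)] = len(pair_index)
    X = np.zeros((len(comparisons), len(pair_index)), dtype=int)
    for user, prefs in enumerate(comparisons):
        for winner, loser in prefs:
            X[user, pair_index[(winner, loser)]] += 1
    return X, pair_index
```

    A provably consistent topic-model estimator (for example, one based on separability, as in the thesis abstract above) could then be run on X to recover the shared latent rankings.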

    A Bayesian Mallows approach to nontransitive pair comparison data: how human are sounds?

    We are interested in learning how listeners perceive sounds as having human origins. An experiment was performed with a series of electronically synthesized sounds, and listeners were asked to compare them in pairs. We propose a Bayesian probabilistic method to learn individual preferences from nontransitive pairwise comparison data, which arise when one or more of an individual's reported preferences contradict what is implied by the others. We build a Bayesian Mallows model to handle nontransitive data, with a latent layer of uncertainty that captures the generation of preference misreporting. We then develop a mixture extension of the Mallows model, able to learn individual preferences in a heterogeneous population. The results of our analysis of the musicology experiment are of interest to electroacoustic composers and sound designers, and to the audio industry in general, whose aim is to understand how computer-generated sounds can be produced in order to sound more human.
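    For readers unfamiliar with the model family, the core ingredient is the Mallows distribution, which weights a ranking by its Kendall-tau distance from a consensus ranking. The brute-force sketch below only illustrates that distribution for a handful of items; the Bayesian treatment of nontransitivity and the mixture extension developed in the paper are not shown.

```python
import itertools
import math

def kendall_tau(r1, r2):
    """Number of item pairs ordered differently by the two rankings
    (each ranking is a sequence of item indices from best to worst)."""
    pos = {item: k for k, item in enumerate(r2)}
    return sum(1 for i, j in itertools.combinations(r1, 2) if pos[i] > pos[j])

def mallows_pmf(consensus, alpha, n_items):
    """Mallows model: P(r) proportional to exp(-alpha * d(r, consensus)),
    normalized by brute force over all n_items! rankings (small n only)."""
    rankings = list(itertools.permutations(range(n_items)))
    weights = [math.exp(-alpha * kendall_tau(r, consensus)) for r in rankings]
    total = sum(weights)
    return {r: w / total for r, w in zip(rankings, weights)}
```

    A larger dispersion alpha concentrates the distribution around the consensus ranking; alpha = 0 gives the uniform distribution over rankings.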

    Bayesian learning of the Mallows rank model

    Robust Plackett–Luce model for k-ary crowdsourced preferences

    The aggregation of k-ary preferences is an emerging ranking problem, which plays an important role in several aspects of our daily life, such as ordinal peer grading and online product recommendation. At the same time, crowdsourcing has become a trendy way to provide a plethora of k-ary preferences for this ranking problem, due to convenient platforms and low costs. However, k-ary preferences from crowdsourced workers are often noisy, which inevitably degrades the performance of traditional aggregation models. To address this challenge, we present a RObust PlAckett–Luce (ROPAL) model. Specifically, to ensure robustness, ROPAL integrates the Plackett–Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects k-ary crowdsourced preferences with a certain probability. In addition, we propose an online Bayesian inference method to make ROPAL scalable to large-scale preferences. We conduct comprehensive experiments on simulated and real-world datasets. Empirical results on “massive synthetic” and “real-world” datasets show that ROPAL with online Bayesian inference achieves substantial improvements in robustness and noisy-worker detection over current approaches.
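    As a rough illustration of why an explicit noise component helps, the sketch below mixes the log-probabilities assigned by any clean ranking model (such as Plackett-Luce) with a uniform distribution over the k! possible orderings of a k-ary preference. This is a deliberately simplified stand-in, not the ROPAL denoising vector or its Kendall-tau-based correction.

```python
import math
import numpy as np

def robust_mixture_loglik(clean_logprobs, k, eps):
    """Total log-likelihood when each observed k-ary preference is generated
    by the clean model with probability 1 - eps, and by uniform noise over
    the k! orderings with probability eps.

    clean_logprobs : per-preference log-probabilities under the clean model.
    """
    clean_logprobs = np.asarray(clean_logprobs, dtype=float)
    log_noise = -math.log(math.factorial(k))   # log of 1 / k!
    return float(np.sum(np.logaddexp(np.log1p(-eps) + clean_logprobs,
                                     math.log(eps) + log_noise)))
```

    Intuitively, a larger eps downweights preferences that the clean model finds implausible, which is the general idea behind detecting and discounting noisy workers.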