Bayesian nonparametric Plackett-Luce models for the analysis of preferences for college degree programmes
In this paper we propose a Bayesian nonparametric model for clustering
partial ranking data. We start by developing a Bayesian nonparametric extension
of the popular Plackett-Luce choice model that can handle an infinite number of
choice items. Our framework is based on the theory of random atomic measures,
with the prior specified by a completely random measure. We characterise the
posterior distribution given data, and derive a simple and effective Gibbs
sampler for posterior simulation. We then develop a Dirichlet process mixture
extension of our model and apply it to investigate the clustering of
preferences for college degree programmes amongst Irish secondary school
graduates. The existence of clusters of applicants who have similar preferences
for degree programmes is established and we determine that subject matter and
geographical location of the third level institution characterise these
clusters. Comment: Published at http://dx.doi.org/10.1214/14-AOAS717 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
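
As a point of reference for the abstract above, the finite Plackett-Luce model assigns a probability to a partial (top-k) ranking by choosing items one at a time, each with probability proportional to its weight among the items not yet chosen. The following is a minimal sketch of that standard finite model only, not the paper's nonparametric extension or its Gibbs sampler; the function name `plackett_luce_log_prob` and the dictionary-based interface are assumptions made for illustration.

```python
import math


def plackett_luce_log_prob(ranking, weights):
    """Log-probability of a top-k ranking under a finite Plackett-Luce model.

    ranking: list of item ids, most preferred first (may omit unranked items).
    weights: dict mapping every item id to a positive weight (hypothetical
             interface, chosen for this sketch).
    """
    remaining = sum(weights.values())  # total weight of items still available
    log_p = 0.0
    for item in ranking:
        # Pick `item` with probability weight / (weight of all remaining items).
        log_p += math.log(weights[item]) - math.log(remaining)
        remaining -= weights[item]
    return log_p


# Toy example: three items, only the top two are reported.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
top_two = plackett_luce_log_prob(["c", "a"], weights)  # log(3/6 * 1/3)
```

Because unranked items only enter through the denominators, the same function handles full and partial rankings.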
Bayesian Plackett–Luce Mixture Models for Partially Ranked Data
The elicitation of an ordinal judgment on multiple alternatives is often required in many psychological
and behavioral experiments to investigate preference/choice orientation of a specific population. The
Plackett–Luce model is one of the most popular and frequently applied parametric distributions to analyze
rankings of a finite set of items. The present work introduces a Bayesian finite mixture of Plackett–Luce
models to account for unobserved sample heterogeneity of partially ranked data. We describe an efficient
way to incorporate the latent group structure in the data augmentation approach and the derivation of existing
maximum likelihood procedures as special instances of the proposed Bayesian method. Inference can
be conducted with the combination of the Expectation-Maximization algorithm for maximum a posteriori
estimation and the Gibbs sampling iterative procedure. We additionally investigate several Bayesian criteria
for selecting the optimal mixture configuration and describe diagnostic tools for assessing the fitness of
ranking distributions conditionally and unconditionally on the number of ranked items. The utility of the
novel Bayesian parametric Plackett–Luce mixture for characterizing sample heterogeneity is illustrated
with several applications to simulated and real preference ranked data. We compare our method with the
frequentist approach and a Bayesian nonparametric mixture model both assuming the Plackett–Luce model
as a mixture component. Our analysis on real datasets reveals the importance of an accurate diagnostic
check for an appropriate in-depth understanding of the heterogeneous nature of the partial ranking data
Bayesian modelling and analysis of ranked data
PhD Thesis
Ranked data are central to many applications in science and social science and arise when
rankers (individuals) use some criterion to order a set of entities. Such rankings are
therefore equivalent to permutations of the elements of a set. The majority of models
for ranked data rely on a strong assumption of homogeneity, such as all rankers sharing
the same view on preferences of the entities. The aim of this thesis is to develop a richer
class of models which can reveal any plausible subgroup structure within the data both
for rankers and entities.
We begin by looking at the Plackett–Luce model, an extension of the Bradley–Terry model
for paired comparisons. First this model is extended to cater for when rankers do not report
a full ranking of all entities. For example, they might only report their top five ranked
entities after seeing some or all entities. Another issue is that most work in this area
assumes that all rankers are equally informed about the entities they are ranking. Often
this assumption will be questionable and so we develop a model which allows rankers to
have differing reliability. This model, the Weighted Plackett–Luce model, allows for such
heterogeneity through a novel two component mixture model defined by augmenting the
Plackett–Luce model.
The idea that rankers may be heterogeneous in their beliefs about entities is not new.
However, there might be groups of rankers with each group sharing the same view about
entities. Generally the number of such groups will not be known and so we investigate
the possibility of such group structure by using a Dirichlet process mixture of Weighted
Plackett–Luce models. It can also be useful to assess whether some entities are exchangeable, that is, whether there is also entity clustering within each ranker group, an issue
that has received little attention in the literature. We extend the model further to explore
both ranker and entity clustering by adapting the Nested Dirichlet process. The resulting
model is a Weighted Adapted Nested Dirichlet (WAND) process mixture of Plackett–Luce
models. Posterior inference is conducted via a simple and efficient Gibbs sampling scheme.
The richness of information in the posterior distribution allows for inference about many
aspects of the clustering structure both between ranker groups and between entity groups
(within ranker groups), in contrast to many other (Bayesian) analyses. The methodology
is illustrated using several simulation studies and real data examples.
Finally, we relax the assumption of a known ranking process underpinning these models
by looking at the recently developed Extended Plackett–Luce model. This model allows
inference for the order in which a homogeneous set of rankers assign entities to ranks.
Analysis of this model is challenging but we have found that using Metropolis coupled
Markov chain Monte Carlo (MC³) methods can provide adequate mixing over the high
dimensional space of all possible permutations when the number of entities is not small
Rank-based Bayesian clustering via covariate-informed Mallows mixtures
Data in the form of rankings, ratings, pair comparisons or clicks are
frequently collected in diverse fields, from marketing to politics, to
understand assessors' individual preferences. Combining such preference data
with features associated with the assessors can lead to a better understanding
of the assessors' behaviors and choices. The Mallows model is a popular model
for rankings, as it flexibly adapts to different types of preference data, and
the previously proposed Bayesian Mallows Model (BMM) offers a computationally
efficient framework for Bayesian inference, while also capturing the users'
heterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite
mixture model that performs clustering while also accounting for
assessor-related features, called the Bayesian Mallows model with covariates
(BMMx). BMMx is based on a similarity function that a priori favours the
aggregation of assessors into a cluster when their covariates are similar,
using the Product Partition models (PPMx) proposal. We present two approaches
to measure the covariate similarity: one based on a novel deterministic
function measuring the covariates' goodness-of-fit to the cluster, and one
based on an augmented model as in PPMx. We investigate the performance of BMMx
in both simulation experiments and real-data examples, showing the method's
potential for advancing the understanding of assessor preferences and behaviors
in different applications
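
For readers unfamiliar with the Mallows model mentioned in the abstract above, the sketch below evaluates the probability of a full ranking given a consensus ranking, using the Kendall distance and its known closed-form normalising constant. This is an illustration of the basic model only: the parameterisation exp(-theta * d) and the names `kendall_distance` and `mallows_log_prob` are assumptions for this sketch, and it does not reproduce BMM, BMMx, or the covariate similarity functions.

```python
import math
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Number of item pairs that two full rankings (best first) order differently."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def mallows_log_prob(ranking, consensus, theta):
    """Log-probability of `ranking` under a Kendall-distance Mallows model.

    theta > 0 is a concentration parameter (assumed parameterisation).
    """
    n = len(consensus)
    # Closed-form normaliser for the Kendall distance:
    # Z = prod_{j=1..n} (1 - exp(-j*theta)) / (1 - exp(-theta))
    log_z = sum(
        math.log((1.0 - math.exp(-j * theta)) / (1.0 - math.exp(-theta)))
        for j in range(1, n + 1)
    )
    return -theta * kendall_distance(ranking, consensus) - log_z


# Sanity check on a toy consensus: probabilities over all rankings sum to one.
consensus = ["a", "b", "c", "d"]
total = sum(
    math.exp(mallows_log_prob(list(p), consensus, 0.7))
    for p in permutations(consensus)
)
```

The consensus ranking itself has distance zero and is therefore the mode; larger theta concentrates more mass around it.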
Gamma Processes, Stick-Breaking, and Variational Inference
While most Bayesian nonparametric models in machine learning have focused on
the Dirichlet process, the beta process, or their variants, the gamma process
has recently emerged as a useful nonparametric prior in its own right. Current
inference schemes for models involving the gamma process are restricted to
MCMC-based methods, which limits their scalability. In this paper, we present a
variational inference framework for models involving gamma process priors. Our
approach is based on a novel stick-breaking constructive definition of the
gamma process. We prove correctness of this stick-breaking process by using the
characterization of the gamma process as a completely random measure (CRM), and
we explicitly derive the rate measure of our construction using Poisson process
machinery. We also derive error bounds on the truncation of the infinite
process required for variational inference, similar to the truncation analyses
for other nonparametric models based on the Dirichlet and beta processes. Our
representation is then used to derive a variational inference algorithm for a
particular Bayesian nonparametric latent structure formulation known as the
infinite Gamma-Poisson model, where the latent variables are drawn from a gamma
process prior with Poisson likelihoods. Finally, we present results for our
algorithms on nonnegative matrix factorization tasks on document corpora, and
show that we compare favorably to both sampling-based techniques and
variational approaches based on beta-Bernoulli priors
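
The abstract above compares its truncation analysis to that of the Dirichlet process. As background only, the sketch below shows the familiar truncated stick-breaking construction for Dirichlet process weights, where the mass left over after truncation is the quantity such error bounds control. It is explicitly not the paper's gamma process construction; the function name `stick_breaking_weights` and its arguments are assumptions for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process (not a gamma process).

    alpha: concentration parameter (> 0).
    truncation: number of sticks to break.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (weights, leftover), where leftover is the mass ignored by truncation.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
```

The leftover mass shrinks geometrically in expectation as the truncation level grows, which is why a finite truncation can be a reasonable approximation in variational schemes.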
Modeling heterogeneity in ranked responses by nonparametric maximum likelihood: How do Europeans get their scientific knowledge?
This paper is motivated by a Eurobarometer survey on science knowledge. As part of the survey, respondents were asked to rank sources of science information in order of importance. The official statistical analysis of these data however failed to use the complete ranking information. We instead propose a method which treats ranked data as a set of paired comparisons which places the problem in the standard framework of generalized linear models and also allows respondent covariates to be incorporated. An extension is proposed to allow for heterogeneity in the ranked responses. The resulting model uses a nonparametric formulation of the random effects structure, fitted using the EM algorithm. Each mass point is multivalued, with a parameter for each item. The resultant model is equivalent to a covariate latent class model, where the latent class profiles are provided by the mass point components and the covariates act on the class profiles. This provides an alternative interpretation of the fitted model. The approach is also suitable for paired comparison data
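
The step of treating a ranking as a set of paired comparisons can be illustrated as follows. This is a minimal sketch of the data transformation only, assuming that each higher-ranked item "wins" against every lower-ranked one; the function name `ranking_to_pairs` and the example item labels are hypothetical, and the model fitting described in the abstract is not shown.

```python
from itertools import combinations


def ranking_to_pairs(ranking):
    """Return the (winner, loser) pairs implied by a ranking listed best first."""
    # combinations() preserves input order, so the first element of each pair
    # is always the item ranked higher.
    return list(combinations(ranking, 2))


# Hypothetical respondent ranking of three information sources.
pairs = ranking_to_pairs(["tv", "press", "web"])
```

A ranking of n items yields n(n-1)/2 such pairs, each of which can then be treated as a binary response in a paired-comparison model.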
BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data
We perform differential expression analysis of high-throughput sequencing
count data under a Bayesian nonparametric framework, removing sophisticated
ad-hoc pre-processing steps commonly required in existing algorithms. We
propose to use the gamma (beta) negative binomial process, which takes into
account different sequencing depths using sample-specific negative binomial
probability (dispersion) parameters, to detect differentially expressed genes
by comparing the posterior distributions of gene-specific negative binomial
dispersion (probability) parameters. These model parameters are inferred by
borrowing statistical strength across both the genes and samples. Extensive
experiments on both simulated and real-world RNA sequencing count data show
that the proposed differential expression analysis algorithms clearly
outperform previously proposed ones in terms of the areas under both the
receiver operating characteristic and precision-recall curves. Comment: To appear in Journal of the American Statistical Association