Temporal Topic Analysis with Endogenous and Exogenous Processes
We consider the problem of modeling temporal textual data taking endogenous
and exogenous processes into account. Such text documents arise in real world
applications, including job advertisements and economic news articles, which
are influenced by the fluctuations of the general economy. We propose a
hierarchical Bayesian topic model which imposes a "group-correlated"
hierarchical structure on the evolution of topics over time incorporating both
processes, and show that this model can be estimated with Markov chain Monte
Carlo sampling methods. We further demonstrate that this model captures the
intrinsic relationships between the topic distribution and the time-dependent
factors, and compare its performance with latent Dirichlet allocation (LDA) and
two other related models. The model is applied to two collections of documents
to illustrate its empirical performance: online job advertisements from
DirectEmployers Association and journalists' postings on BusinessInsider.com.
Leveraging Node Attributes for Incomplete Relational Data
Relational data are usually highly incomplete in practice, which inspires us
to leverage side information to improve the performance of community detection
and link prediction. This paper presents a Bayesian probabilistic approach that
incorporates various kinds of node attributes encoded in binary form in
relational models with Poisson likelihood. Our method works flexibly with both
directed and undirected relational networks. The inference can be done by
efficient Gibbs sampling which leverages sparsity of both networks and node
attributes. Extensive experiments show that our models achieve
state-of-the-art link prediction results, especially with highly incomplete
relational data.
Comment: Appearing in ICML 201
Scalable Inference of Customer Similarities from Interactions Data using Dirichlet Processes
Under the sociological theory of homophily, people who are similar to one
another are more likely to interact with one another. Marketers often have
access to data on interactions among customers from which, with homophily as a
guiding principle, inferences could be made about the underlying similarities.
However, larger networks face a quadratic explosion in the number of potential
interactions that need to be modeled. This scalability problem renders
probability models of social interactions computationally infeasible for all
but the smallest networks. In this paper we develop a probabilistic framework
for modeling customer interactions that is both grounded in the theory of
homophily and flexible enough to account for random variation in who
interacts with whom. In particular, we present a novel Bayesian nonparametric
approach, using Dirichlet processes, to moderate the scalability problems that
marketing researchers encounter when working with networked data. We find that
this framework is a powerful way to draw insights into latent similarities of
customers, and we discuss how marketers can apply these insights to
segmentation and targeting activities.
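The "quadratic explosion" mentioned in the abstract above can be made concrete with a small count. This is only an illustrative sketch: the function name `count_dyads` is hypothetical and not taken from the paper, and it assumes undirected interactions with no self-pairs.

```python
def count_dyads(n_customers: int) -> int:
    """Number of unordered customer pairs (potential interactions).

    Assumes undirected interactions and no self-pairs, so the count is
    n * (n - 1) / 2, which grows quadratically in n.
    """
    return n_customers * (n_customers - 1) // 2


if __name__ == "__main__":
    # Print how the number of potential pairs grows with network size.
    for n in (10, 1_000, 100_000):
        print(n, count_dyads(n))
```

Under these assumptions, a tenfold increase in customers yields roughly a hundredfold increase in pairs that a dyad-level probability model would need to cover.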
Mixed membership stochastic blockmodels
Observations consisting of measurements on relationships for pairs of objects
arise in many settings, such as protein interaction and gene regulatory
networks, collections of author-recipient email, and social networks. Analyzing
such data with probabilistic models can be delicate because the simple
exchangeability assumptions underlying many boilerplate models no longer hold.
In this paper, we describe a latent variable model of such data called the
mixed membership stochastic blockmodel. This model extends blockmodels for
relational data to ones which capture mixed membership latent relational
structure, thus providing an object-specific low-dimensional representation. We
develop a general variational inference algorithm for fast approximate
posterior inference. We explore applications to social and protein interaction
networks.
Comment: 46 pages, 14 figures, 3 table
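As a rough illustration of what "mixed membership" means for relational data, the sketch below simulates a small directed network in which each node has its own membership vector and picks a (possibly different) block for every interaction. It is a minimal sketch of the generative idea only, not the paper's variational inference algorithm; the names `sample_dirichlet` and `sample_mmsb`, the parameter values, and the use of the standard-library `random` module are all assumptions made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mmsb(n_nodes, alpha, block_matrix, seed=0):
    """Simulate memberships and a directed 0/1 adjacency matrix.

    alpha        -- Dirichlet parameters, one per latent block (assumed values)
    block_matrix -- block_matrix[g][h] is the link probability from block g to h
    """
    rng = random.Random(seed)
    k = len(alpha)
    # One mixed-membership vector per node.
    memberships = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-links in this sketch
            # Each node picks a block for this particular interaction.
            z_send = rng.choices(range(k), weights=memberships[p])[0]
            z_recv = rng.choices(range(k), weights=memberships[q])[0]
            link_prob = block_matrix[z_send][z_recv]
            adjacency[p][q] = 1 if rng.random() < link_prob else 0
    return memberships, adjacency


if __name__ == "__main__":
    pis, adj = sample_mmsb(6, [0.5, 0.5], [[0.9, 0.05], [0.05, 0.9]], seed=1)
    for row in adj:
        print(row)
```

Because the block indicator is redrawn for every pair, a node can behave like one block toward some partners and like another block toward others, which is the behavior a single-membership blockmodel cannot express.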
Hierarchical relational models for document networks
We develop the relational topic model (RTM), a hierarchical model of both
network structure and node attributes. We focus on document networks, where the
attributes of each document are its words, that is, discrete observations taken
from a fixed vocabulary. For each pair of documents, the RTM models their link
as a binary random variable that is conditioned on their contents. The model
can be used to summarize a network of documents, predict links between them,
and predict words within them. We derive efficient inference and estimation
algorithms based on variational methods that take advantage of sparsity and
scale with the number of links. We evaluate the predictive performance of the
RTM for large networks of scientific abstracts, web documents, and
geographically tagged news.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS309 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Bayesian nonparametric Plackett-Luce models for the analysis of preferences for college degree programmes
In this paper we propose a Bayesian nonparametric model for clustering
partial ranking data. We start by developing a Bayesian nonparametric extension
of the popular Plackett-Luce choice model that can handle an infinite number of
choice items. Our framework is based on the theory of random atomic measures,
with the prior specified by a completely random measure. We characterise the
posterior distribution given data, and derive a simple and effective Gibbs
sampler for posterior simulation. We then develop a Dirichlet process mixture
extension of our model and apply it to investigate the clustering of
preferences for college degree programmes amongst Irish secondary school
graduates. The existence of clusters of applicants who have similar preferences
for degree programmes is established and we determine that subject matter and
geographical location of the third level institution characterise these
clusters.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS717 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
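For readers unfamiliar with the Plackett-Luce choice model that the abstract above extends, the following sketch shows the standard finite version: items are chosen one at a time with probability proportional to their weights among the items not yet chosen, which also gives the probability of a partial (top-m) ranking. This is not the paper's nonparametric extension or its Gibbs sampler; the function names and the example weights are hypothetical.

```python
import math
import random


def plackett_luce_log_prob(ranking, weights):
    """Log-probability of a (possibly partial) ranking under Plackett-Luce.

    ranking -- list of distinct items, most preferred first
    weights -- dict mapping every item to a positive weight
    """
    remaining = sum(weights.values())  # total weight of items not yet chosen
    log_prob = 0.0
    for item in ranking:
        log_prob += math.log(weights[item] / remaining)
        remaining -= weights[item]
    return log_prob


def sample_ranking(weights, top_m, rng):
    """Sample a top-m ranking by repeated weighted choice without replacement."""
    items = list(weights)
    ranking = []
    for _ in range(top_m):
        chosen = rng.choices(items, weights=[weights[i] for i in items])[0]
        ranking.append(chosen)
        items.remove(chosen)
    return ranking


if __name__ == "__main__":
    w = {"a": 1.0, "b": 1.0, "c": 2.0}  # illustrative weights only
    print(math.exp(plackett_luce_log_prob(["c", "a"], w)))
    print(sample_ranking(w, 2, random.Random(0)))
```

With the illustrative weights above, the top-2 ranking (c, a) has probability (2/4) × (1/2): "c" is chosen first from a total weight of 4, then "a" from the remaining weight of 2.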
The Discrete Infinite Logistic Normal Distribution
We present the discrete infinite logistic normal distribution (DILN), a
Bayesian nonparametric prior for mixed membership models. DILN is a
generalization of the hierarchical Dirichlet process (HDP) that models
correlation structure between the weights of the atoms at the group level. We
derive a representation of DILN as a normalized collection of gamma-distributed
random variables, and study its statistical properties. We consider
applications to topic modeling and derive a variational inference algorithm for
approximate posterior inference. We study the empirical performance of the DILN
topic model on four corpora, comparing performance with the HDP and the
correlated topic model (CTM). To deal with large-scale data sets, we also
develop an online inference algorithm for DILN and compare with online HDP and
online LDA on the Nature magazine corpus, which contains approximately 350,000
articles.
Comment: This paper will appear in Bayesian Analysis. A shorter version of
this paper appeared at AISTATS 2011, Fort Lauderdale, FL, US