Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
Although fully generative models have been successfully used to model the
contents of text documents, they are often awkward to apply to combinations of
text data and document metadata. In this paper we propose a
Dirichlet-multinomial regression (DMR) topic model that includes a log-linear
prior on document-topic distributions that is a function of observed features
of the document, such as author, publication venue, references, and dates. We
show that by selecting appropriate features, DMR topic models can meet or
exceed the performance of several previously published topic models designed
for specific data.

Comment: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2008).
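The mechanism behind the prior is log-linear: the Dirichlet hyperparameter for topic t in document d is alpha_{d,t} = exp(x_d . lambda_t), where x_d is the document's observed feature vector. A minimal NumPy sketch of that step (the feature matrix X, weights Lambda, and all sizes below are invented for illustration; the paper learns Lambda during inference):

```python
# Sketch of the DMR log-linear prior (illustrative only; X, Lambda,
# and all dimensions are invented, and Lambda is random here rather
# than learned as in the paper).
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_features, num_topics = 5, 3, 4

# x_d: observed binary metadata features per document (author, venue, ...).
X = rng.integers(0, 2, size=(num_docs, num_features))

# lambda_t: one weight vector per topic; the full model learns these.
Lambda = rng.normal(scale=0.5, size=(num_topics, num_features))

# Log-linear prior: alpha_{d,t} = exp(x_d . lambda_t).
alpha = np.exp(X @ Lambda.T)                      # shape (num_docs, num_topics)

# Each document draws topic proportions from its own Dirichlet prior.
theta = np.vstack([rng.dirichlet(a) for a in alpha])
print(theta.round(3))
```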
Sparse Stochastic Inference for Latent Dirichlet Allocation
We present a hybrid algorithm for Bayesian topic models that combines the
efficiency of sparse Gibbs sampling with the scalability of online stochastic
inference. We used our algorithm to analyze a corpus of 1.2 million books (33
billion words) with thousands of topics. Our approach reduces the bias of
variational inference and generalizes to many Bayesian hidden-variable models.

Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
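A rough sketch of the hybrid update, under strong simplifying assumptions (toy corpus, one document per minibatch, symmetric priors, a standard decaying step size; the paper's sparse bookkeeping is not reproduced): local topic assignments come from collapsed Gibbs sweeps, and the global topic-word statistics take an online stochastic step, as in stochastic variational inference:

```python
# Toy hybrid of Gibbs sampling and online stochastic updates (illustrative;
# corpus, priors, and step-size schedule are invented, and the paper's
# sparse data structures are not reproduced).
import numpy as np

rng = np.random.default_rng(1)
V, K, alpha, beta = 50, 5, 0.1, 0.01              # vocab, topics, priors
lam = np.ones((K, V))                             # global topic-word statistics
docs = [rng.integers(0, V, size=20) for _ in range(100)]  # toy corpus

for t, doc in enumerate(docs):                    # one document per "minibatch"
    z = rng.integers(0, K, size=len(doc))         # initial topic assignments
    n_dk = np.bincount(z, minlength=K).astype(float)
    for _ in range(5):                            # local collapsed Gibbs sweeps
        for i, w in enumerate(doc):
            n_dk[z[i]] -= 1
            p = (n_dk + alpha) * (lam[:, w] + beta) / (lam.sum(1) + V * beta)
            z[i] = rng.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1
    # Online step: blend global statistics toward the minibatch estimate,
    # rescaled to corpus size, with a decaying learning rate.
    rho = (t + 10) ** -0.7
    lam_hat = np.zeros_like(lam)
    for i, w in enumerate(doc):
        lam_hat[z[i], w] += len(docs)
    lam = (1 - rho) * lam + rho * lam_hat
```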
Expertise Modeling for Matching Papers with Reviewers
An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more "personas," which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts and disambiguates author mentions from documents gathered from the web. We evaluate the models using a reviewer matching task based on human relevance judgments determining how well the expertise of proposed reviewers matches a submission. We find that the APT topic model outperforms the other models.
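A hypothetical generative sketch of the persona mechanism described above: each author owns a few personas, each persona is an independent distribution over topics, and a paper is generated under exactly one persona. Every size, prior, and helper name below is made up for illustration:

```python
# Hypothetical generative sketch of APT (all sizes, priors, and the
# generate_paper helper are invented for illustration).
import numpy as np

rng = np.random.default_rng(2)
K, V = 8, 100                                     # topics, vocabulary size
phi = rng.dirichlet(np.full(V, 0.01), size=K)     # topic-word distributions

def generate_paper(personas, doc_len=50):
    """personas: list of topic distributions owned by one author."""
    theta = personas[rng.integers(len(personas))] # the paper picks one persona
    z = rng.choice(K, size=doc_len, p=theta)      # a topic for each word
    return np.array([rng.choice(V, p=phi[k]) for k in z])

# An author with three independent personas, each a distribution over topics.
author_personas = [rng.dirichlet(np.full(K, 0.1)) for _ in range(3)]
paper = generate_paper(author_personas)
```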
Mixtures of Hierarchical Topics with Pachinko Allocation
The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.
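A simplified generative sketch of such a hierarchy, assuming three levels (a root, super-topics, and sub-topics) and a per-word choice of emission level so that shared vocabulary can live above the leaves; all sizes, priors, and level-mixing weights below are invented:

```python
# Simplified generative sketch of a three-level topic hierarchy
# (all sizes, priors, and the level-mixing weights are invented).
import numpy as np

rng = np.random.default_rng(3)
V, S, T = 100, 3, 6                               # vocab, super-, sub-topics
phi_root = rng.dirichlet(np.full(V, 0.01))        # shared root vocabulary
phi_super = rng.dirichlet(np.full(V, 0.01), size=S)
phi_sub = rng.dirichlet(np.full(V, 0.01), size=T)

def generate_doc(doc_len=40):
    theta_s = rng.dirichlet(np.full(S, 0.5))          # over super-topics
    theta_t = rng.dirichlet(np.full(T, 0.5), size=S)  # sub-topics per super
    words = []
    for _ in range(doc_len):
        s = rng.choice(S, p=theta_s)              # path through the DAG
        t = rng.choice(T, p=theta_t[s])
        level = rng.choice(3, p=[0.2, 0.3, 0.5])  # emit at root/super/sub
        dist = (phi_root, phi_super[s], phi_sub[t])[level]
        words.append(rng.choice(V, p=dist))
    return np.array(words)

doc = generate_doc()
```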