Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
Although fully generative models have been successfully used to model the
contents of text documents, they are often awkward to apply to combinations of
text data and document metadata. In this paper we propose a
Dirichlet-multinomial regression (DMR) topic model that includes a log-linear
prior on document-topic distributions that is a function of observed features
of the document, such as author, publication venue, references, and dates. We
show that by selecting appropriate features, DMR topic models can meet or
exceed the performance of several previously published topic models designed
for specific data.

Comment: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2008).
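The mechanism behind the prior is log-linear: the Dirichlet hyperparameter for topic t in document d is alpha_{d,t} = exp(x_d . lambda_t), where x_d is the document's observed feature vector. A minimal NumPy sketch of that step (the feature matrix X, weights Lambda, and all sizes below are invented for illustration; the paper learns Lambda during inference):

```python
# Sketch of the DMR log-linear prior (illustrative only; X, Lambda,
# and all dimensions are invented, and Lambda is random here rather
# than learned as in the paper).
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_features, num_topics = 5, 3, 4

# x_d: observed binary metadata features per document (author, venue, ...).
X = rng.integers(0, 2, size=(num_docs, num_features))

# lambda_t: one weight vector per topic; the full model learns these.
Lambda = rng.normal(scale=0.5, size=(num_topics, num_features))

# Log-linear prior: alpha_{d,t} = exp(x_d . lambda_t).
alpha = np.exp(X @ Lambda.T)                      # shape (num_docs, num_topics)

# Each document draws topic proportions from its own Dirichlet prior.
theta = np.vstack([rng.dirichlet(a) for a in alpha])
print(theta.round(3))
```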
Sparse Stochastic Inference for Latent Dirichlet Allocation
We present a hybrid algorithm for Bayesian topic models that combines the
efficiency of sparse Gibbs sampling with the scalability of online stochastic
inference. We used our algorithm to analyze a corpus of 1.2 million books (33
billion words) with thousands of topics. Our approach reduces the bias of
variational inference and generalizes to many Bayesian hidden-variable models.

Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
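A rough sketch of the hybrid update, under strong simplifying assumptions (toy corpus, one document per minibatch, symmetric priors, a standard decaying step size; the paper's sparse bookkeeping is not reproduced): local topic assignments come from collapsed Gibbs sweeps, and the global topic-word statistics take an online stochastic step, as in stochastic variational inference:

```python
# Toy hybrid of Gibbs sampling and online stochastic updates (illustrative;
# corpus, priors, and step-size schedule are invented, and the paper's
# sparse data structures are not reproduced).
import numpy as np

rng = np.random.default_rng(1)
V, K, alpha, beta = 50, 5, 0.1, 0.01              # vocab, topics, priors
lam = np.ones((K, V))                             # global topic-word statistics
docs = [rng.integers(0, V, size=20) for _ in range(100)]  # toy corpus

for t, doc in enumerate(docs):                    # one document per "minibatch"
    z = rng.integers(0, K, size=len(doc))         # initial topic assignments
    n_dk = np.bincount(z, minlength=K).astype(float)
    for _ in range(5):                            # local collapsed Gibbs sweeps
        for i, w in enumerate(doc):
            n_dk[z[i]] -= 1
            p = (n_dk + alpha) * (lam[:, w] + beta) / (lam.sum(1) + V * beta)
            z[i] = rng.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1
    # Online step: blend global statistics toward the minibatch estimate,
    # rescaled to corpus size, with a decaying learning rate.
    rho = (t + 10) ** -0.7
    lam_hat = np.zeros_like(lam)
    for i, w in enumerate(doc):
        lam_hat[z[i], w] += len(docs)
    lam = (1 - rho) * lam + rho * lam_hat
```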
Expertise Modeling for Matching Papers with Reviewers
An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more "personas," which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts and disambiguates author mentions from documents gathered from the web. We evaluate the models using a reviewer matching task based on human relevance judgments determining how well the expertise of proposed reviewers matches a submission. We find that the APT topic model outperforms the other models.
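A hypothetical generative sketch of the persona mechanism described above: each author owns a few personas, each persona is an independent distribution over topics, and a paper is generated under exactly one persona. Every size, prior, and helper name below is made up for illustration:

```python
# Hypothetical generative sketch of APT (all sizes, priors, and the
# generate_paper helper are invented for illustration).
import numpy as np

rng = np.random.default_rng(2)
K, V = 8, 100                                     # topics, vocabulary size
phi = rng.dirichlet(np.full(V, 0.01), size=K)     # topic-word distributions

def generate_paper(personas, doc_len=50):
    """personas: list of topic distributions owned by one author."""
    theta = personas[rng.integers(len(personas))] # the paper picks one persona
    z = rng.choice(K, size=doc_len, p=theta)      # a topic for each word
    return np.array([rng.choice(V, p=phi[k]) for k in z])

# An author with three independent personas, each a distribution over topics.
author_personas = [rng.dirichlet(np.full(K, 0.1)) for _ in range(3)]
paper = generate_paper(author_personas)
```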
Mixtures of Hierarchical Topics with Pachinko Allocation
The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.
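A simplified generative sketch of such a hierarchy, assuming three levels (a root, super-topics, and sub-topics) and a per-word choice of emission level so that shared vocabulary can live above the leaves; all sizes, priors, and level-mixing weights below are invented:

```python
# Simplified generative sketch of a three-level topic hierarchy
# (all sizes, priors, and the level-mixing weights are invented).
import numpy as np

rng = np.random.default_rng(3)
V, S, T = 100, 3, 6                               # vocab, super-, sub-topics
phi_root = rng.dirichlet(np.full(V, 0.01))        # shared root vocabulary
phi_super = rng.dirichlet(np.full(V, 0.01), size=S)
phi_sub = rng.dirichlet(np.full(V, 0.01), size=T)

def generate_doc(doc_len=40):
    theta_s = rng.dirichlet(np.full(S, 0.5))          # over super-topics
    theta_t = rng.dirichlet(np.full(T, 0.5), size=S)  # sub-topics per super
    words = []
    for _ in range(doc_len):
        s = rng.choice(S, p=theta_s)              # path through the DAG
        t = rng.choice(T, p=theta_t[s])
        level = rng.choice(3, p=[0.2, 0.3, 0.5])  # emit at root/super/sub
        dist = (phi_root, phi_super[s], phi_sub[t])[level]
        words.append(rng.choice(V, p=dist))
    return np.array(words)

doc = generate_doc()
```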