59,596 research outputs found

    Syntactic Topic Models

    Full text link
    The syntactic topic model (STM) is a Bayesian nonparametric model of language that discovers latent distributions of words (topics) that are both semantically and syntactically coherent. The STM models dependency-parsed corpora where sentences are grouped into documents. It assumes that each word is drawn from a latent topic chosen by combining document-level features and the local syntactic context. Each document has a distribution over latent topics, as in topic models, which provides the semantic consistency. Each element in the dependency parse tree also has a distribution over the topics of its children, as in latent-state syntax models, which provides the syntactic consistency. These distributions are convolved so that the topic of each word is likely under both its document and its syntactic context. We derive a fast posterior inference algorithm based on variational methods. We report qualitative and quantitative studies on both synthetic data and hand-parsed documents. We show that the STM is a more predictive model of language than current models based only on syntax or only on topics.
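    The combination step described above can be illustrated with a minimal sketch: a word's topic distribution is taken as the renormalized elementwise product of its document's topic proportions and its parse-tree parent's distribution over child topics. The array names and the exact product form are illustrative assumptions, not the paper's inference algorithm.

```python
import numpy as np

def word_topic_distribution(doc_topics, parent_child_topics):
    """Combine semantic (document) and syntactic (parent) preferences.

    Both arguments are probability vectors over the same K topics; the
    renormalized elementwise product is high only for topics likely under
    *both* contexts, mirroring the abstract's convolution idea (the exact
    form here is an assumption for illustration).
    """
    combined = doc_topics * parent_child_topics
    return combined / combined.sum()

# Toy example with K = 4 topics.
doc_topics = np.array([0.5, 0.3, 0.1, 0.1])           # document context
parent_child_topics = np.array([0.1, 0.6, 0.2, 0.1])  # syntactic context
print(word_topic_distribution(doc_topics, parent_child_topics))
```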

    Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

    Get PDF
    Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping outcomes interpretable. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging: individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarize the entire posterior distribution and to identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, together with associated measures of uncertainty. Furthermore, we establish a more holistic definition of model evaluation, which assesses topic models not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that selecting recurrent topics through our clustering methodology not only improves model likelihood but also improves the interpretability and stability of the resulting topics. We illustrate our methods on an example from a large UK supermarket chain.
    Comment: 20 pages, 9 figures
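    A rough sketch of the post-processing step, assuming several posterior draws of topic-word matrices are already available; pooling topics across draws and clustering them by cosine distance is an illustrative choice here, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def recurrent_topics(topic_draws, distance_threshold=0.5):
    """Summarize posterior LDA draws by clustering pooled topics.

    topic_draws: list of (K, V) topic-word probability matrices, one per
    posterior draw. Topics that recur across draws land in the same
    cluster; each cluster's centroid serves as one summary topic.
    (The clustering choice is an assumption for illustration.)
    """
    pooled = np.vstack(topic_draws)  # shape (n_draws * K, V)
    labels = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",            # named `affinity` in scikit-learn < 1.2
        linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(pooled)
    summaries = []
    for c in np.unique(labels):
        centroid = pooled[labels == c].mean(axis=0)
        summaries.append(centroid / centroid.sum())
    return np.array(summaries), labels
```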

    Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence?

    Get PDF
    Researchers in information science and related areas have developed various methods for analyzing textual data, such as survey responses. This article describes the application of analysis methods from two distinct fields, one from interpretive social science and one from statistical machine learning, to the same survey data. The results show that the two analyses produce some similar and some complementary insights about the phenomenon of interest, in this case, nonuse of social media. We compare both the processes of conducting these analyses and the results they produce to derive insights about each method's unique advantages and drawbacks, as well as the broader roles that these methods play in the fields where they are often used. These insights allow us to make more informed decisions about the tradeoffs in choosing different methods for analyzing textual data. Furthermore, this comparison suggests how such methods might be combined in novel and compelling ways.

    Ordering-sensitive and Semantic-aware Topic Modeling

    Full text link
    Topic modeling of textual corpora is an important and challenging problem. Most previous work makes the "bag-of-words" assumption, which ignores the ordering of words. This assumption simplifies computation, but it discards the ordering information and the semantics of words in context. In this paper, we present a Gaussian Mixture Neural Topic Model (GMNTM) which incorporates both the ordering of words and the semantic meaning of sentences into topic modeling. Specifically, we represent each topic as a cluster of multi-dimensional vectors and embed the corpus into a collection of vectors generated by the Gaussian mixture model. Each word is affected not only by its topic but also by the embedding vectors of its surrounding words and the context. The Gaussian mixture components and the topics of documents, sentences and words can be learnt jointly. Extensive experiments show that our model learns better topics and more accurate word distributions for each topic. Quantitatively, compared to state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy.
    Comment: To appear in proceedings of AAAI 2015
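    The full GMNTM learns word embeddings and the mixture jointly; the sketch below only illustrates the "topics as Gaussian clusters in embedding space" view, assuming pre-computed word vectors (random stand-ins here) rather than the paper's joint training procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for real word embeddings (GMNTM learns these jointly with
# the mixture; using fixed random vectors is an assumption here).
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 50))

K = 10  # number of topics, i.e. Gaussian mixture components
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(word_vectors)

# Soft topic assignments: p(topic | word vector) for each word.
word_topic_probs = gmm.predict_proba(word_vectors)  # shape (1000, K)
print(word_topic_probs[0].round(3))
```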

    Dirichlet belief networks for topic structure learning

    Full text link
    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning the word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on the word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model.
    Comment: Accepted in NIPS 2018
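    A minimal sketch of the layered generative process described above: each topic in a layer is drawn from a Dirichlet whose mean is a convex combination of the topics in the layer above. The hyperparameters and the way mixture weights are drawn are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(topics_above, n_topics, concentration=50.0):
    """Draw one layer of topic-word distributions from the layer above.

    Each new topic's Dirichlet mean is a random convex combination of
    the parent topics, matching the abstract's "mixture of the topics
    of the layer above" (weights and concentration are assumptions).
    """
    n_above, vocab = topics_above.shape
    layer = np.empty((n_topics, vocab))
    for k in range(n_topics):
        weights = rng.dirichlet(np.ones(n_above))       # mixture over parents
        mean = weights @ topics_above                   # mixed word distribution
        layer[k] = rng.dirichlet(concentration * mean)  # perturbed child topic
    return layer

vocab_size = 30
top = rng.dirichlet(np.ones(vocab_size), size=3)  # 3 root topics
mid = sample_layer(top, n_topics=6)               # 6 topics mixing the roots
bottom = sample_layer(mid, n_topics=12)           # 12 leaf topics
```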