A Gamma-Poisson Mixture Topic Model for Short Text
Most topic models are constructed under the assumption that documents follow
a multinomial distribution. The Poisson distribution is an alternative
distribution to describe the probability of count data. For topic modelling,
the Poisson distribution describes the number of occurrences of a word in
documents of fixed length. The Poisson distribution has been successfully
applied in text classification, but its application to topic modelling is not
well documented, specifically in the context of a generative probabilistic
model. Furthermore, the few Poisson topic models in the literature are admixture
models, making the assumption that a document is generated from a mixture of
topics. In this study, we focus on short text. Many studies have shown that the
simpler assumption of a mixture model fits short text better. With mixture
models, as opposed to admixture models, the generative assumption is that a
document is generated from a single topic. One topic model, which makes this
one-topic-per-document assumption, is the Dirichlet-multinomial mixture model.
The main contributions of this work are a new Gamma-Poisson mixture model, as
well as a collapsed Gibbs sampler for the model. The benefit of the collapsed
Gibbs sampler derivation is that the model is able to automatically select the
number of topics contained in the corpus. The results show that the
Gamma-Poisson mixture model performs better than the Dirichlet-multinomial
mixture model at selecting the number of topics in labelled corpora.
Furthermore, the Gamma-Poisson mixture produces better topic coherence scores
than the Dirichlet-multinomial mixture model, thus making it a viable option
for the challenging task of topic modelling of short text.
Comment: 26 pages, 14 figures, to be published in Mathematical Problems in Engineering
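As a rough illustration of the generative story described above (a minimal sketch, not the authors' implementation: the corpus sizes and the Gamma hyperparameters alpha and beta are assumed purely for illustration), the Gamma-Poisson mixture draws a single topic per document and then samples every word count in that document from the topic's Gamma-distributed Poisson rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumed, not taken from the paper).
n_docs, n_topics, vocab_size = 200, 5, 100
alpha, beta = 0.5, 1.0                   # Gamma shape and rate hyperparameters
pi = rng.dirichlet(np.ones(n_topics))    # mixing weights over topics

# Each topic holds one Gamma-distributed Poisson rate per vocabulary word.
rates = rng.gamma(shape=alpha, scale=1.0 / beta, size=(n_topics, vocab_size))

docs = []
for _ in range(n_docs):
    z = rng.choice(n_topics, p=pi)       # one topic per document (mixture, not admixture)
    docs.append(rng.poisson(rates[z]))   # word counts for the whole document
```

A collapsed Gibbs sampler for such a model would typically integrate out pi and rates analytically (using Dirichlet-categorical and Gamma-Poisson conjugacy) and resample only the per-document topic indicators z, which is what lets superfluous topics empty out and the number of topics be selected automatically.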
Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application
We present two novel models of document coherence and their application to
information retrieval (IR). Both models approximate document coherence using
discourse entities, e.g. the subject or object of a sentence. Our first model
views text as a Markov process generating sequences of discourse entities
(entity n-grams); we use the entropy of these entity n-grams to approximate the
rate at which new information appears in text, reasoning that as more new words
appear, the topic increasingly drifts and text coherence decreases. Our second
model extends the work of Guinaudeau & Strube [28] that represents text as a
graph of discourse entities, linked by different relations, such as their
distance or adjacency in text. We use several graph topology metrics to
approximate different aspects of the discourse flow that can indicate
coherence, such as the average clustering or betweenness of discourse entities
in text. Experiments with several instantiations of these models show that: (i)
our models perform on a par with two other well-known models of text coherence
even without any parameter tuning, and (ii) reranking retrieval results
according to their coherence scores gives notable performance gains, confirming
a relation between document coherence and relevance. This work contributes two
novel models of document coherence, the application of which to IR complements
recent work in the integration of document cohesiveness or comprehensibility to
ranking [5, 56].
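To make the entropy intuition concrete, the following minimal sketch (not the paper's estimator; the entity sequences and the bigram order are illustrative assumptions) computes the Shannon entropy of a document's entity n-grams, so that a document which keeps revisiting the same entities scores lower, read here as more coherent, than one that introduces a new entity in every sentence:

```python
from collections import Counter
from math import log2

def entity_ngram_entropy(entities, n=2):
    """Shannon entropy of the empirical distribution of entity n-grams.

    `entities` is the ordered sequence of discourse entities (e.g. sentence
    subjects and objects) extracted from a document.
    """
    ngrams = [tuple(entities[i:i + n]) for i in range(len(entities) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy comparison: repeated entities give lower entropy than a new entity per sentence.
focused = ["model", "model", "corpus", "model", "corpus"]
drifting = ["model", "corpus", "entropy", "graph", "retrieval"]
print(entity_ngram_entropy(focused), entity_ngram_entropy(drifting))  # 1.5 < 2.0
```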
Probabilistic approaches for modeling text structure and their application to text-to-text generation
Since the early days of generation research, it has been acknowledged that modeling the global structure of a document is crucial for producing coherent, readable output. However, traditional knowledge-intensive approaches have been of limited utility in addressing this problem since they cannot be effectively scaled to operate in domain-independent, large-scale applications. Due to this difficulty, existing text-to-text generation systems rarely rely on such structural information when producing an output text. Consequently, texts generated by these methods do not match the quality of those written by humans – they are often fraught with severe coherence violations and disfluencies.
In this chapter, I will present probabilistic models of document structure that can be effectively learned from raw document collections. This feature distinguishes these new models from traditional knowledge-intensive approaches used in symbolic concept-to-text generation. Our results demonstrate that these probabilistic models can be directly applied to content organization, and suggest that these models can prove useful in an even broader range of text-to-text applications than we have considered here.
National Science Foundation (U.S.) (CAREER grant IIS-0448168); Microsoft Research New Faculty Fellowship
Dirichlet belief networks for topic structure learning
Recently, considerable research effort has been devoted to developing deep
architectures for topic models to learn topic structures. Although several deep
models have been proposed to learn better topic proportions of documents, how
to leverage the benefits of deep structures for learning word distributions of
topics has not yet been rigorously studied. Here we propose a new multi-layer
generative process on word distributions of topics, where each layer consists
of a set of topics and each topic is drawn from a mixture of the topics of the
layer above. As the topics in all layers can be directly interpreted by words,
the proposed model is able to discover interpretable topic hierarchies. As a
self-contained module, our model can be flexibly adapted to different kinds of
topic models to improve their modelling accuracy and interpretability.
Extensive experiments on text corpora demonstrate the advantages of the
proposed model.
Comment: accepted in NIPS 2018
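To picture the layered construction, here is a minimal simulation sketch (the layer sizes, vocabulary size, and symmetric Dirichlet hyperparameters are assumed, and the paper's exact priors on the mixing weights may differ): every topic below the top layer draws its word distribution from a Dirichlet whose base measure is a mixture of the word distributions in the layer above, so each topic in each layer remains directly interpretable as a distribution over words.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 200
layer_sizes = [5, 20, 80]    # topics per layer, from top to bottom (assumed)
concentration = 50.0         # Dirichlet concentration for child topics (assumed)

# Top-layer topics: symmetric Dirichlet draws over the vocabulary.
layers = [rng.dirichlet(np.full(vocab_size, 0.1), size=layer_sizes[0])]

# Each child topic mixes the parent topics' word distributions and then draws
# its own word distribution centred on that mixture.
for size in layer_sizes[1:]:
    parents = layers[-1]
    weights = rng.dirichlet(np.ones(len(parents)), size=size)   # per-child mixing weights
    base = weights @ parents + 1e-8                             # small floor keeps parameters positive
    layers.append(np.array([rng.dirichlet(concentration * b) for b in base]))

# layers[-1] holds the bottom-layer topics that a topic model would use
# as its word distributions.
```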