On a Topic Model for Sentences
Probabilistic topic models are generative models that describe the content of
documents by discovering the latent topics underlying them. However, the
structure of the textual input, for instance the grouping of words into
coherent text spans such as sentences, carries much information that is
generally lost with these models. In this paper, we propose sentenceLDA, an
extension of LDA that aims to overcome this limitation by incorporating the
structure of the text into the generative and inference processes. We
illustrate the advantages of sentenceLDA by comparing it with LDA on both
intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on
different text collections.
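Perplexity, the intrinsic measure mentioned above, is the exponentiated negative average log-likelihood per held-out token. A minimal sketch of the computation, using an illustrative toy list of per-word probabilities rather than any model from the paper:

```python
import math

def perplexity(log_likelihood: float, num_tokens: int) -> float:
    """Perplexity = exp(-(held-out log-likelihood) / (token count)).
    Lower values mean the model predicts unseen text better."""
    return math.exp(-log_likelihood / num_tokens)

# Toy example: probabilities a trained model might assign to held-out words.
held_out_probs = [0.1, 0.05, 0.2, 0.1]
ll = sum(math.log(p) for p in held_out_probs)
print(perplexity(ll, len(held_out_probs)))  # 10.0 (geometric mean of 1/p)
```

In practice the log-likelihood comes from the trained topic model's predictive distribution over the held-out documents; the formula itself is the same.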
New topic detection in microblogs and topic model evaluation using topical alignment
This thesis deals with topic model evaluation and new topic detection in microblogs. Microblogs are short and thus may not carry any contextual clues, which makes it challenging to apply traditional natural language processing algorithms to such data. Graphical models have traditionally been used for topic discovery and text clustering on collections of text documents. Their unsupervised nature allows topic models to be trained easily on datasets for specific domains. However, the advantage of not requiring annotated data comes with a drawback: evaluation becomes difficult. The problem is aggravated when the data comprises microblogs, which are unstructured and noisy.
We demonstrate the application of three such models to microblogs: Latent Dirichlet Allocation, the Author-Topic model, and the Author-Recipient-Topic model. We extensively evaluate these models under different settings, and our results show that the Author-Recipient-Topic model extracts the most coherent topics. We also address the problem of topic modeling on short texts by using clustering techniques, which helps boost the performance of our models.
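One common clustering-style aggregation for short texts, sketched here as an illustrative assumption rather than the thesis's exact procedure, is to pool each author's tweets into a single pseudo-document before training, so the topic model sees longer documents with richer word co-occurrence:

```python
from collections import defaultdict

def pool_by_author(tweets):
    """Aggregate (author, text) pairs into one pseudo-document per author."""
    pooled = defaultdict(list)
    for author, text in tweets:
        pooled[author].append(text)
    return {author: " ".join(texts) for author, texts in pooled.items()}

tweets = [("alice", "topic models rock"),
          ("bob", "short texts are noisy"),
          ("alice", "lda on tweets")]
print(pool_by_author(tweets)["alice"])  # "topic models rock lda on tweets"
```

The pooled pseudo-documents can then be fed to any standard topic model in place of the individual tweets.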
Topical alignment is used for large-scale assessment of topical relevance by comparing topics to manually generated domain-specific concepts. In this thesis we use this idea to evaluate topic models by measuring misalignments between topics. Our study comparing topic models reveals interesting traits of Twitter messages, users, and their interactions, and establishes that jointly modeling author-recipient pairs and tweet content leads to qualitatively better topic discovery.
This thesis gives a new direction to the well-known problem of topic discovery in microblogs. Trend prediction and topic discovery for microblogs form an extensive research area. We propose the idea of using topical alignment to detect new topics by comparing topics from the current week to those of the previous week. We measure correspondence between a set of topics from the current week and a set of topics from the previous week to quantify several types of misalignment: \textit{junk}, \textit{fused}, \textit{missing}, and \textit{repeated}. Our analysis compares three types of topic models under different settings and demonstrates how our framework can detect new topics from topical misalignments. In particular, so-called \textit{junk} topics are more likely to be new topics, while \textit{missing} topics are likely to have died out.
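The week-over-week comparison above can be sketched with a simple correspondence measure. This is a minimal illustration, assuming topics are represented as topic-word probability vectors and using cosine similarity with a hypothetical threshold; the thesis's actual alignment procedure may differ:

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def new_topics(current, previous, threshold=0.5):
    """Flag current-week topics whose best match among previous-week
    topics falls below the threshold: candidates for genuinely new topics."""
    flagged = []
    for i, topic in enumerate(current):
        best = max(cosine(topic, prev_topic) for prev_topic in previous)
        if best < threshold:
            flagged.append(i)
    return flagged

# Two stable topics last week; this week the second topic mass moves
# entirely to a previously unused word (index 4), so it is flagged as new.
prev = [[0.5, 0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5, 0.0]]
curr = [[0.5, 0.4, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
print(new_topics(curr, prev))  # [1]
```

A topic from the previous week with no good match in the current week would, symmetrically, be a candidate for a topic that has died out.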
To gain more insight into the nature of microblogs we apply topical alignment to hashtags. Comparing topics to hashtags enables us to make interesting inferences about Twitter messages and their content. Our study revealed that although a very small proportion of Twitter messages explicitly contain hashtags, the proportion of tweets that discuss topics related to hashtags is much higher.
Correction: A correlated topic model of Science
Correction to Annals of Applied Statistics 1 (2007) 17--35
[doi:10.1214/07-AOAS114]. Comment: Published at
http://dx.doi.org/10.1214/07-AOAS136 in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
Stochastic Divergence Minimization for Biterm Topic Model
With the emergence and rapid development of social networks, huge numbers of
short texts are accumulated and need to be processed. Inferring the latent
topics of collected short texts is useful for understanding their hidden
structure and predicting new content. Unlike conventional topic models such as
latent Dirichlet allocation (LDA), the biterm topic model (BTM) was recently
proposed for short texts; it overcomes the sparseness of document-level word
co-occurrences by directly modeling the generation process of word pairs.
Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and
collapsed variational inference have been proposed for BTM. However, they
either incur high computational complexity or rely on very crude estimation.
In this work, we develop a stochastic divergence minimization inference
algorithm for BTM to estimate latent topics more accurately in a scalable way.
Experiments demonstrate the superiority of our proposed algorithm over
existing inference algorithms. Comment: 19 pages, 4 figures
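The word pairs that BTM models, the "biterms," are simply all unordered co-occurring word pairs within a short text. A minimal sketch of their extraction (the tokenized example document is illustrative, not from the paper):

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """All unordered word pairs co-occurring in one short text: the
    'biterms' that BTM models instead of document-level word counts."""
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

print(extract_biterms(["visit", "apple", "store"]))
```

BTM then treats the whole corpus as one big collection of such biterms, which is why it sidesteps the sparsity of per-document word co-occurrence in short texts.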
