Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
Topic modeling is one of the most powerful techniques in text mining for data
mining, latent data discovery, and finding relationships among data and text
documents. Researchers have published many articles in the field of topic
modeling and applied it in various fields such as software engineering,
political science, medicine, and linguistics. Among the various methods for
topic modeling, Latent Dirichlet Allocation (LDA) is one of the most popular,
and researchers have proposed many models based on it. Building on previous
work, this paper serves as a useful introduction to LDA-based approaches to
topic modeling. We investigate scholarly articles (published between 2003 and
2016) highly related to topic modeling based on LDA to discover the research
development, current trends, and intellectual structure of the field. We also
summarize open challenges and introduce well-known tools and datasets for
topic modeling based on LDA.
Comment: arXiv admin note: text overlap with arXiv:1505.07302 by other authors
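As a concrete illustration of the LDA method this survey centers on, here is a minimal collapsed Gibbs sampler in pure Python. The toy documents, hyperparameters, and function name are illustrative assumptions, not taken from the survey:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})              # vocabulary size
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # words per topic
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):                     # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove current token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z_i = t | all other assignments)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                acc, k = 0.0, n_topics - 1             # roulette-wheel sample
                for t in range(n_topics):
                    acc += weights[t]
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k                            # add the token back
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # smoothed point estimate of the document-topic proportions
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
              for t in range(n_topics)] for d in range(len(docs))]
    return theta

docs = [["apple", "banana", "apple", "fruit"],
        ["goal", "match", "team", "goal"],
        ["banana", "fruit", "apple"],
        ["team", "match", "player"]]
theta = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a probability distribution over the two topics for one document; in practice one would run many more iterations on a much larger corpus.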
Short Text Topic Modeling Techniques, Applications, and Performance: A Survey
Inferring discriminative and coherent latent topics from short texts is a
critical and fundamental task, since many real-world applications require
semantic understanding of short texts. Traditional long text topic modeling
algorithms (e.g., PLSA and LDA), which rely on word co-occurrences, cannot
solve this problem well because only very limited word co-occurrence
information is available in short texts. Short text topic modeling, which aims
to overcome this sparseness, has therefore attracted much attention from the
machine learning research community in recent years. In
this survey, we conduct a comprehensive review of various short text topic
modeling techniques proposed in the literature. We present three categories of
methods based on Dirichlet multinomial mixture, global word co-occurrences, and
self-aggregation, with examples of representative approaches in each category
and an analysis of their performance on various tasks. We have also developed
the first comprehensive open-source Java library, called STTM, which
integrates all surveyed algorithms within a unified interface together with
benchmark datasets, to facilitate the development of new methods in this
research field. Finally, we evaluate these state-of-the-art methods on many
real-world datasets and compare their performance against one another and
against long text topic modeling algorithms.
Comment: arXiv admin note: text overlap with arXiv:1808.02215 by other authors
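The Dirichlet multinomial mixture family surveyed above assigns each short document to exactly one topic, which sidesteps the word co-occurrence sparsity problem. A minimal sketch of such a sampler follows; the documents, hyperparameters, and the simplified conditional are illustrative assumptions, not the STTM implementation:

```python
import random
from collections import defaultdict

def dmm_gibbs(docs, K, alpha=0.1, beta=0.1, iters=100, seed=0):
    """Gibbs sampling for a Dirichlet multinomial mixture:
    each short document is assigned to a single topic."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    mk = [0] * K                                 # documents per topic
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # words per topic
    z = []                                       # topic of each document
    for doc in docs:                             # random initialization
        k = rng.randrange(K)
        z.append(k)
        mk[k] += 1; nk[k] += len(doc)
        for w in doc:
            nkw[k][w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]                             # remove the whole document
            mk[k] -= 1; nk[k] -= len(doc)
            for w in doc:
                nkw[k][w] -= 1
            # simplified conditional: topic popularity times word fit
            weights = []
            for t in range(K):
                p = mk[t] + alpha
                for w in doc:
                    p *= (nkw[t][w] + beta) / (nk[t] + V * beta)
                weights.append(p)
            r = rng.random() * sum(weights)
            acc, k = 0.0, K - 1                  # roulette-wheel sample
            for t in range(K):
                acc += weights[t]
                if r <= acc:
                    k = t
                    break
            z[d] = k                             # reassign the document
            mk[k] += 1; nk[k] += len(doc)
            for w in doc:
                nkw[k][w] += 1
    return z

docs = [["cheap", "flight", "deal"], ["flight", "deal", "sale"],
        ["goal", "match", "team"], ["team", "goal", "win"]]
z = dmm_gibbs(docs, K=2)
```

Because every word in a document shares one topic, even a three-word tweet contributes a full document's worth of evidence to its topic's word distribution.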
Spatial Semantic Scan: Jointly Detecting Subtle Events and their Spatial Footprint
Many methods have been proposed for detecting emerging events in text streams
using topic modeling. However, these methods have shortcomings that make them
unsuitable for rapid detection of locally emerging events on massive text
streams. We describe Spatially Compact Semantic Scan (SCSS), which has been
developed specifically to overcome the shortcomings of current methods in
detecting new spatially compact events in text streams. SCSS employs
alternating optimization between using semantic scan to estimate contrastive
foreground topics in documents, and discovering spatial neighborhoods with high
occurrence of documents containing the foreground topics. We evaluate our
method on an Emergency Department (ED) chief complaints dataset to verify its
effectiveness in detecting real-world disease outbreaks from free-text ED
chief complaint data.
Comment: 26 pages
Improving Topic Models with Latent Feature Word Representations
Probabilistic topic models are widely used to discover latent topics in
document collections, while latent feature vector representations of words have
been used to obtain high performance in many NLP tasks. In this paper, we
extend two different Dirichlet multinomial topic models by incorporating latent
feature vector representations of words trained on very large corpora to
improve the word-topic mapping learnt on a smaller corpus. Experimental results
show that by using information from the external corpora, our new models
produce significant improvements on topic coherence, document clustering and
document classification tasks, especially on datasets with few or short
documents.
Comment: The published version is available at:
https://transacl.org/ojs/index.php/tacl/article/view/582 ; the source code is
available at: https://github.com/datquocnguyen/LFT
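The core idea of combining a count-based Dirichlet multinomial component with a latent-feature component trained on external corpora can be sketched numerically. The embeddings, counts, topic vector, and mixture weight below are toy assumptions for illustration, not the paper's trained values:

```python
import math

# toy 2-d word embeddings and topic-word counts (assumed for illustration)
emb = {"apple": [1.0, 0.1], "banana": [0.9, 0.2], "team": [0.1, 1.0]}
count = {"apple": 5, "banana": 3, "team": 0}   # counts from the count-based part
topic_vec = [1.0, 0.0]                          # latent feature vector for one topic
beta, lam = 0.01, 0.6                           # smoothing and mixture weight

def latent_feature_prob(w):
    """Softmax over topic-vector / word-vector dot products."""
    scores = {v: sum(a * b for a, b in zip(emb[v], topic_vec)) for v in emb}
    m = max(scores.values())                    # subtract max for stability
    exps = {v: math.exp(s - m) for v, s in scores.items()}
    return exps[w] / sum(exps.values())

def dirichlet_multinomial_prob(w):
    """Smoothed topic-word probability from observed counts."""
    total = sum(count.values())
    return (count[w] + beta) / (total + len(emb) * beta)

def mixed_prob(w):
    """Two-component mixture of latent-feature and count-based parts."""
    return lam * latent_feature_prob(w) + (1 - lam) * dirichlet_multinomial_prob(w)

probs = {w: mixed_prob(w) for w in emb}
```

Since both components are proper distributions over the vocabulary, the mixture is too; the embedding component lets a word unseen in the small corpus still receive meaningful topic probability.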
Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering
In the last decade, a variety of topic models have been proposed for text
engineering. However, apart from Probabilistic Latent Semantic Analysis (PLSA)
and Latent Dirichlet Allocation (LDA), most existing topic models are seldom
applied or considered in industrial scenarios, largely because very few
convenient tools support them so far. Intimidated by the expertise and labor
demanded by designing and implementing parameter inference algorithms,
software engineers tend to simply resort to PLSA/LDA, without considering
whether these models are appropriate for the problem at hand. In this paper,
we propose a configurable topic modeling
framework named Familia, in order to bridge the huge gap between academic
research fruits and current industrial practice. Familia supports an important
line of topic models that are widely applicable in text engineering scenarios.
In order to relieve burdens of software engineers without knowledge of Bayesian
networks, Familia is able to conduct automatic parameter inference for a
variety of topic models. Simply through changing the data organization of
Familia, software engineers are able to easily explore a broad spectrum of
existing topic models or even design their own topic models, and find the one
that best suits the problem at hand. Beyond its extensibility, Familia offers
a novel sampling mechanism that strikes a balance between the effectiveness and
efficiency of parameter inference. Furthermore, Familia is essentially a big
topic modeling framework that supports parallel parameter inference and
distributed parameter storage. The utilities and necessity of Familia are
demonstrated in real-life industrial applications. Familia would significantly
enlarge software engineers' arsenal of topic models and pave the way for
utilizing highly customized topic models in real-life problems.
Comment: 21 pages, 15 figures
Nested Variational Autoencoder for Topic Modeling on Microtexts with Word Vectors
Most of the information on the Internet is represented in the form of
microtexts, which are short text snippets such as news headlines or tweets.
These sources of information are abundant, and mining these data could uncover
meaningful insights. Topic modeling is one of the popular methods to extract
knowledge from a collection of documents; however, conventional topic models
such as latent Dirichlet allocation (LDA) are unable to perform well on short
documents, mostly due to the scarcity of word co-occurrence statistics embedded
in the data. The objective of our research is to create a topic model that
achieves strong performance on microtexts while keeping its runtime small
enough to scale to large datasets. To compensate for the lack of information
in microtexts,
we allow our method to take advantage of word embeddings for additional
knowledge of relationships between words. For speed and scalability, we apply
autoencoding variational Bayes, an algorithm that can perform efficient
black-box inference in probabilistic models. The result of our work is a novel
topic model called the nested variational autoencoder, which is a distribution
that takes into account word vectors and is parameterized by a neural network
architecture. For optimization, the model is trained to approximate the
posterior distribution of the original LDA model. Experiments show the
improvements of our model on microtexts as well as its runtime advantage.
Comment: 27 pages, 9 figures, under review at Expert System
Investor Reaction to Financial Disclosures Across Topics: An Application of Latent Dirichlet Allocation
This paper provides a holistic study of how stock prices vary in their
response to financial disclosures across different topics. In doing so, we
specifically shed light on the extensive number of filings for which no a
priori categorization of their content exists. For this purpose, we utilize an
approach from data mining - namely, latent Dirichlet allocation - as a means of
topic modeling. This technique facilitates our task of automatically
categorizing, ex ante, the content of more than 70,000 regulatory 8-K filings
from U.S. companies. We then evaluate the subsequent stock market reaction. Our
empirical evidence suggests a considerable discrepancy among various types of
news stories in terms of their relevance and impact on financial markets. For
instance, we find a statistically significant abnormal return in response to
earnings results and credit ratings, but also for disclosures regarding business
strategy, the health sector, as well as mergers and acquisitions. Our results
yield findings that benefit managers, investors and policy-makers by indicating
how regulatory filings should be structured and the topics most likely to
precede changes in stock valuations.
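The market reaction to a filing is conventionally measured as an abnormal return: the realized return minus the return a market model (fitted over an estimation window) predicts. A minimal sketch of that calculation follows; all return numbers are toy assumptions, not the paper's data:

```python
def market_model_params(stock, market):
    """OLS fit of stock returns on market returns over an estimation window."""
    n = len(stock)
    mx = sum(market) / n
    my = sum(stock) / n
    cov = sum((m - mx) * (s - my) for m, s in zip(market, stock))
    var = sum((m - mx) ** 2 for m in market)
    beta = cov / var              # slope: sensitivity to the market
    alpha = my - beta * mx        # intercept
    return alpha, beta

# toy daily returns over a five-day estimation window
stock = [0.01, -0.02, 0.015, 0.005, -0.01]
market = [0.008, -0.015, 0.012, 0.004, -0.008]
alpha, beta = market_model_params(stock, market)

# abnormal return on the day of a hypothetical 8-K filing:
# realized return minus the market-model prediction
ar = 0.03 - (alpha + beta * 0.01)
```

Aggregating such abnormal returns by LDA-inferred filing topic is what allows the comparison of market impact across topics.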
Fuzzy Approach Topic Discovery in Health and Medical Corpora
The majority of medical documents and electronic health records (EHRs) are in
text format, which poses a challenge for data processing and finding relevant
documents. Looking for ways to automatically retrieve the enormous amount of
health and medical knowledge has always been an intriguing topic. Powerful
methods have been developed in recent years to make the text processing
automatic. One popular approach for retrieving information by discovering the
themes in health and medical corpora is topic modeling; however, this approach
still needs new perspectives. In this research we describe fuzzy latent
semantic analysis (FLSA), a novel approach to topic modeling from a fuzzy
perspective. FLSA addresses the redundancy issue in health and medical corpora and
provides a new method to estimate the number of topics. The quantitative
evaluations show that FLSA offers superior performance and features compared
to latent Dirichlet allocation (LDA), the most popular topic model.
Comment: 12 pages, International Journal of Fuzzy Systems, 201
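The fuzzy perspective rests on fuzzy clustering, in which each document belongs to every cluster with a graded membership rather than a hard assignment. A minimal fuzzy c-means sketch in pure Python illustrates the idea; the points, cluster count, and fuzzifier are toy assumptions, not the FLSA algorithm itself:

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=50, seed=0):
    """Fuzzy c-means: each point gets a graded membership in every cluster."""
    rng = random.Random(seed)
    dim = len(points[0])
    # random initial memberships, normalized so each row sums to 1
    U = []
    for _ in points:
        row = [rng.random() + 1e-6 for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    centers = [[0.0] * dim for _ in range(c)]
    for _ in range(iters):
        # update cluster centers as membership-weighted means
        for k in range(c):
            wsum = sum(U[i][k] ** m for i in range(len(points)))
            centers[k] = [sum(U[i][k] ** m * points[i][d]
                              for i in range(len(points))) / wsum
                          for d in range(dim)]
        # update memberships from distances to the centers
        for i, p in enumerate(points):
            dists = [max(1e-12, sum((a - b) ** 2 for a, b in zip(p, centers[k])) ** 0.5)
                     for k in range(c)]
            for k in range(c):
                U[i][k] = 1.0 / sum((dists[k] / dists[j]) ** (2 / (m - 1))
                                    for j in range(c))
    return U, centers

points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
U, centers = fuzzy_c_means(points, c=2)
```

Each row of `U` is a membership distribution over the clusters; the fuzzifier `m` controls how soft those memberships are.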
Large scale link based latent Dirichlet allocation for web document classification
In this paper we demonstrate the applicability of latent Dirichlet allocation
(LDA) for classifying large Web document collections. One of our main results
is a novel influence model that gives a fully generative model of the document
content taking linkage into account. In our setup, topics propagate along links
in such a way that linked documents directly influence the words in the linking
document. As another main contribution we develop LDA specific boosting of
Gibbs samplers resulting in a significant speedup in our experiments. The
inferred LDA model can be applied for classification as dimensionality
reduction similarly to latent semantic indexing. In addition, the model yields
link weights that can be applied in algorithms to process the Web graph; as an
example we deploy LDA link weights in stacked graphical learning. By using
Weka's BayesNet classifier, we achieve a 4% improvement in classification AUC
over plain LDA with BayesNet and 18% over tf.idf with SVM. Our
Gibbs sampling strategies yield about 5-10 times speedup with less than 1%
decrease in accuracy in terms of likelihood and AUC of classification.
Comment: 16 pages
Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA
We introduce a novel approach for estimating Latent Dirichlet Allocation
(LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full
conditional distributions over the latent variable assignments to efficiently
average over multiple samples, for little more computational cost than drawing
a single additional collapsed Gibbs sample. Our approach can be understood as
adapting the soft clustering methodology of Collapsed Variational Bayes (CVB0)
to CGS parameter estimation, in order to get the best of both techniques. Our
estimators can straightforwardly be applied to the output of any existing
implementation of CGS, including modern accelerated variants. We perform
extensive empirical comparisons of our estimators with those of standard
collapsed inference algorithms on real-world data for both unsupervised LDA and
Prior-LDA, a supervised variant of LDA for multi-label classification. Our
results show a consistent advantage of our approach over traditional CGS under
all experimental conditions, and over CVB0 inference in the majority of
conditions. More broadly, our results highlight the importance of averaging
over multiple samples in LDA parameter estimation, and the use of efficient
computational techniques to do so.
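The contrast between a standard CGS point estimate and one that averages full conditional distributions can be sketched as follows. The documents, assignments, and hyperparameters below are toy assumptions; this illustrates the general idea of soft counts, not the paper's exact estimators:

```python
from collections import defaultdict

def theta_estimates(docs, z, K, alpha=0.1, beta=0.01):
    """Standard CGS point estimate of theta vs. a soft estimate that
    averages each token's full conditional distribution (CVB0-style)."""
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # words per topic
    for d, doc in enumerate(docs):               # tabulate the Gibbs state
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    hard, soft = [], []
    for d, doc in enumerate(docs):
        soft_counts = [0.0] * K
        for i, w in enumerate(doc):
            k = z[d][i]                          # leave the token out
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # full conditional over topics for this token
            cond = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                    for t in range(K)]
            s = sum(cond)
            for t in range(K):
                soft_counts[t] += cond[t] / s    # accumulate soft counts
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        denom = len(doc) + K * alpha
        hard.append([(ndk[d][t] + alpha) / denom for t in range(K)])
        soft.append([(soft_counts[t] + alpha) / denom for t in range(K)])
    return hard, soft

docs = [["a", "b", "a"], ["c", "c", "b"]]
z = [[0, 0, 1], [1, 1, 0]]   # a hypothetical final Gibbs state
hard, soft = theta_estimates(docs, z, K=2)
```

The hard estimate uses only the sampled assignments, while the soft estimate spreads each token's mass over all topics in proportion to its full conditional, at roughly the cost of one extra sweep.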