DOLDA - a regularized supervised topic model for high-dimensional multi-class regression
Generating user interpretable multi-class predictions in data rich
environments with many classes and explanatory covariates is a daunting task.
We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised
topic model for multi-class classification that can handle both many classes
and many covariates. To handle many classes we use the recently proposed
Diagonal Orthant (DO) probit model (Johndrow et al., 2013) together with an
efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al.,
2010). We propose a computationally efficient parallel Gibbs sampler for the
new model. An important advantage of DOLDA is that learned topics are directly
connected to individual classes without the need for a reference class. We
evaluate the model's predictive accuracy on two datasets and demonstrate
DOLDA's advantage in interpreting the generated predictions.
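Concretely, prediction under the DO probit model assigns each document to the class whose latent utility is largest, without a reference class. Below is a minimal sketch of that decision rule, assuming per-class weights have already been learned; all names are illustrative and none of this is the paper's code.

```python
import numpy as np

# A minimal sketch of prediction in a Diagonal Orthant (DO) probit
# classifier, assuming per-class regression weights `Beta` (C x D) learned
# from topic proportions and covariates. Each class c has its own latent
# utility z_c = x @ Beta[c]; an observation falls in class c's orthant when
# z_c > 0 and z_j < 0 for all j != c, so no reference class is needed.
# Taking the argmax of the mean utilities is a simplified point prediction.

def do_probit_predict(X, Beta):
    """Assign each row of X to the class with the largest latent utility."""
    Z = X @ Beta.T                     # (N, C) latent utilities
    return np.argmax(Z, axis=1)

# Toy usage: 5 documents, 3 features (e.g. topic proportions), 4 classes.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(3), size=5)
Beta = rng.normal(size=(4, 3))
print(do_probit_predict(X, Beta))
```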
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
Topic modeling is one of the most powerful techniques in text mining for
latent data discovery and for finding relationships among text documents.
Researchers have published many articles in the field of topic modeling and
have applied it in various fields such as software engineering, political
science, medicine, and linguistics. Among the various methods for topic
modeling, Latent Dirichlet Allocation (LDA) is one of the most popular, and
researchers have proposed various models based on it. This paper therefore
serves as an introduction to LDA-based approaches to topic modeling. We
investigate scholarly articles (published between 2003 and 2016) closely
related to LDA-based topic modeling in order to trace the research
development, current trends, and intellectual structure of the field. We also
summarize open challenges and introduce well-known tools and datasets for
topic modeling based on LDA.
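For readers new to the area, the following is a minimal illustration of plain LDA, the model this survey is organized around, using the widely used gensim library; the toy corpus and topic count are assumptions for demonstration only.

```python
# Fit a tiny LDA model with gensim and inspect the learned topics.
from gensim import corpora, models

texts = [
    ["topic", "model", "text", "corpus"],
    ["gibbs", "sampling", "inference", "topic"],
    ["word", "document", "corpus", "model"],
]
dictionary = corpora.Dictionary(texts)            # vocabulary
bow = [dictionary.doc2bow(t) for t in texts]      # bag-of-words counts
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```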
An alternative text representation to TF-IDF and Bag-of-Words
In text mining, information retrieval, and machine learning, text documents
are commonly represented through variants of sparse Bag of Words (sBoW) vectors
(e.g. TF-IDF). Although simple and intuitive, sBoW style representations suffer
from their inherent over-sparsity and fail to capture word-level synonymy and
polysemy. Especially when labeled data is limited (e.g. in document
classification), or the text documents are short (e.g. emails or abstracts),
many features are rarely observed within the training corpus. This leads to
overfitting and reduced generalization accuracy. In this paper we propose Dense
Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW
document features. dCoT explicitly models absent words by removing and
reconstructing random sub-sets of words in the unlabeled corpus. With this
approach, dCoT learns to reconstruct frequent words from co-occurring
infrequent words and maps the high dimensional sparse sBoW vectors into a
low-dimensional dense representation. We show that the feature removal can be
marginalized out and that the reconstruction can be solved for in closed-form.
We demonstrate empirically, on several benchmark datasets, that dCoT features
significantly improve classification accuracy across several document
classification tasks.
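The key computational idea is that the expected reconstruction under random word removal has a closed-form least-squares solution. The following is a minimal sketch of that marginalized construction, assuming a uniform removal probability p; the tanh nonlinearity and ridge term are common choices in this family of methods, not necessarily the paper's exact formulation, and the low-dimensional projection step is omitted.

```python
import numpy as np

# Marginalized word removal: each word (feature) is removed with
# probability p, and the expected least-squares reconstruction
# W = E[P] E[Q]^{-1} is computed analytically instead of by sampling
# corrupted copies of the corpus.

def marginalized_denoising_map(X, p=0.5):
    """X: (d, n) sBoW matrix, columns are documents. Returns mapping W."""
    d = X.shape[0]
    q = 1.0 - p                           # probability a word survives
    S = X @ X.T                           # (d, d) scatter matrix
    Q = S * (q * q)                       # E[x_tilde x_tilde^T], off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))   # diagonal survives with prob q
    P = S * q                             # E[x x_tilde^T]
    # Small ridge term keeps the solve well-posed on sparse corpora.
    return P @ np.linalg.inv(Q + 1e-5 * np.eye(d))

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(50, 200)).astype(float)   # toy term-doc counts
W = marginalized_denoising_map(X, p=0.5)
H = np.tanh(W @ X)                        # dense document representation
print(H.shape)
```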
Conceptualization Topic Modeling
Recently, topic modeling has been widely used to discover the abstract topics
in text corpora. Most of the existing topic models are based on the assumption
of three-layer hierarchical Bayesian structure, i.e. each document is modeled
as a probability distribution over topics, and each topic is a probability
distribution over words. However, this assumption is not optimal.
Intuitively, it is more reasonable to assume that each topic is a probability
distribution over concepts, and that each concept is in turn a probability
distribution over words, i.e. to add a latent concept layer between the topic
layer and the word layer of the traditional three-layer assumption. In this
paper, we verify the proposed assumption by incorporating it into two
representative topic models, obtaining two novel topic models. Extensive
experiments comparing the proposed models with the corresponding baselines
show that the proposed models significantly outperform the baselines in terms
of both case studies and perplexity, indicating that the new assumption is
more reasonable than the traditional one.
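A minimal generative sketch of the proposed four-layer assumption (document to topic to concept to word) is shown below; the dimensions and Dirichlet hyperparameters are toy assumptions for illustration.

```python
import numpy as np

# Generative process: theta (doc -> topics), phi (topic -> concepts),
# psi (concept -> words), with a latent concept drawn between each
# topic assignment and each emitted word.
rng = np.random.default_rng(0)
K, C, V = 3, 5, 20        # topics, latent concepts, vocabulary size
alpha = np.ones(K) * 0.5
phi = rng.dirichlet(np.ones(C), size=K)   # topic -> concept distributions
psi = rng.dirichlet(np.ones(V), size=C)   # concept -> word distributions

def generate_document(n_words):
    theta = rng.dirichlet(alpha)          # document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)        # draw a topic
        c = rng.choice(C, p=phi[z])       # draw a concept from the topic
        words.append(rng.choice(V, p=psi[c]))  # draw a word from the concept
    return words

print(generate_document(10))
```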
Communication-Free Parallel Supervised Topic Models
Embarrassingly (communication-free) parallel Markov chain Monte Carlo (MCMC)
methods are commonly used in learning graphical models. However, MCMC cannot
be directly applied to learning topic models because of the quasi-ergodicity
problem caused by the multimodal distribution of topics. In this paper, we
develop an embarrassingly parallel MCMC algorithm for sLDA. Our algorithm
works by switching the order of topic sampling and label prediction in sLDA;
it overcomes the quasi-ergodicity problem because the high-dimensional
topics, which follow a multimodal distribution, are projected onto
one-dimensional document labels that follow a unimodal distribution. Our
empirical experiments confirm that the out-of-sample prediction performance
of our embarrassingly parallel algorithm is comparable to non-parallel sLDA
while the computation time is significantly reduced.
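The overall structure can be sketched as follows: independent chains run with no inter-process communication, each emits one-dimensional label predictions, and the chain outputs are combined afterwards. The fit_and_predict body below is a placeholder, not the paper's sLDA sampler.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

# Communication-free skeleton: each worker runs a full chain on its own
# copy of the data; results are only combined after all chains finish.

def fit_and_predict(seed):
    rng = np.random.default_rng(seed)
    # ... run a full Gibbs chain here and predict labels for test docs ...
    return rng.normal(size=10)            # stand-in for 10 test predictions

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        chain_preds = list(pool.map(fit_and_predict, range(4)))
    y_hat = np.mean(chain_preds, axis=0)  # combine chains post hoc
    print(y_hat)
```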
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
The amount of text that is generated every day is increasing dramatically.
This tremendous volume of mostly unstructured text cannot be simply processed
and perceived by computers. Therefore, efficient and effective techniques and
algorithms are required to discover useful patterns. Text mining is the task
of extracting meaningful information from text, and it has gained significant
attention in recent years. In this paper, we describe several of the most
fundamental text mining tasks and techniques, including text pre-processing,
classification, and clustering. Additionally, we briefly explain text mining
in the biomedical and health care domains.
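As an illustration of the pre-processing, representation, and clustering steps the survey covers, here is a minimal scikit-learn pipeline on a toy corpus; the corpus and cluster count are stand-ins.

```python
# Pre-process (tokenize, lowercase, drop stop words), represent as TF-IDF,
# then cluster the documents with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "patients respond to the new treatment",
    "clinical trial results for the drug",
    "topic models cluster text documents",
    "text mining extracts patterns from documents",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```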
Supervised Topic Models
We introduce supervised latent Dirichlet allocation (sLDA), a statistical
model of labelled documents. The model accommodates a variety of response
types. We derive an approximate maximum-likelihood procedure for parameter
estimation, which relies on variational methods to handle intractable posterior
expectations. Prediction problems motivate this research: we use the fitted
model to predict response values for new documents. We test sLDA on two
real-world problems: movie ratings predicted from reviews, and the political
tone of amendments in the U.S. Senate based on the amendment text. We
illustrate the benefits of sLDA versus modern regularized regression, as well
as versus an unsupervised LDA analysis followed by a separate regression.
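For a Gaussian response, sLDA predicts via the linear response eta^T z_bar, where z_bar holds a document's empirical topic frequencies. A minimal numerical sketch, with illustrative values for eta and the topic assignments:

```python
import numpy as np

# sLDA response model: y | z_bar ~ N(eta^T z_bar, sigma^2), where z_bar
# is the document's vector of empirical topic frequencies.
rng = np.random.default_rng(0)
K = 4
eta = rng.normal(size=K)              # fitted response coefficients
z = rng.integers(0, K, size=50)       # topic assignment of each word
z_bar = np.bincount(z, minlength=K) / z.size   # empirical topic frequencies
y_hat = eta @ z_bar                   # e.g. a predicted movie rating
print(y_hat)
```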
Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data
We describe a nonparametric topic model for labeled data. The model uses a
mixture of random measures (MRM) as a base distribution of the Dirichlet
process (DP) of the HDP framework, so we call it the DP-MRM. To model labeled
data, we define a DP distributed random measure for each label, and the
resulting model generates an unbounded number of topics for each label. We
apply DP-MRM on single-labeled and multi-labeled corpora of documents and
compare the performance on label prediction with MedLDA, LDA-SVM, and
Labeled-LDA. We further enhance the model by incorporating ddCRP and modeling
multi-labeled images for image segmentation and object labeling, comparing the
performance with nCuts and rddCRP.
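A rough generative sketch of the labeled-data construction, with each label's DP-distributed random measure truncated at K atoms for illustration (the truncation and all hyperparameters are assumptions, not the paper's inference procedure):

```python
import numpy as np

# One random measure per label: a document's words are drawn by first
# picking one of the document's labels, then a topic from that label's
# (truncated) stick-breaking measure.
rng = np.random.default_rng(0)
L, K, V = 3, 10, 30     # labels, truncation level, vocabulary size

def stick_breaking(gamma, K):
    betas = rng.beta(1.0, gamma, size=K)
    remain = np.concatenate([[1.0], np.cumprod(1 - betas[:-1])])
    w = betas * remain
    return w / w.sum()                 # renormalize after truncation

label_topic_w = np.stack([stick_breaking(1.0, K) for _ in range(L)])
topics = rng.dirichlet(np.ones(V) * 0.1, size=(L, K))  # per-label topics

def generate(labels, n_words):
    words = []
    for _ in range(n_words):
        l = rng.choice(labels)                    # pick one observed label
        k = rng.choice(K, p=label_topic_w[l])     # topic from its measure
        words.append(rng.choice(V, p=topics[l, k]))
    return words

print(generate([0, 2], 12))   # a document carrying labels 0 and 2
```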
Centroid estimation based on symmetric KL divergence for Multinomial text classification problem
We define a new method to estimate centroids for text classification, based
on the symmetric KL divergence between the distribution of words in training
documents and their class centroids. Experiments on several standard data
sets indicate that the new method achieves substantial improvements over
traditional classifiers.
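A minimal sketch of the resulting decision rule, classifying a document's word distribution by its nearest centroid under symmetric KL divergence; the smoothing and the centroid estimates used here are illustrative choices, not necessarily the paper's estimator:

```python
import numpy as np

# Symmetric KL divergence KL(p||q) + KL(q||p) between word distributions,
# used as the distance in a nearest-centroid classifier.

def sym_kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps             # smooth to avoid log(0)
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def classify(doc_dist, centroids):
    return min(centroids, key=lambda c: sym_kl(doc_dist, centroids[c]))

rng = np.random.default_rng(0)
centroids = {"sports": rng.dirichlet(np.ones(20)),
             "politics": rng.dirichlet(np.ones(20))}
doc = rng.dirichlet(np.ones(20))
print(classify(doc, centroids))
```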
Large scale link based latent Dirichlet allocation for web document classification
In this paper we demonstrate the applicability of latent Dirichlet allocation
(LDA) for classifying large Web document collections. One of our main results
is a novel influence model that gives a fully generative model of the document
content taking linkage into account. In our setup, topics propagate along links
in such a way that linked documents directly influence the words in the linking
document. As another main contribution we develop LDA specific boosting of
Gibbs samplers resulting in a significant speedup in our experiments. The
inferred LDA model can be applied for classification as dimensionality
reduction similarly to latent semantic indexing. In addition, the model yields
link weights that can be applied in algorithms to process the Web graph; as an
example we deploy LDA link weights in stacked graphical learning. Using
Weka's BayesNet classifier, we achieve a 4% improvement in classification AUC
over plain LDA with BayesNet and an 18% improvement over tf.idf with SVM. Our
Gibbs sampling strategies yield about a 5-10 times speedup with less than a
1% decrease in accuracy in terms of likelihood and classification AUC.
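As a sketch of the "LDA as dimensionality reduction" use described above, the following pipeline feeds inferred topic proportions to a downstream classifier; scikit-learn stands in for the paper's Gibbs sampler and Weka's BayesNet, and the corpus and labels are toy stand-ins.

```python
# Word counts -> LDA topic proportions -> downstream classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["link graph web page", "web document link anchor",
        "movie review rating film", "film actor review plot"]
labels = [0, 0, 1, 1]
clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(docs, labels)
print(clf.predict(["web link page", "review of the film"]))
```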