Efficient Correlated Topic Modeling with Topic Embedding
Correlated topic modeling has been limited to small model and problem sizes
due to its high computational cost and poor scaling. In this paper, we
propose a new model which learns compact topic embeddings and captures topic
correlations through the closeness between the topic vectors. Our method
enables efficient inference in the low-dimensional embedding space, reducing
previous cubic or quadratic time complexity to linear w.r.t. the number of topics. We
further speed up variational inference with a fast sampler that exploits the sparsity
of topic occurrence. Extensive experiments show that our approach is capable of
handling model and data scales which are several orders of magnitude larger
than existing correlation results, without sacrificing modeling quality: it
provides competitive or superior performance in document classification and
retrieval.
Comment: KDD 2017 oral. The first two authors contributed equally.
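To make the embedding idea concrete, here is a minimal numerical sketch, not the paper's algorithm: topics are represented as low-dimensional vectors, a document's topic proportions are induced by its closeness to each topic vector, and scoring all K topics is linear in K. All names, dimensions, and the softmax link are illustrative assumptions.

```python
import numpy as np

# Sketch of the core idea (illustrative, not the paper's model): each topic is
# a low-dimensional vector, and a document's topic distribution comes from the
# closeness of a document vector to every topic vector. Scoring all K topics
# costs O(K * d) -- linear in the number of topics -- instead of manipulating
# a full K x K topic covariance as in the classical correlated topic model.

rng = np.random.default_rng(0)
K, d = 1000, 50                       # number of topics, embedding dimension
topic_vecs = rng.normal(size=(K, d))  # hypothetical learned topic embeddings
doc_vec = rng.normal(size=d)          # hypothetical learned document embedding

logits = topic_vecs @ doc_vec         # closeness of the document to each topic
theta = np.exp(logits - logits.max())
theta /= theta.sum()                  # softmax: topic proportions

# Nearby topic vectors receive similar logits for every document, so their
# proportions co-vary: correlation emerges from geometry rather than from an
# explicit covariance matrix.
print(theta.shape)  # (1000,)
```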
Exploratory topic modeling with distributional semantics
As we continue to collect and store textual data in a multitude of domains,
we are regularly confronted with material whose largely unknown thematic
structure we want to uncover. With unsupervised, exploratory analysis, no prior
knowledge about the content is required and highly open-ended tasks can be
supported. In the past few years, probabilistic topic modeling has emerged as a
popular approach to this problem. Nevertheless, the representation of the
latent topics as aggregations of semi-coherent terms limits their
interpretability and level of detail.
This paper presents an alternative approach to topic modeling that maps
topics as a network for exploration, based on distributional semantics using
learned word vectors. From the granular level of terms and their semantic
similarity relations, global topic structures emerge as clustered regions and
gradients of concepts. Moreover, the paper discusses the visual interactive
representation of the topic map, which plays an important role in supporting
its exploration.
Comment: Conference: The Fourteenth International Symposium on Intelligent
Data Analysis (IDA 2015).
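The mapping step lends itself to a short sketch, under assumed inputs: word vectors (random stand-ins here for learned embeddings such as word2vec) are linked by cosine similarity, and clustered regions of the resulting graph play the role of emergent topic structures. The vocabulary and threshold below are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Illustrative sketch: link terms whose cosine similarity exceeds a threshold
# and read clustered regions of the graph as candidate topic structures.

rng = np.random.default_rng(1)
vocab = [f"term_{i}" for i in range(200)]  # hypothetical vocabulary
vecs = rng.normal(size=(len(vocab), 100))  # stand-ins for learned word vectors
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sim = vecs @ vecs.T                        # cosine similarity of unit vectors
adj = csr_matrix(sim > 0.2)                # similarity graph (threshold is illustrative)
n_clusters, labels = connected_components(adj, directed=False)

# Each connected component is one "region" of the topic map.
print(n_clusters, np.bincount(labels)[:10])
```

With real embeddings, the threshold (or a k-nearest-neighbour rule) controls how coarse or fine the emergent regions are.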
A Topic Modeling Approach to Ranking
We propose a topic modeling approach to the prediction of preferences in
pairwise comparisons. We develop a new generative model for pairwise
comparisons that accounts for multiple shared latent rankings that are
prevalent in a population of users. This new model also captures inconsistent
user behavior in a natural way. We show how the estimation of latent rankings
in the new generative model can be formally reduced to the estimation of topics
in a statistically equivalent topic modeling problem. We leverage recent
advances in the topic modeling literature to develop an algorithm that can
learn shared latent rankings with provable consistency as well as sample and
computational complexity guarantees. We demonstrate that the new approach is
empirically competitive with the current state-of-the-art approaches in
predicting preferences on semi-synthetic and real-world datasets.
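The reduction can be illustrated with a toy encoding, assuming each user is treated as a "document" and each observed outcome "i beats j" as a "word", so that mixtures of shared latent rankings become topics in a standard topic model. This encoding and the synthetic data are assumptions for illustration, not the paper's exact construction or algorithm.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(2)
n_items, n_users = 8, 300
pairs = [(i, j) for i in range(n_items) for j in range(n_items) if i != j]
pair_index = {p: w for w, p in enumerate(pairs)}   # ordered pair -> "word" id

# Synthetic users: half prefer ascending item order, half descending.
X = np.zeros((n_users, len(pairs)), dtype=int)
for u in range(n_users):
    ranking = np.arange(n_items) if u < n_users // 2 else np.arange(n_items)[::-1]
    pos = {item: r for r, item in enumerate(ranking)}
    for _ in range(40):                            # 40 noisy comparisons per user
        i, j = rng.choice(n_items, size=2, replace=False)
        winner, loser = (i, j) if pos[i] < pos[j] else (j, i)
        if rng.random() < 0.1:                     # 10% inconsistent behavior
            winner, loser = loser, winner
        X[u, pair_index[(winner, loser)]] += 1

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# Each "topic" is a distribution over pairwise outcomes: a noisy latent ranking.
print(lda.transform(X[:3]))                        # users' mixtures over rankings
```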
Classifying Web Exploits with Topic Modeling
This short empirical paper investigates how well topic modeling and database
meta-data characteristics can classify web and other proof-of-concept (PoC)
exploits for publicly disclosed software vulnerabilities. Using a dataset
comprising over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is
obtained in the empirical experiment. Text mining and topic modeling
contribute significantly to this classification performance. In addition to
these empirical results, the paper contributes to the research tradition of
enhancing software vulnerability information with text mining, also providing
a few scholarly observations about the potential for semi-automatic
classification of exploits in the existing tracking infrastructures.
Comment: Proceedings of the 2017 28th International Workshop on Database and
Expert Systems Applications (DEXA).
http://ieeexplore.ieee.org/abstract/document/8049693
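The general recipe the paper evaluates, topic-model features feeding a classifier, can be sketched as a generic scikit-learn pipeline. The toy texts, labels, and hyperparameters below are illustrative assumptions, not the paper's dataset or setup.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy stand-in corpus: short exploit-like descriptions with made-up labels.
texts = [
    "sql injection in login form parameter id",
    "cross site scripting via unsanitized search query",
    "stack buffer overflow in parsing routine",
    "heap overflow when decoding crafted file",
] * 25                                    # repeated so the toy model has data
labels = ["web", "web", "memory", "memory"] * 25

clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=4, random_state=0),  # topic features
    LogisticRegression(max_iter=1000),    # classifier on topic proportions
)
clf.fit(texts, labels)
print(clf.predict(["reflected cross site scripting in url parameter"]))
```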
Memory-Efficient Topic Modeling
As one of the simplest probabilistic topic modeling techniques, latent
Dirichlet allocation (LDA) has found many important applications in text
mining, computer vision and computational biology. Recent training algorithms
for LDA can be interpreted within a unified message passing framework. However,
message passing requires storing previous messages, whose memory footprint
increases linearly with the number of documents or the number of
topics. Therefore, the high memory usage is often a major problem for topic
modeling of massive corpora containing a large number of topics. To reduce the
space complexity, we propose a novel algorithm that trains LDA without storing
previous messages: tiny belief propagation (TBP). The basic idea of TBP is to
relate message passing algorithms to non-negative matrix factorization (NMF)
algorithms, absorbing the message updates into the factorization updates and
thus avoiding the storage of previous messages. Experimental
results on four large data sets confirm that TBP performs comparably to or even
better than current state-of-the-art training algorithms for LDA, but with much
lower memory consumption. TBP can perform topic modeling when massive corpora
cannot fit in main memory, for example, extracting thematic topics from a 7 GB
PUBMED corpus on a common desktop computer with 2 GB of memory.
Comment: 20 pages, 7 figures.
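The memory argument behind the NMF connection can be sketched as follows, with an illustrative KL-divergence NMF trainer that keeps only the two factor matrices and recomputes per-token responsibilities on the fly. This is a minimal stand-in under those assumptions, not the TBP algorithm itself.

```python
import numpy as np

# A message-passing trainer stores a message per (document, word, topic)
# occurrence; an NMF-style trainer keeps only the two factors and folds the
# responsibility computation into each multiplicative update.

rng = np.random.default_rng(3)
D, V, K = 500, 2000, 20                            # documents, vocabulary, topics
X = rng.poisson(0.05, size=(D, V)).astype(float)   # toy document-word counts

W = rng.random((D, K)) + 1e-3                      # document-topic weights
H = rng.random((K, V)) + 1e-3                      # topic-word weights

for _ in range(30):                                # multiplicative KL-NMF updates
    WH = W @ H + 1e-9
    W *= (X / WH) @ H.T / H.sum(axis=1)            # update document-topic factor
    WH = W @ H + 1e-9
    H *= W.T @ (X / WH) / W.sum(axis=0)[:, None]   # update topic-word factor

# Memory footprint: O(DK + KV) for the factors, versus O(nnz(X) * K) if a
# message per word occurrence and topic had to be kept between sweeps.
print(np.sum(X * -np.log(W @ H + 1e-9)))           # monitored fit quantity
```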
