271 research outputs found
Efficient Correlated Topic Modeling with Topic Embedding
Correlated topic modeling has been limited to small model and problem sizes
due to their high computational cost and poor scaling. In this paper, we
propose a new model which learns compact topic embeddings and captures topic
correlations through the closeness between the topic vectors. Our method
enables efficient inference in the low-dimensional embedding space, reducing
previous cubic or quadratic time complexity to linear w.r.t the topic size. We
further speedup variational inference with a fast sampler to exploit sparsity
of topic occurrence. Extensive experiments show that our approach is capable of
handling model and data scales which are several orders of magnitude larger
than existing correlation results, without sacrificing modeling quality by
providing competitive or superior performance in document classification and
retrieval.Comment: KDD 2017 oral. The first two authors contributed equall
Weakly-Supervised Neural Text Classification
Deep neural networks are gaining increasing popularity for the classic text
classification task, due to their strong expressive power and less requirement
for feature engineering. Despite such attractiveness, neural text
classification models suffer from the lack of training data in many real-world
applications. Although many semi-supervised and weakly-supervised text
classification models exist, they cannot be easily applied to deep neural
models and meanwhile support limited supervision types. In this paper, we
propose a weakly-supervised method that addresses the lack of training data in
neural text classification. Our method consists of two modules: (1) a
pseudo-document generator that leverages seed information to generate
pseudo-labeled documents for model pre-training, and (2) a self-training module
that bootstraps on real unlabeled data for model refinement. Our method has the
flexibility to handle different types of weak supervision and can be easily
integrated into existing deep neural models for text classification. We have
performed extensive experiments on three real-world datasets from different
domains. The results demonstrate that our proposed method achieves inspiring
performance without requiring excessive training data and outperforms baseline
methods significantly.Comment: CIKM 2018 Full Pape
vONTSS: vMF based semi-supervised neural topic modeling with optimal transport
Recently, Neural Topic Models (NTM), inspired by variational autoencoders,
have attracted a lot of research interest; however, these methods have limited
applications in the real world due to the challenge of incorporating human
knowledge. This work presents a semi-supervised neural topic modeling method,
vONTSS, which uses von Mises-Fisher (vMF) based variational autoencoders and
optimal transport. When a few keywords per topic are provided, vONTSS in the
semi-supervised setting generates potential topics and optimizes topic-keyword
quality and topic classification. Experiments show that vONTSS outperforms
existing semi-supervised topic modeling methods in classification accuracy and
diversity. vONTSS also supports unsupervised topic modeling. Quantitative and
qualitative experiments show that vONTSS in the unsupervised setting
outperforms recent NTMs on multiple aspects: vONTSS discovers highly clustered
and coherent topics on benchmark datasets. It is also much faster than the
state-of-the-art weakly supervised text classification method while achieving
similar classification performance. We further prove the equivalence of optimal
transport loss and cross-entropy loss at the global minimum.Comment: 24 pages, 12 figures, ACL findings 202
Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Mining a set of meaningful topics organized into a hierarchy is intuitively
appealing since topic correlations are ubiquitous in massive text corpora. To
account for potential hierarchical topic structures, hierarchical topic models
generalize flat topic models by incorporating latent topic hierarchies into
their generative modeling process. However, due to their purely unsupervised
nature, the learned topic hierarchy often deviates from users' particular needs
or interests. To guide the hierarchical topic discovery process with minimal
user supervision, we propose a new task, Hierarchical Topic Mining, which takes
a category tree described by category names only, and aims to mine a set of
representative terms for each category from a text corpus to help a user
comprehend his/her interested topics. We develop a novel joint tree and text
embedding method along with a principled optimization procedure that allows
simultaneous modeling of the category tree structure and the corpus generative
process in the spherical space for effective category-representative term
discovery. Our comprehensive experiments show that our model, named JoSH, mines
a high-quality set of hierarchical topics with high efficiency and benefits
weakly-supervised hierarchical text classification tasks.Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH
A Topical Approach to Capturing Customer Insight In Social Media
The age of social media has opened new opportunities for businesses. This
flourishing wealth of information is outside traditional channels and
frameworks of classical marketing research, including that of Marketing Mix
Modeling (MMM). Textual data, in particular, poses many challenges that data
analysis practitioners must tackle. Social media constitute massive,
heterogeneous, and noisy document sources. Industrial data acquisition
processes include some amount of ETL. However, the variability of noise in the
data and the heterogeneity induced by different sources create the need for
ad-hoc tools. Put otherwise, customer insight extraction in fully unsupervised,
noisy contexts is an arduous task. This research addresses the challenge of
fully unsupervised topic extraction in noisy, Big Data contexts. We present
three approaches we built on the Variational Autoencoder framework: the
Embedded Dirichlet Process, the Embedded Hierarchical Dirichlet Process, and
the time-aware Dynamic Embedded Dirichlet Process. These nonparametric
approaches concerning topics present the particularity of determining word
embeddings and topic embeddings. These embeddings do not require transfer
learning, but knowledge transfer remains possible. We test these approaches on
benchmark and automotive industry-related datasets from a real-world use case.
We show that our models achieve equal to better performance than
state-of-the-art methods and that the field of topic modeling would benefit
from improved evaluation metrics
Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts
Instead of mining coherent topics from a given text corpus in a completely
unsupervised manner, seed-guided topic discovery methods leverage user-provided
seed words to extract distinctive and coherent topics so that the mined topics
can better cater to the user's interest. To model the semantic correlation
between words and seeds for discovering topic-indicative terms, existing
seed-guided approaches utilize different types of context signals, such as
document-level word co-occurrences, sliding window-based local contexts, and
generic linguistic knowledge brought by pre-trained language models. In this
work, we analyze and show empirically that each type of context information has
its value and limitation in modeling word semantics under seed guidance, but
combining three types of contexts (i.e., word embeddings learned from local
contexts, pre-trained language model representations obtained from
general-domain training, and topic-indicative sentences retrieved based on seed
information) allows them to complement each other for discovering quality
topics. We propose an iterative framework, SeedTopicMine, which jointly learns
from the three types of contexts and gradually fuses their context signals via
an ensemble ranking process. Under various sets of seeds and on multiple
datasets, SeedTopicMine consistently yields more coherent and accurate topics
than existing seed-guided topic discovery approaches.Comment: 9 pages; Accepted to WSDM 202
- …