30 research outputs found
Word and Document Embedding with vMF-Mixture Priors on Context Word Vectors
Word embedding models typically learn two types of vectors: target word vectors and context word vectors. These vectors are normally learned such that they are predictive of some word co-occurrence statistic, but they are otherwise unconstrained. However, the words of a given language can be organized into various natural groupings, such as syntactic word classes (e.g. nouns, adjectives, verbs) and semantic themes (e.g. sports, politics, sentiment). Our hypothesis in this paper is that embedding models can be improved by explicitly imposing a cluster structure on the set of context word vectors. To this end, our model relies on the assumption that context word vectors are drawn from a mixture of von Mises-Fisher (vMF) distributions, where the parameters of this mixture distribution are jointly optimized with the word vectors. We show that this results in word vectors which are qualitatively different from those obtained with existing word embedding models. We furthermore show that our embedding model can also be used to learn high-quality document representations.
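The mixture assumption can be illustrated with a small sketch. Assuming, purely for illustration, that all components share a single concentration parameter kappa, the vMF normalizing constants cancel and the posterior responsibilities reduce to a softmax over scaled dot products; the function names and toy vectors below are our own, not the paper's:

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def vmf_responsibilities(X, mus, kappa, weights):
    """Soft-assign unit vectors X (n, d) to vMF mixture components.

    With a shared concentration kappa, the normalizing constants cancel,
    so responsibilities are a softmax over kappa * <x, mu_k> plus log
    mixture weights.
    """
    logits = kappa * X @ mus.T + np.log(weights)  # (n, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

# Toy example: two well-separated mean directions on the sphere.
mus = normalize(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
X = normalize(np.array([[0.9, 0.1, 0.0], [0.1, 0.95, 0.0]]))
R = vmf_responsibilities(X, mus, kappa=10.0, weights=np.array([0.5, 0.5]))
```

In a full model the mean directions, concentrations, and weights would be optimized jointly with the word vectors rather than fixed as here.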
Weakly-Supervised Neural Text Classification
Deep neural networks are gaining increasing popularity for the classic text
classification task, due to their strong expressive power and reduced need for
feature engineering. Despite this attractiveness, neural text
classification models suffer from the lack of training data in many real-world
applications. Although many semi-supervised and weakly-supervised text
classification models exist, they cannot be easily applied to deep neural
models and support only limited types of supervision. In this paper, we
propose a weakly-supervised method that addresses the lack of training data in
neural text classification. Our method consists of two modules: (1) a
pseudo-document generator that leverages seed information to generate
pseudo-labeled documents for model pre-training, and (2) a self-training module
that bootstraps on real unlabeled data for model refinement. Our method has the
flexibility to handle different types of weak supervision and can be easily
integrated into existing deep neural models for text classification. We have
performed extensive experiments on three real-world datasets from different
domains. The results demonstrate that our proposed method achieves strong
performance without requiring excessive training data and significantly
outperforms baseline methods.
Comment: CIKM 2018 Full Paper
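The self-training module described above can be sketched as a generic bootstrap loop. This is a minimal illustration with an invented nearest-centroid "model" standing in for the neural classifier; the names, confidence threshold, and toy data are our assumptions, not the paper's implementation:

```python
import numpy as np

def fit_centroids(X, y):
    """Toy 'model': one centroid per class."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict_proba(centroids, X):
    """Softmax over negative Euclidean distances to each centroid."""
    d = -np.linalg.norm(X[:, None, :] - centroids[None], axis=-1)
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_train(X_seed, y_seed, X_unlabeled, threshold=0.9, rounds=3):
    """Bootstrap from seed-labeled data: repeatedly add unlabeled
    examples whose predicted class probability exceeds `threshold`,
    then refit on the enlarged pseudo-labeled set."""
    X, y = X_seed, y_seed
    for _ in range(rounds):
        centroids = fit_centroids(X, y)
        proba = predict_proba(centroids, X_unlabeled)
        keep = proba.max(axis=1) >= threshold
        if not keep.any():
            break
        X = np.vstack([X_seed, X_unlabeled[keep]])
        y = np.concatenate([y_seed, proba[keep].argmax(axis=1)])
    return fit_centroids(X, y)

# Two seed examples plus unlabeled points from two clusters.
X_seed = np.array([[0.0, 0.0], [5.0, 5.0]])
y_seed = np.array([0, 1])
X_unl = np.array([[0.5, 0.0], [0.0, 0.5], [4.5, 5.0], [5.0, 4.5]])
centroids = self_train(X_seed, y_seed, X_unl)
```

In the paper's setting the seed data would be the generated pseudo-documents and the model a deep text classifier, but the loop structure is the same.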
Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts
Instead of mining coherent topics from a given text corpus in a completely
unsupervised manner, seed-guided topic discovery methods leverage user-provided
seed words to extract distinctive and coherent topics so that the mined topics
can better cater to the user's interest. To model the semantic correlation
between words and seeds for discovering topic-indicative terms, existing
seed-guided approaches utilize different types of context signals, such as
document-level word co-occurrences, sliding window-based local contexts, and
generic linguistic knowledge brought by pre-trained language models. In this
work, we analyze and show empirically that each type of context information has
its value and limitation in modeling word semantics under seed guidance, but
combining three types of contexts (i.e., word embeddings learned from local
contexts, pre-trained language model representations obtained from
general-domain training, and topic-indicative sentences retrieved based on seed
information) allows them to complement each other for discovering quality
topics. We propose an iterative framework, SeedTopicMine, which jointly learns
from the three types of contexts and gradually fuses their context signals via
an ensemble ranking process. Under various sets of seeds and on multiple
datasets, SeedTopicMine consistently yields more coherent and accurate topics
than existing seed-guided topic discovery approaches.
Comment: 9 pages; Accepted to WSDM 2023
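The idea of fusing ranked candidate terms from several context signals can be illustrated with reciprocal rank fusion, a standard list-fusion heuristic. This is our own stand-in, not SeedTopicMine's actual ensemble ranking procedure, and the term lists are toy examples:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked candidate lists (one per context signal): each list
    contributes 1 / (k + rank) to a term's score, so terms ranked high
    by several signals rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking):
            scores[term] = scores.get(term, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidate terms for the seed "sports" from three signals.
emb_terms = ["nba", "basketball", "soccer", "tennis"]   # local embeddings
plm_terms = ["basketball", "nba", "hockey"]             # PLM representations
sent_terms = ["basketball", "playoffs", "nba"]          # retrieved sentences
fused = reciprocal_rank_fusion([emb_terms, plm_terms, sent_terms])
```

A term endorsed by all three signals ("basketball" here) outranks one that appears in only a single list, which is the complementarity the abstract argues for.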
A neural generative model for joint learning topics and topic-specific word embeddings
We propose a novel generative model that explores both local and global context for the joint learning of topics and topic-specific word embeddings. In particular, we assume that global latent topics are shared across documents; a word is generated by a hidden semantic vector encoding its contextual semantic meaning; and its context words are generated conditional on both the hidden semantic vector and the global latent topics. Topics are trained jointly with the word embeddings. The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy. Experimental results show that the proposed model outperforms word-level embedding methods in both word similarity evaluation and word sense disambiguation. Furthermore, the model also extracts more coherent topics than existing neural topic models and other models for the joint learning of topics and word embeddings. Finally, the model can be easily integrated with existing deep contextualized word embedding learning methods to further improve the performance of downstream tasks such as sentiment classification.
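The claim that topic-dependent embeddings address polysemy can be made concrete with a minimal sketch: given per-topic sense vectors for an ambiguous word, the context selects the sense. The vectors and function below are hypothetical illustrations, not the paper's model:

```python
import numpy as np

# Hypothetical topic-specific vectors for the word "bank":
# row 0 = finance-topic sense, row 1 = river-topic sense.
bank_senses = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

def disambiguate(sense_vecs, context_vec):
    """Choose the sense whose topic-specific vector is most similar
    (by dot product) to the context representation."""
    return int((sense_vecs @ context_vec).argmax())

money_context = np.array([0.8, 0.2])   # context about loans, interest
river_context = np.array([0.1, 0.7])   # context about water, shore
```

A single word-level vector would have to average these two senses; the topic-dependent representation keeps them apart.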
Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Mining a set of meaningful topics organized into a hierarchy is intuitively
appealing since topic correlations are ubiquitous in massive text corpora. To
account for potential hierarchical topic structures, hierarchical topic models
generalize flat topic models by incorporating latent topic hierarchies into
their generative modeling process. However, due to their purely unsupervised
nature, the learned topic hierarchy often deviates from users' particular needs
or interests. To guide the hierarchical topic discovery process with minimal
user supervision, we propose a new task, Hierarchical Topic Mining, which takes
a category tree described by category names only, and aims to mine a set of
representative terms for each category from a text corpus to help users
comprehend their topics of interest. We develop a novel joint tree and text
embedding method along with a principled optimization procedure that allows
simultaneous modeling of the category tree structure and the corpus generative
process in the spherical space for effective category-representative term
discovery. Our comprehensive experiments show that our model, named JoSH, mines
a high-quality set of hierarchical topics with high efficiency and benefits
weakly-supervised hierarchical text classification tasks.
Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH)
Text mining with word embedding for outlier and sentiment analysis
Today's technology makes it unprecedentedly easy to collect and store massive text data in various domains such as online social networks, medical records, and news reports. In contrast to the gigantic volume of text data, human capability to read and process it is limited. Hence, there is an emerging demand for automatic text mining tools to analyze massive text data.
Word embedding is an emerging text analysis technique that leverages fine-grained statistics of context information to map each word to a vector in an embedding space that reflects the semantic proximity between words. Embedding techniques not only enrich the statistical signals available to downstream text mining applications, but also make it possible to characterize and represent higher-level objects in the embedding space, such as sentences, documents, or topics.
This study integrates word embedding techniques into a series of text mining approaches and models. The general idea is to treat a text object such as a document or a sentence as a bag of embedding vectors and characterize its distribution in the embedding space. Specifically, this study focuses on two tasks: outlier analysis and weakly-supervised sentiment analysis.
Outlier analysis aims to identify documents that topically deviate from the majority of a given corpus. We develop an unsupervised generative model to identify frequent and representative semantic regions in the word embedding space to represent the given corpus. Then we propose a novel outlierness measure to identify outlier documents. We also study the cost-sensitive scenario of outlier analysis.
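One simple instance of such an outlierness measure, sketched under the assumption of unit-normalized vectors and not the thesis's actual measure, scores a document by the mean cosine distance of its word vectors to their nearest semantic-region centroid:

```python
import numpy as np

def unit(v):
    """Normalize rows onto the unit sphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def outlierness(doc_vecs, centroids):
    """Mean cosine distance from each word vector in a document to its
    nearest semantic-region centroid; higher means the document deviates
    more from the regions that represent the corpus."""
    sims = doc_vecs @ centroids.T          # cosine, since rows are unit
    return float(np.mean(1.0 - sims.max(axis=1)))

# Two semantic regions; one on-topic document, one off-topic document.
centroids = unit(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
inlier_doc = unit(np.array([[0.95, 0.05, 0.0], [0.9, 0.1, 0.0]]))
outlier_doc = unit(np.array([[0.0, 0.1, 0.99], [0.05, 0.0, 1.0]]))
```

The off-topic document's word vectors sit far from every region, so its score is much larger than the on-topic document's.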
Sentiment analysis typically identifies the subjective opinion (e.g., positive vs. negative) in a piece of text. Although it has been extensively studied as a supervised learning task, we tackle the problem in a weakly-supervised fashion, where users only provide a small set of seed words as guidance. We study how to identify aspects and the corresponding sentiments at both the document and sentence levels.
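A minimal sketch of seed-guided polarity scoring, offered as our own illustration rather than the thesis's model: compare a sentence representation to the centroids of user-provided positive and negative seed-word vectors.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def seed_sentiment(sent_vec, pos_seeds, neg_seeds):
    """Polarity score: similarity to the positive-seed centroid minus
    similarity to the negative-seed centroid (> 0 means positive)."""
    return cos(sent_vec, pos_seeds.mean(axis=0)) - cos(sent_vec, neg_seeds.mean(axis=0))

# Hypothetical embeddings of seed words, e.g. "great"/"good" vs "bad"/"awful".
pos_seeds = np.array([[1.0, 0.1], [0.9, 0.0]])
neg_seeds = np.array([[-1.0, 0.1], [-0.9, 0.0]])
score = seed_sentiment(np.array([0.8, 0.2]), pos_seeds, neg_seeds)
```

Only the handful of seed vectors is supervised; every other sentence is scored by its position in the embedding space relative to the two centroids.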
Geometric Inference in Bayesian Hierarchical Models with Applications to Topic Modeling
Unstructured data is available in abundance with the rapidly growing size of digital information. Labeling such data is expensive and impractical, making unsupervised learning an increasingly important field. Big data collections often have rich latent structure that the statistical modeler is challenged to uncover. The graphical model formalism has been prominent in developing various procedures for inference in Bayesian models; however, the corresponding computational limits often fall behind the demands of modern data sizes. In this thesis we develop new approaches for scalable approximate Bayesian inference. In particular, our approaches are driven by the analysis of the latent geometric structures induced by the models.
Our specific contributions include the following. We develop a full geometric recipe for the Latent Dirichlet Allocation topic model. Next, we study several approaches for exploiting the latent geometry: first, a fast weighted clustering procedure augmented with geometric corrections for topic inference, and then a nonparametric approach based on the analysis of the concentration of mass and the angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Estimates produced by our methods are shown to be statistically consistent under some conditions. Finally, we develop a series of models for the temporal dynamics of the latent geometric structures, where inference can be performed in an online and distributed fashion. All our algorithms are evaluated with extensive experiments on simulated and real datasets, culminating in a method several orders of magnitude faster than existing state-of-the-art topic modeling approaches, as demonstrated by experiments working with several million documents in a dozen minutes.
PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/146051/1/moonfolk_1.pd
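The topic simplex can be made concrete with a toy example, which is our own illustration and not the thesis's inference procedure: each document's word distribution is a convex combination of the topic vertices, so with the vertices known, the mixing proportions are recovered by solving a linear system.

```python
import numpy as np

# Two topics over a 3-word vocabulary: the rows are the simplex vertices.
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.2, 0.7]])
# Per-document topic proportions (points on the probability simplex).
props = np.array([[0.5, 0.5],
                  [0.9, 0.1]])
# Documents lie inside the convex hull of the topic vertices.
docs = props @ topics

# Recover the proportions by solving topics.T @ p = doc for each doc
# (least squares; exact here because the docs lie in the topics' span).
recovered = np.linalg.lstsq(topics.T, docs.T, rcond=None)[0].T
```

Geometric topic inference runs this logic in reverse: the vertices themselves are unknown and are estimated from the cloud of document points, which is where the clustering and angular-geometry analysis above comes in.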