Using Word Embeddings in Twitter Election Classification
Word embeddings and convolutional neural networks (CNN)
have attracted extensive attention in various classification
tasks for Twitter, e.g. sentiment classification. However,
the effect of the configuration used to train and generate
the word embeddings on the classification performance has
not been studied in the existing literature. In this paper,
using a Twitter election classification task that aims to detect
election-related tweets, we investigate the impact of
the background dataset used to train the embedding models,
the context window size and the dimensionality of word
embeddings on the classification performance. By comparing
the classification results of two word embedding models,
which are trained using different background corpora
(e.g. Wikipedia articles and Twitter microposts), we show
that the background data type should align with the Twitter
classification dataset to achieve better performance. Moreover,
by evaluating the results of word embedding models
trained using various context window sizes and dimensionalities,
we find that larger context windows and dimension
sizes are preferable for improving performance. Our experimental
results also show that using word embeddings and
CNN leads to statistically significant improvements over various
baselines such as random, SVM with TF-IDF and SVM
with word embeddings.
MorphTE: Injecting Morphology in Tensorized Embeddings
In the era of deep learning, word embeddings are essential when dealing with
text tasks. However, storing and accessing these embeddings requires a large
amount of space. This is not conducive to the deployment of these models on
resource-limited devices. Leveraging the powerful compression capability of
tensor products, we propose a word embedding compression method with
morphological augmentation, Morphologically-enhanced Tensorized Embeddings
(MorphTE). A word consists of one or more morphemes, the smallest units that
bear meaning or have a grammatical function. MorphTE represents a word
embedding as an entangled form of its morpheme vectors via the tensor product,
which injects prior semantic and grammatical knowledge into the learning of
embeddings. Furthermore, the dimensionality of the morpheme vector and the
number of morphemes are much smaller than those of words, which greatly reduces
the parameters of the word embeddings. We conduct experiments on tasks such as
machine translation and question answering. Experimental results on four
translation datasets of different languages show that MorphTE can compress word
embedding parameters by about 20 times without performance loss and
significantly outperforms related embedding compression methods.
Comment: 20 pages, 6 figures, 18 tables. Published at NeurIPS 202
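The idea of building a word embedding from morpheme vectors via the tensor product can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a word vector is the sum, over a small number of rank components, of the Kronecker product of the word's morpheme vectors, and that every word has the same number of morphemes. The names `kron`, `word_embedding` and `tables`, and all numbers, are hypothetical.

```python
from functools import reduce


def kron(u, v):
    """Kronecker (tensor) product of two flat vectors."""
    return [a * b for a in u for b in v]


def word_embedding(morphemes, tables):
    """Sum over rank components of the tensor product of morpheme vectors.

    tables: one dict per rank component, mapping morpheme -> vector
    (an assumed layout, chosen only for illustration).
    """
    total = None
    for table in tables:
        term = reduce(kron, (table[m] for m in morphemes))
        total = term if total is None else [x + y for x, y in zip(total, term)]
    return total


# Toy, made-up morpheme vectors of dimension q = 2 with a single rank component.
tables = [{"un": [1.0, 2.0], "kind": [0.5, -1.0], "ly": [2.0, 0.0]}]
vec = word_embedding(["un", "kind", "ly"], tables)

# Three morphemes of dimension 2 yield a word vector of dimension 2**3 = 8,
# while only 3 * 2 = 6 morpheme parameters are stored.
print(len(vec))
```

In this toy setting the stored parameters grow with the number of morphemes times the small morpheme dimension, while the word vector dimension grows as a power of it, which is the source of the compression the abstract describes.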
Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation
Evaluating Feature Extraction Methods for Biomedical WSD
Clint Cuffy, Sam Henry and Bridget McInnes, PhD
Virginia Commonwealth University, Richmond, Virginia, USA
Introduction. Biomedical text processing is currently a highly active research area, but ambiguity is still a barrier to the processing and understanding of these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with using these vectors is noise -- information that is overly general and does not contribute to the word’s representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from a high-dimensional space to a space of fewer dimensions. Recently, word embeddings [1] have become an increasingly popular method to reduce the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method.
Methods. A context requiring disambiguation consists of an instance of an ambiguous word and multiple denotative senses. In our method, each word in the instance context is replaced with its word embedding, and the embeddings are either summed or averaged to form a single instance vector. The same is done for each sense of the ambiguous word, using the sense’s definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between each sense vector and the instance vector, and assign the instance the sense with the highest similarity.
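The procedure described in Methods can be sketched in a few lines of Python. This is an illustrative sketch only, not the poster's code: the two-dimensional embeddings, the sense labels and the definitions below are made up, whereas the actual method uses embeddings trained on Medline and sense definitions from the UMLS. The function names `embed`, `cosine` and `disambiguate` are hypothetical.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embed(text, embeddings):
    """Average the embeddings of the known words in a text."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in text")
    return average(vectors)


def disambiguate(context, sense_definitions, embeddings):
    """Return the sense whose definition vector is closest to the context."""
    instance = embed(context, embeddings)
    scores = {
        sense: cosine(instance, embed(definition, embeddings))
        for sense, definition in sense_definitions.items()
    }
    return max(scores, key=scores.get)


# Toy, made-up 2-d embeddings; the ambiguous word "cold" has no vector
# here and is simply skipped when building the instance vector.
embeddings = {
    "patient": [1.0, 0.0], "fever": [0.9, 0.1], "virus": [1.0, 0.2],
    "infection": [0.9, 0.0], "weather": [0.0, 1.0],
    "temperature": [0.1, 0.9], "low": [0.2, 0.8],
}
senses = {  # hypothetical sense labels and definitions
    "common_cold": "virus infection",
    "cold_temperature": "low temperature weather",
}
best = disambiguate("patient fever cold virus", senses, embeddings)
print(best)
```

Averaging is used here; swapping `average` for an element-wise sum gives the summed variant and leaves the cosine ranking unchanged, since cosine similarity is insensitive to vector scale.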
Evaluation. We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), and vary the word vector length from 100 to 1,000 dimensions, comparing the resulting differences in accuracy.
Results. Overall, the method achieves fairly high accuracy at disambiguating biomedical instance contexts among groups of denotative senses. The Skip-gram model obtained higher disambiguation accuracy than CBOW, but the increase was not significant for all of the datasets. Similarly, vector representations of differing lengths produced minimal change in results, often differing by mere tenths of a percentage point. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that have used word embeddings, and our method showed comparable or higher disambiguation accuracy.
Conclusion. Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings demonstrates high accuracy in disambiguating biomedical text while eliminating variations of associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data.
[1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.