1,768 research outputs found
Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation
Clint Cuffy, Sam Henry and Bridget McInnes, PhD
Virginia Commonwealth University, Richmond, Virginia, USA
Introduction. Biomedical text processing is currently a highly active research area, but ambiguity is still a barrier to the processing and understanding of these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with using these vectors is noise -- information that is overly general and does not contribute to the word’s representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from a high-dimensional space to a space of fewer dimensions. Recently, word embeddings [1] have become an increasingly popular method to reduce the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method.
Methods. A context requiring disambiguation consists of an instance of an ambiguous word and multiple denotative senses. In our method, each word in the context is replaced with its respective word embedding, and the embeddings are either summed or averaged to form a single instance vector representation. The same is done for each sense of the ambiguous word, using the sense’s definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between each sense vector and the instance vector, and assign the instance the sense with the highest value.
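The procedure described in the Methods paragraph can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`combine`, `embed`, `disambiguate`), the two-dimensional toy embeddings, and the sense labels and definitions are all hypothetical stand-ins, assumed here in place of embeddings trained on Medline and definitions taken from the UMLS.

```python
import math

# Hypothetical 2-dimensional toy embeddings; real ones would be trained on a corpus.
TOY_EMBEDDINGS = {
    "cold": [0.5, 0.5],
    "virus": [0.9, 0.1],
    "infection": [0.8, 0.2],
    "nasal": [0.9, 0.2],
    "temperature": [0.1, 0.9],
    "low": [0.2, 0.8],
    "weather": [0.1, 0.8],
}

# Hypothetical sense labels and definitions (not real UMLS entries).
TOY_SENSES = {
    "common_cold": "virus infection nasal",
    "low_temperature": "low temperature weather",
}


def combine(vectors, average=True):
    """Sum equal-length vectors component-wise, optionally averaging."""
    total = [sum(components) for components in zip(*vectors)]
    if average:
        return [x / len(vectors) for x in total]
    return total


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(text, embeddings, average=True):
    """Map whitespace-separated words to vectors and combine them.

    Words without an embedding are skipped; returns None if none are known.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    return combine(vectors, average)


def disambiguate(context, senses, embeddings, average=True):
    """Return the sense label whose definition vector is closest to the context."""
    instance = embed(context, embeddings, average)
    if instance is None:
        return None
    best_sense, best_score = None, float("-inf")
    for sense, definition in senses.items():
        sense_vector = embed(definition, embeddings, average)
        if sense_vector is None:
            continue
        score = cosine(instance, sense_vector)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


if __name__ == "__main__":
    print(disambiguate("patient has cold virus infection", TOY_SENSES, TOY_EMBEDDINGS))
    print(disambiguate("cold weather low temperature", TOY_SENSES, TOY_EMBEDDINGS))
```

Because cosine similarity ignores vector length, summing and averaging give the same ranking for a single instance in this sketch; the `average` flag is kept only to mirror the two options named in the text.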
Evaluation. We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), vary the word vector length from one hundred to one thousand dimensions, and compare the resulting differences in accuracy.
Results. Overall, the method achieves fairly high accuracy when disambiguating biomedical instance contexts among groups of denotative senses. The Skip-gram model obtained a higher disambiguation accuracy than CBOW, but the increase was not significant for all of the datasets. Similarly, vector representations of differing lengths produced minimal change in results, often differing by mere tenths of a percentage point. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that have used word embeddings, and found comparable or higher disambiguation accuracy.
Conclusion. Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings demonstrates a high accuracy in disambiguating biomedical text while eliminating variations of associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data.
[1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
BIOMEDICAL WORD SENSE DISAMBIGUATION WITH NEURAL WORD AND CONCEPT EMBEDDINGS
Addressing ambiguity issues is an important step in natural language processing (NLP) pipelines designed for information extraction and knowledge discovery. This problem is also common in biomedicine, where NLP applications have become indispensable for exploiting latent information from biomedical literature and clinical narratives from electronic medical records. In this thesis, we propose an ensemble model that employs recent advances in neural word embeddings along with knowledge-based approaches to build a biomedical word sense disambiguation (WSD) system. Specifically, our system identifies the correct sense from a given set of candidates for each ambiguous word when presented in its context (surrounding words). We use the MSH WSD dataset, a well-known public dataset consisting of 203 ambiguous terms, each with nearly 200 different instances and an average of two candidate senses represented by concepts in the Unified Medical Language System (UMLS). We employ a popular biomedical concept, Our linear time (in terms of number of senses and context length) unsupervised and knowledge-based approach improves over the state-of-the-art methods by over 3% in accuracy. A more expensive approach based on the k-nearest neighbor framework improves over prior best results by 5% in accuracy. Our results demonstrate that recent advances in neural dense word vector representations offer excellent potential for solving biomedical WSD.
Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
Interpretability of a predictive model is a powerful feature that gains the
trust of users in the correctness of the predictions. In word sense
disambiguation (WSD), knowledge-based systems tend to be much more
interpretable than knowledge-free counterparts as they rely on the wealth of
manually-encoded elements representing word senses, such as hypernyms, usage
examples, and images. We present a WSD system that bridges the gap between
these two so far disconnected groups of methods. Namely, our system, providing
access to several state-of-the-art WSD models, aims to be interpretable as a
knowledge-based system while it remains completely unsupervised and
knowledge-free. The presented tool features a Web interface for all-word
disambiguation of texts that makes the sense predictions human readable by
providing interpretable word sense inventories, sense representations, and
disambiguation results. We provide a public API, enabling seamless integration.
Comment: In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2017). 2017. Copenhagen, Denmark. Association for
Computational Linguistics
Embedding Words and Senses Together via Joint Knowledge-Enhanced Training
Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.
- …