research

Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation

Abstract

Evaluating Feature Extraction Methods for Biomedical WSD Clint Cuffy, Sam Henry and Bridget McInnes, PhD Virginia Commonwealth University, Richmond, Virginia, USA Introduction. Biomedical text processing is currently a high active research area but ambiguity is still a barrier to the processing and understanding of these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with using these vectors is noise -- information that is overly general and does not contribute to the word’s representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from high-dimensional space to a space of fewer dimensions. Currently, word embeddings [1] have become an increasingly popular method to reduce the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method. Methods. Context requiring disambiguation consists of an instance of an ambiguous word, and multiple denotative senses. In our method, each word is replaced with its respective word embedding and either summed or averaged to form a single instance vector representation. This also is performed for each sense of an ambiguous word using the sense’s definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between each sense and instance vectors, and assign the instance the sense with the highest value. Evaluation. We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare using two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), and vary the word vector representational lengths, from one-hundred to one-thousand, and compare differences in accuracy. Results. The overall outcome of this method demonstrates fairly high accuracy at disambiguating biomedical instance context from groups of denotative senses. The results showed the Skip-gram model obtained a higher disambiguation accuracy than CBOW but the increase was not significant for all of the datasets. Similarly, vector representations of differing lengths displayed minimal change in results, often differing by mere tenths in percentage. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that have used word embeddings, showing comparable or higher disambiguation accuracy. Conclusion. Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings demonstrates a high accuracy in disambiguating biomedical text while eliminating variations of associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data. [1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems, pp. 3111-3119, 2013.https://scholarscompass.vcu.edu/uresposters/1278/thumbnail.jp

    Similar works