Dealing with Acronyms in Biomedical Texts

Abstract

Recently, there has been growth in the amount of machine-readable information pertaining to the biomedical field. With this growth comes a desire to extract information, answer questions, and perform similar tasks based on the content of these documents. Many of these tasks require sophisticated language processing algorithms, such as part-of-speech tagging, parsing, and semantic interpretation. To use these algorithms, the text must first be cleansed of acronyms, abbreviations, and misspellings. In this paper we look at identifying, expanding, and disambiguating acronyms in biomedical texts. We present an integrated system that combines previously used methods for dealing with acronyms with Natural Language Processing techniques in a new way for a new domain, and that achieves high precision and recall. We break the task into three modular steps: Identification, Expansion, and Disambiguation. During identification, each word is examined to determine whether it is an acronym. For this, we use a hybrid approach composed of a Naive Bayesian classifier and a set of handcrafted rules, which achieves 99.96% accuracy with a small training set. During the expansion step, a list of possible meanings is created for each word determined to be an acronym. We divide expansion into two categories, local and global. For local expansion we use windowing and longest common subsequence to generate the possible expansions. Global expansion requires an acronym database from which the possible expansions are retrieved. The disambiguation step takes the list of possible meanings and determines which one is correct. To disambiguate the candidate expansions we use WordNet and semantic similarity. Overall, we obtain recall and precision of over 91%.

Keywords: Acronyms, Text Cleansing, Bioinformatics
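
The local expansion step described above (windowing plus longest common subsequence) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function names, the window size of twice the acronym length, the comparison against word initials only, and the preference for the shortest matching span are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def local_expansion(tokens, acronym_index, window_factor=2):
    """Look for an expansion in a window of words preceding the acronym.

    Hypothetical heuristic: a span of words is accepted when every letter
    of the acronym appears, in order, among the initials of the span.
    """
    acronym = tokens[acronym_index].strip("()").lower()
    # Assumed window size: twice the acronym length, in words.
    window_size = window_factor * len(acronym)
    start = max(0, acronym_index - window_size)
    window = tokens[start:acronym_index]

    # Grow the candidate span leftwards from the acronym, so the
    # shortest span that covers all acronym letters is returned.
    for begin in range(len(window) - 1, -1, -1):
        candidate = window[begin:]
        initials = "".join(word[0].lower() for word in candidate if word)
        if lcs_length(acronym, initials) == len(acronym):
            return " ".join(candidate)
    return None  # no local expansion found; fall back to global lookup


tokens = ("patients with chronic obstructive pulmonary disease "
          "(COPD) were enrolled").split()
print(local_expansion(tokens, tokens.index("(COPD)")))
```

When no span in the window matches, the sketch returns `None`, which corresponds to the case where a global lookup in an acronym database would be needed instead.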
