5 research outputs found

    Exploiting domain information for Word Sense Disambiguation of medical documents

    OBJECTIVE: Current techniques for knowledge-based Word Sense Disambiguation (WSD) of ambiguous biomedical terms rely on relations in the Unified Medical Language System Metathesaurus but do not take into account the domain of the target documents. The authors' goal is to improve these methods by using information about the topic of the document in which the ambiguous term appears. DESIGN: The authors proposed and implemented several methods to extract lists of key terms associated with Medical Subject Heading terms. These key terms are used to represent the document topic in a knowledge-based WSD system. They are applied both alone and in combination with local context. MEASUREMENTS: A standard measure of accuracy was calculated over the set of target words in the widely used National Library of Medicine WSD dataset. RESULTS AND DISCUSSION: The authors report a significant improvement when combining those key terms with local context, showing that domain information improves the results of a WSD system based on the Unified Medical Language System Metathesaurus alone. The best results were obtained using key terms obtained by relevance feedback and weighted by inverse document frequency.
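
    The abstract mentions weighting key terms by inverse document frequency. As a rough illustration only, the sketch below computes the textbook weight idf(t) = log(N / df(t)) over a toy corpus. The function name `idf_weights` and the example documents are invented for this sketch; the abstract does not specify the exact formula or the relevance-feedback step the authors used.

```python
import math


def idf_weights(documents):
    """Return log(N / df) for every term, where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it occurs there.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy, invented documents (already tokenised); not data from the paper.
docs = [
    ["heart", "disease", "patient"],
    ["heart", "surgery", "patient"],
    ["gene", "expression", "patient"],
]
weights = idf_weights(docs)
```

    In this toy corpus a term that occurs in every document ("patient") gets weight zero, so it contributes nothing to a topic representation, while rarer terms are weighted more heavily.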

    Collocation analysis for UMLS knowledge-based word sense disambiguation

    BACKGROUND: The effectiveness of knowledge-based word sense disambiguation (WSD) approaches depends in part on the information available in the reference knowledge resource. Off the shelf, these resources are not optimized for WSD and might lack terms to model the context properly. In addition, they might include noisy terms which contribute to false positives in the disambiguation results. METHODS: We analyzed some collocation types which could improve the performance of knowledge-based disambiguation methods. Collocations are obtained by extracting candidate collocations from MEDLINE and then assigning them to one of the senses of an ambiguous word. We performed this assignment either using semantic group profiles or a knowledge-based disambiguation method. In addition to collocations, we used second-order features from a previously implemented approach. Specifically, we measured the effect of these collocations in two knowledge-based WSD methods. The first method, AEC, uses the knowledge from the UMLS to collect examples from MEDLINE which are used to train a Naïve Bayes approach. The second method, MRD, builds a profile for each candidate sense based on the UMLS and compares the profile to the context of the ambiguous word. We have used two WSD test sets which contain disambiguation cases which are mapped to UMLS concepts. The first one, the NLM WSD set, was developed manually by several domain experts and contains words with high frequency occurrence in MEDLINE. The second one, the MSH WSD set, was developed automatically using the MeSH indexing in MEDLINE. It contains a larger set of words and covers a larger number of UMLS semantic types. RESULTS: The results indicate an improvement after the use of collocations, although the approaches have different performance depending on the data set. In the NLM WSD set, the improvement is larger for the MRD disambiguation method using second-order features. Assignment of collocations to a candidate sense based on UMLS semantic group profiles is more effective in the AEC method. In the MSH WSD set, the increment in performance is modest for all the methods. Collocations combined with the MRD disambiguation method have the best performance. The MRD disambiguation method and second-order features provide an insignificant change in performance. The AEC disambiguation method gives a modest improvement in performance. Assignment of collocations to a candidate sense based on knowledge-based methods has better performance. CONCLUSIONS: Collocations improve the performance of knowledge-based disambiguation methods, although results vary depending on the test set and method used. Generally, the AEC method is sensitive to query drift. Using AEC, just a few selected terms provide a large improvement in disambiguation performance. The MRD method handles noisy terms better but requires a larger set of terms to improve performance.
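
    The MRD method is described as building a profile per candidate sense and comparing it to the context of the ambiguous word. A minimal sketch of that idea follows, assuming bag-of-words profiles and cosine similarity; the abstract names neither, so both are assumptions. The sense labels and profile terms are invented for illustration and are not taken from the UMLS.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(context_tokens, sense_profiles):
    """Pick the sense whose profile is most similar to the context."""
    context = Counter(context_tokens)
    return max(
        sense_profiles,
        key=lambda sense: cosine(context, Counter(sense_profiles[sense])),
    )


# Hypothetical profiles for the ambiguous word "cold".
profiles = {
    "cold_temperature": ["low", "temperature", "exposure", "freezing"],
    "common_cold": ["virus", "infection", "nasal", "cough"],
}
best = disambiguate("patient with cough and nasal infection".split(), profiles)
```

    Under these assumptions, adding collocations to a profile simply means adding more terms to the corresponding list, which is one way to read the abstract's remark that MRD needs a larger set of terms to benefit.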

    Knowledge-based biomedical word sense disambiguation: comparison of approaches

    BACKGROUND: Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus to be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes the production of training data infeasible to cover all the domain. METHODS: We present research on existing WSD approaches based on knowledge bases, which complement the studies performed on statistical learning. We compare four approaches which rely on the UMLS Metathesaurus as the source of knowledge. The first approach compares the overlap of the context of the ambiguous word to the candidate senses based on a representation built out of the definitions, synonyms and related terms. The second approach collects training data for each of the candidate senses to perform WSD based on queries built using monosemous synonyms and related terms. These queries are used to retrieve MEDLINE citations. Then, a machine learning approach is trained on this corpus. The third approach is a graph-based method which exploits the structure of the Metathesaurus network of relations to perform unsupervised WSD. This approach ranks nodes in the graph according to their relative structural importance. The last approach uses the semantic types assigned to the concepts in the Metathesaurus to perform WSD. The context of the ambiguous word and semantic types of the candidate concepts are mapped to Journal Descriptors. These mappings are compared to decide among the candidate concepts. Results are provided estimating accuracy of the different methods on the WSD test collection available from the NLM. CONCLUSIONS: We have found that the last approach achieves better results compared to the other methods. The graph-based approach, using the structure of the Metathesaurus network to estimate the relevance of the Metathesaurus concepts, does not perform well compared to the first two methods. In addition, the combination of methods improves the performance over the individual approaches. On the other hand, the performance is still below statistical learning trained on manually produced data and below the maximum frequency sense baseline. Finally, we propose several directions to improve the existing methods and to improve the Metathesaurus to be more effective in WSD.
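
    The third approach ranks graph nodes by their relative structural importance. The abstract does not name the ranking algorithm, so the sketch below swaps in plain PageRank by power iteration as one common choice; it should not be read as the authors' method. The node names and edges are invented and stand in for Metathesaurus concepts and relations.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                continue  # this sketch simply drops the mass of dangling nodes
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        rank = updated
    return rank


# Hypothetical undirected graph: two candidate senses linked to context concepts.
graph = {
    "sense_A": ["ctx1", "ctx2", "ctx3"],
    "sense_B": ["ctx3"],
    "ctx1": ["sense_A", "ctx2"],
    "ctx2": ["sense_A", "ctx1"],
    "ctx3": ["sense_A", "sense_B"],
}
ranks = pagerank(graph)
chosen = max(["sense_A", "sense_B"], key=ranks.get)
```

    In this toy graph the candidate sense with more connections to the context concepts receives the higher rank and is selected.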

    Knowledge-driven entity recognition and disambiguation in biomedical text

    Entity recognition and disambiguation (ERD) for the biomedical domain are notoriously difficult problems due to the variety of entities and their often long names in many variations. Existing works focus heavily on the molecular level in two ways. First, they target scientific literature as the input text genre. Second, they target single, highly specialized entity types such as chemicals, genes, and proteins. However, a wealth of biomedical information is also buried in the vast universe of Web content. In order to fully utilize all the information available, there is a need to tap into Web content as an additional input. Moreover, there is a need to cater for other entity types such as symptoms and risk factors since Web content focuses on consumer health. The goal of this thesis is to investigate ERD methods that are applicable to all entity types in scientific literature as well as Web content. In addition, we focus on under-explored aspects of the biomedical ERD problems -- scalability, long noun phrases, and out-of-knowledge base (OOKB) entities. This thesis makes four main contributions, all of which leverage knowledge in UMLS (Unified Medical Language System), the largest and most authoritative knowledge base (KB) of the biomedical domain. The first contribution is a fast dictionary lookup method for entity recognition that maximizes throughput while balancing the loss of precision and recall. The second contribution is a semantic type classification method targeting common words in long noun phrases. We develop a custom set of semantic types to capture word usages; besides biomedical usage, these types also cope with non-biomedical usage and the case of generic, non-informative usage. The third contribution is a fast heuristics method for entity disambiguation in MEDLINE abstracts, again maximizing throughput but this time maintaining accuracy. The fourth contribution is a corpus-driven entity disambiguation method that addresses OOKB entities. 
The method first captures the entities expressed in a corpus as latent representations that comprise in-KB and OOKB entities alike before performing entity disambiguation.