16 research outputs found

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    In everyday medical practice, which involves a great deal of documentation and literature research, the majority of textually encoded information is now available electronically. The development of powerful methods for efficient retrieval is therefore of primary importance. Judged from the perspective of medical sublanguage, common text retrieval systems lack morphological functionality (inflection, derivation, and composition), lexical-semantic functionality, and the ability to analyse large document collections across languages. This dissertation treats the theoretical foundations of the MorphoSaurus system (an acronym for morpheme thesaurus). Its methodological core is a thesaurus organised around morphemes of medical expert and lay language, whose entries are linked across languages by semantic relations. Building on this, a procedure is presented that segments (complex) words into morphemes, which are then replaced by language-independent, concept-class-like symbols. The resulting representation is the basis for cross-language, morpheme-oriented text retrieval. Beyond this core technology, a method for the automatic acquisition of lexicon entries is presented, by which existing morpheme lexicons are extended to further languages. Taking cross-language phenomena into account then leads to a novel procedure for resolving semantic ambiguities. Finally, the performance of morpheme-oriented text retrieval is tested empirically in extensive, standardised evaluations and compared with common approaches
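
    To make the segmentation step concrete, here is a minimal sketch of morpheme-oriented indexing, assuming that greedy longest-match segmentation suffices and that a lexicon maps surface morphemes to language-independent identifiers; the toy lexicon and the '#...' symbols below are invented for illustration and are not taken from the actual MorphoSaurus thesaurus.

        # Toy subword lexicon: surface morpheme -> language-independent symbol.
        LEXICON = {
            "gastr": "#stomach", "hepat": "#liver", "itis": "#inflammation",
            "magen": "#stomach", "leber": "#liver", "entzuendung": "#inflammation",
        }

        def segment(word: str) -> list[str]:
            """Greedy longest-match segmentation of a word into lexicon morphemes."""
            word, out, i = word.lower(), [], 0
            while i < len(word):
                for j in range(len(word), i, -1):   # try the longest substring first
                    if word[i:j] in LEXICON:
                        out.append(LEXICON[word[i:j]])
                        i = j
                        break
                else:
                    i += 1                          # no morpheme starts here; skip
            return out

        # An English and a German surface form index to the same symbols:
        print(segment("gastritis"))          # ['#stomach', '#inflammation']
        print(segment("Magenentzuendung"))   # ['#stomach', '#inflammation']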

    A system for automated lexical mapping

    Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2005. Includes bibliographical references (leaves 19-20). Merging of clinical systems and medical databases, or aggregation of information from disparate databases, frequently requires a process where vocabularies are compared and similar concepts are mapped. Using a normalization phase followed by a novel alignment stage inspired by DNA sequence alignment methods, automated lexical mapping can map terms from various databases to standard vocabularies such as the UMLS (Unified Medical Language System) and SNOMED (the Systematized Nomenclature of Medicine). This automated lexical mapping was evaluated using a real-world database of consultation letters from Children's Hospital Boston. The first phase involved extracting the reason for referral from the consultation letters. The reasons for referral were then mapped to SNOMED. The alignment algorithm was able to map 72% of equivalent concepts through lexical mapping alone. Lexical mapping can facilitate the integration of data from diverse sources and decrease the time and cost required for manual mapping and integration of clinical systems and medical databases. By Jennifer Y. Sun. S.M.
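
    As a rough illustration of the alignment stage, the sketch below scores candidate matches with a global alignment in the style of Needleman-Wunsch, the classic DNA sequence alignment algorithm; the scoring parameters, the toy normalization, and the three-term target vocabulary are assumptions made for the example, not the thesis's actual configuration.

        def align_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
            """Global (Needleman-Wunsch style) alignment score of two strings."""
            prev = [j * gap for j in range(len(b) + 1)]
            for i, ca in enumerate(a, 1):
                cur = [i * gap]
                for j, cb in enumerate(b, 1):
                    cur.append(max(prev[j - 1] + (match if ca == cb else mismatch),
                                   prev[j] + gap,       # gap inserted in b
                                   cur[j - 1] + gap))   # gap inserted in a
                prev = cur
            return prev[-1]

        vocabulary = ["asthma", "gastroesophageal reflux", "otitis media"]
        term = "gastro-esophageal reflux".replace("-", " ")   # toy normalization
        print(max(vocabulary, key=lambda t: align_score(term, t)))
        # -> gastroesophageal reflux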

    Contributions to information extraction for Spanish written biomedical text

    285 p. Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability in two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data nor external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems, and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field
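
    A rough sketch of such knowledge-intensive term identification follows: normalized word n-grams are matched against a terminology lexicon, longest match first, and the associated codes are emitted. The tiny Spanish lexicon and its concept codes are invented for illustration and stand in for the full UMLS Metathesaurus.

        # Hypothetical lexicon: surface term -> concept code (illustrative only).
        LEXICON = {
            "dolor abdominal": "C0000737",
            "fiebre": "C0015967",
            "insuficiencia cardiaca": "C0018801",
        }
        MAX_NGRAM = 3

        def identify_terms(text: str) -> list[tuple[str, str]]:
            """Match normalized word n-grams against the lexicon, longest first."""
            tokens = text.lower().split()
            found, i = [], 0
            while i < len(tokens):
                for n in range(MAX_NGRAM, 0, -1):       # prefer the longest match
                    candidate = " ".join(tokens[i:i + n])
                    if candidate in LEXICON:
                        found.append((candidate, LEXICON[candidate]))
                        i += n
                        break
                else:
                    i += 1                              # no term starts here
            return found

        print(identify_terms("paciente con fiebre y dolor abdominal persistente"))
        # -> [('fiebre', 'C0015967'), ('dolor abdominal', 'C0000737')]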

    Building and Evaluating Open-Vocabulary Language Models

    Language models have always been a fundamental NLP tool and application. This thesis focuses on open-vocabulary language models, i.e., models that can deal with novel and unknown words at runtime. We propose both new ways to construct such models and ways to use them in cross-linguistic evaluations that answer questions about difficulty and language-specificity in modern NLP tools. We start by surveying linguistic background as well as past and present NLP approaches to tokenization and open-vocabulary language modeling (Mielke et al., 2021). Thus equipped, we establish desirable principles for such models, from an engineering mindset as well as a linguistic one, and hypothesize a model based on the marriage of neural language modeling and Bayesian nonparametrics to handle a truly infinite vocabulary, boasting attractive theoretical properties and mathematical soundness, but presenting practical implementation difficulties. As a compromise, we introduce a word-based two-level language model that retains many desirable characteristics while being highly feasible to run (Mielke and Eisner, 2019). Unlike the dominant approaches, which treat characters or subword units as a single level of tokenization, it uses words; its key feature is the ability to generate novel words in context and in isolation. Moving on to evaluation, we ask: how do such models deal with the wide variety of the world's languages, and are they struggling with some of them? Relating this question to a more linguistic one: are some languages inherently more difficult to deal with? Using simple methods, we show that indeed they are, starting with a small pilot study that suggests typological predictors of difficulty (Cotterell et al., 2018). Thus encouraged, we design a far bigger study with more powerful methodology: a principled and highly feasible evaluation and comparison scheme based again on multi-text likelihood (Mielke et al., 2019). This larger study shows that the earlier conclusion about typological predictors is difficult to substantiate, but it also offers new insight into the complexity of Translationese. Following that theme, we end by extending this scheme to machine translation models to answer questions that traditional evaluation metrics like BLEU cannot (Bugliarello et al., 2020)
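
    The multi-text idea can be illustrated with a deliberately tiny stand-in model: because every language's model must encode the same content, total negative log-likelihood becomes comparable across languages. The Laplace-smoothed character-bigram model below is an assumption made for brevity; the thesis compares full open-vocabulary language models, not n-gram models.

        import math
        from collections import Counter

        def bits_to_encode(train: str, test: str) -> float:
            """Total bits a Laplace-smoothed char-bigram model spends on test."""
            bigrams, unigrams = Counter(zip(train, train[1:])), Counter(train)
            vocab = len(set(train)) + 1
            bits = 0.0
            for a, b in zip(test, test[1:]):
                p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
                bits -= math.log2(p)
            return bits

        # The same content in two languages (a multi-text); fewer bits would
        # mean the language is "easier" for this toy model.
        en = "the cat sat on the mat. the dog lay on the rug. "
        de = "die katze sass auf der matte. der hund lag auf dem teppich. "
        print(bits_to_encode(en * 50, en), bits_to_encode(de * 50, de))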

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings
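
    A minimal sketch of the sub-word side of this comparison, assuming gensim's word2vec and its subword-aware fastText counterpart trained on the same corpus; the toy sentences and hyperparameters are illustrative, and the dependency-based embeddings discussed above would additionally require a parsed corpus and a different training setup.

        from gensim.models import FastText, Word2Vec

        sentences = [
            ["low", "resource", "translation", "needs", "good", "embeddings"],
            ["subword", "information", "helps", "embedding", "quality"],
        ]  # in practice: up to 1 million monolingual sentences per language

        w2v = Word2Vec(sentences, vector_size=100, min_count=1)
        ft = FastText(sentences, vector_size=100, min_count=1, min_n=3, max_n=6)

        # fastText composes vectors for unseen words from character n-grams;
        # plain word2vec has no representation for them at all.
        print("translations" in w2v.wv)      # False: out of vocabulary
        print(ft.wv["translations"][:3])     # vector built from subword n-grams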

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies