50 research outputs found

    Numeracy of Language Models: Joint Modelling of Words and Numbers

    Numeracy and literacy are the abilities to understand and work with numbers and words, respectively. While both skills are necessary for reading and writing documents in clinical, scientific, and other technical domains, existing statistical language models focus on words at the expense of numbers: numbers are ignored, masked, or treated similarly to words, which can obscure numerical content and cause sparsity issues, e.g. high out-of-vocabulary rates. In this thesis, we investigate whether the performance of neural language models can be improved by i) considering numerical information as additional inputs and ii) explicitly modelling the output of numerical tokens. In experiments with numbers as input, we find that numerical input features improve perplexity by 33% on a clinical dataset. In assisted text entry and verification tasks, numerical input features improve recall from 25.03% to 71.28% for word prediction with a list of 5 suggestions, keystroke savings from 34.35% to 44.81% for word completion, and the F1 metric by 5 points for semantic error correction. Numerical information from an accompanying knowledge base helps improve performance further. In experiments with numerical tokens as output, we consider different strategies, e.g. memorisation and digit-by-digit composition, and propose a novel neural component based on Gaussian mixture density estimation. We propose the use of regression metrics to evaluate numerical accuracy and an adjusted perplexity metric that accounts for the high out-of-vocabulary rate of numerals. Our evaluation on clinical and scientific datasets shows that perplexity can be improved by more than 2 and 4 orders of magnitude, respectively, by modelling words and numerals with different sub-models through a hierarchical softmax. For the same datasets, our proposed mixture of Gaussians model achieved a 32% and 54% reduction of mean average percentage errors over the contender strategy, digit-by-digit composition. We conclude with a critical reflection on this thesis and suggestions for future work.
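The abstract above mentions modelling words and numerals with different sub-models through a hierarchical softmax. As a rough illustration only, the following sketch shows the general two-level idea: one softmax chooses between the word class and the numeral class, and a separate softmax within each class distributes that class's probability over its tokens. The function names, the two-class layout, and the toy scores are assumptions made for this example and are not taken from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_probs(class_scores, word_scores, numeral_scores):
    """Two-level softmax: p(token) = p(class) * p(token | class).

    class_scores holds two raw scores, [word class, numeral class]
    (a hypothetical layout chosen for this sketch).
    """
    p_word, p_numeral = softmax(class_scores)
    word_probs = [p_word * p for p in softmax(word_scores)]
    numeral_probs = [p_numeral * p for p in softmax(numeral_scores)]
    return word_probs, numeral_probs


# Toy scores: both classes equally likely, uniform within each class.
word_probs, numeral_probs = hierarchical_probs(
    [0.0, 0.0], [1.0, 1.0], [2.0, 2.0, 2.0, 2.0]
)
```

Because each level is a proper distribution, the token probabilities across both classes sum to one, so the two sub-models can be trained and evaluated as a single language model.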

    Racialized Discourse at the Intersection of Meaning, Mind, and Metaphysics

    Racialized discourse is language that transmits potentially harmful representations of racial groups. It is also a tool for maintaining status quo racial hierarchies. A theory of racialized discourse should describe the form and content of these representations, explain how they are transmitted in communication, and explain how their distribution plays a role in sustaining racial hierarchy. I meet these desiderata via an original account of the semantics of racially stereotypical generics (e.g., “Blacks are criminal,” “Muslims are terrorists,” “Immigrants are violent”) and racialized terms deployed in the context of political discourse (e.g., “thug,” “terrorist,” “immigrant,” “criminal,” “welfare”). The core semantic hypothesis is that the standards for the use and meaning of racialized vocabulary shift depending on the racial presentation of the individuals and groups described by that vocabulary. This shows that racial discrimination sometimes has a linguistic basis. Next, drawing on an interdisciplinary set of tools offered by philosophy of language, linguistics, developmental and social psychology, political science, and social ontology, I show that these types of racialized discourse i) essentialize racial groups, ii) indirectly increase tolerance for social hierarchy, and iii) play a role in maintaining racial stratification.

    Compositionality and Concepts in Linguistics and Psychology

    cognitive science; semantics; languag

    Knowledge-driven entity recognition and disambiguation in biomedical text

    Entity recognition and disambiguation (ERD) for the biomedical domain are notoriously difficult problems due to the variety of entities and their often long names in many variations. Existing works focus heavily on the molecular level in two ways. First, they target scientific literature as the input text genre. Second, they target single, highly specialized entity types such as chemicals, genes, and proteins. However, a wealth of biomedical information is also buried in the vast universe of Web content. In order to fully utilize all the information available, there is a need to tap into Web content as an additional input. Moreover, there is a need to cater for other entity types such as symptoms and risk factors since Web content focuses on consumer health. The goal of this thesis is to investigate ERD methods that are applicable to all entity types in scientific literature as well as Web content. In addition, we focus on under-explored aspects of the biomedical ERD problems -- scalability, long noun phrases, and out-of-knowledge base (OOKB) entities. This thesis makes four main contributions, all of which leverage knowledge in UMLS (Unified Medical Language System), the largest and most authoritative knowledge base (KB) of the biomedical domain. The first contribution is a fast dictionary lookup method for entity recognition that maximizes throughput while balancing the loss of precision and recall. The second contribution is a semantic type classification method targeting common words in long noun phrases. We develop a custom set of semantic types to capture word usages; besides biomedical usage, these types also cope with non-biomedical usage and the case of generic, non-informative usage. The third contribution is a fast heuristics method for entity disambiguation in MEDLINE abstracts, again maximizing throughput but this time maintaining accuracy. The fourth contribution is a corpus-driven entity disambiguation method that addresses OOKB entities. 
The method first captures the entities expressed in a corpus as latent representations that comprise in-KB and OOKB entities alike before performing entity disambiguation.