
    One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis

    When learning a new skill, you take advantage of your preexisting skills and knowledge. For instance, if you are a skilled violinist, you will likely have an easier time learning to play the cello. Similarly, when learning a new language you take advantage of the languages you already speak. For instance, if your native language is Norwegian and you decide to learn Dutch, the lexical overlap between these two languages will likely benefit your rate of language acquisition. This thesis deals with the intersection of learning multiple tasks and learning multiple languages in the context of Natural Language Processing (NLP), which can be defined as the study of the computational processing of human language. Although these two types of learning may seem different on the surface, we will see that they share many similarities. The traditional approach in NLP is to consider a single task for a single language at a time. However, recent advances allow for broadening this approach by considering data for multiple tasks and languages simultaneously. This is an important approach to explore further, as the key to improving the reliability of NLP, especially for low-resource languages, is to take advantage of all relevant data whenever possible. In doing so, the hope is that in the long term, low-resource languages can benefit from the advances made in NLP, which are currently to a large extent reserved for high-resource languages. This, in turn, may have positive consequences for, e.g., language preservation, as speakers of minority languages will be under less pressure to use high-resource languages. In the short term, answering the specific research questions posed should be of use to NLP researchers working towards the same goal.
    Comment: PhD thesis, University of Groningen
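
    The central idea here, sharing most parameters across tasks (and languages) while keeping small task-specific output layers, can be made concrete with a short sketch. Below is a minimal, generic illustration in PyTorch, with made-up vocabulary sizes and task names; it is not the architecture used in the thesis.

```python
# A shared encoder with one output head per lexical task: all tasks (and,
# with a shared vocabulary, all languages) reuse the same encoder weights.
import torch
import torch.nn as nn

class MultitaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, task_label_counts):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared BiLSTM encoder: its parameters are updated by every task.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # One small classification head per task (e.g. POS, morphology).
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in task_label_counts.items()
        })

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # per-token label scores

# Hypothetical sizes and tasks, for illustration only.
model = MultitaskTagger(vocab_size=10000, emb_dim=64, hidden_dim=128,
                        task_label_counts={"pos": 17, "morph": 40})
tokens = torch.randint(0, 10000, (2, 6))  # toy batch: 2 sentences, 6 tokens
pos_scores = model(tokens, task="pos")    # shape: (2, 6, 17)
```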

    Almanac: Retrieval-Augmented Language Models for Clinical Medicine

    Large language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130), evaluated by a panel of 5 board-certified and resident physicians, demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.
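
    As a rough sketch of the retrieve-then-generate pattern described above: fetch the guideline passages most relevant to a query, then instruct the model to answer only from them. The toy corpus, the TF-IDF retriever, and the commented-out llm_generate call are illustrative assumptions, not Almanac's actual components.

```python
# Minimal retrieval-augmented prompting sketch (toy data, TF-IDF retrieval).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guidelines = [  # placeholder stand-ins for curated clinical guidelines
    "Anticoagulation in atrial fibrillation should be guided by stroke risk.",
    "First-line treatment for uncomplicated hypertension is lifestyle change.",
    "Empiric antibiotics for community-acquired pneumonia depend on severity.",
]
vectorizer = TfidfVectorizer().fit(guidelines)
doc_vectors = vectorizer.transform(guidelines)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank guideline passages by cosine similarity to the query.
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [guidelines[i] for i in scores.argsort()[::-1][:k]]

query = "How should new-onset atrial fibrillation be anticoagulated?"
context = "\n".join(retrieve(query))
prompt = ("Using only the references below, answer the question. "
          "If they are insufficient, say so.\n\n"
          f"References:\n{context}\n\nQuestion: {query}\nAnswer:")
# response = llm_generate(prompt)  # hypothetical call to the language model
```

    Grounding the prompt in retrieved, curated text is what pushes the model toward factual, source-backed answers rather than free-form generation.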

    On semantic differences: a multivariate corpus-based study of the semantic field of inchoativity in translated and non-translated Dutch

    This dissertation places the study of semantic differences between translated and non-translated text at the centre of its concerns. To date, much research in Corpus-based Translation Studies has focused on lexical and grammatical phenomena in an attempt to reveal presumed general tendencies of translation. On the semantic level, these general tendencies have rarely been investigated. Therefore, the goal of this study is to explore whether universal tendencies of translation also exist on the semantic level, thereby connecting the framework of translation universals to semantics.

    Induction de lexiques bilingues à partir de corpus comparables et parallèles [Induction of Bilingual Lexicons from Comparable and Parallel Corpora]

    Statistical models try to generalize knowledge from the frequency of probabilistic events in the data. If more data is available, events are observed more often and the models perform better. Natural Language Processing approaches based on these models are therefore dependent on the quantity and availability of resources, so there is a permanent need for generating and updating the learning data. This dependency particularly affects Statistical Machine Translation, which additionally requires multilingual resources. This thesis presents four articles on two tasks that contribute significantly to this dependency: Bilingual Document Alignment (BDA) and Bilingual Lexicon Induction (BLI). The first publication describes the system submitted to the BDA shared task of the WMT16 conference. Built on a search engine, our system indexes bilingual web sites and tries to identify the English-French pages that are translations of each other. The alignment is performed using a bag-of-words representation and a bilingual lexicon. The tool we developed allowed us to evaluate more than 1,000 configurations and to identify one that yields respectable performance on the task. The three other articles concern the BLI task. The first revisits the so-called standard approach and proposes a broad exploration of its parameters in the context of the Semantic Web. The second article compares the standard approach with more recent techniques based on cross-lingual word representations (embeddings) produced by neural networks. The last contribution reports improved overall performance on the task, obtained by combining the outputs of the two types of approaches through supervised reranking.
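
    The embeddings-based approach mentioned above can be sketched briefly: once word vectors of the two languages live in a shared space, a lexicon is induced by nearest-neighbour search. The tiny near-identity vectors below are illustrative stand-ins for real mapped embeddings.

```python
# Nearest-neighbour Bilingual Lexicon Induction over a shared embedding space.
import numpy as np

en_words = ["dog", "house", "water"]
fr_words = ["chien", "maison", "eau"]
# Hypothetical shared-space embeddings; rows correspond to the words above.
rng = np.random.default_rng(0)
en_vecs = np.eye(3) + 0.05 * rng.random((3, 3))
fr_vecs = np.eye(3) + 0.05 * rng.random((3, 3))
en_vecs /= np.linalg.norm(en_vecs, axis=1, keepdims=True)
fr_vecs /= np.linalg.norm(fr_vecs, axis=1, keepdims=True)

def induce_lexicon(src_vecs, tgt_vecs, src_words, tgt_words):
    sims = src_vecs @ tgt_vecs.T  # cosine similarity on unit vectors
    return {src_words[i]: tgt_words[int(sims[i].argmax())]
            for i in range(len(src_words))}

print(induce_lexicon(en_vecs, fr_vecs, en_words, fr_words))
# -> {'dog': 'chien', 'house': 'maison', 'water': 'eau'}
```

    A supervised reranker, as in the last article, would then rescore the top candidate translations using signals from both the standard and the embedding-based approaches.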

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
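
    The role of sub-word information can be illustrated with gensim, contrasting plain word2vec with fastText's character n-gram vectors. This is a generic, toy-data sketch; the paper's dependency-based embeddings additionally require a dependency-parsed corpus and are not reproduced here.

```python
# word2vec vs. sub-word-aware fastText embeddings on a toy corpus.
from gensim.models import Word2Vec, FastText

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["a", "dog", "sat", "on", "a", "rug"]]

# Plain word2vec: every word is an atomic symbol.
w2v = Word2Vec(sentences, vector_size=32, min_count=1, epochs=10)

# fastText: a word vector is built from its character n-grams, so rare
# and unseen words still receive informative representations.
ft = FastText(sentences, vector_size=32, min_count=1, epochs=10,
              min_n=3, max_n=5)

print(ft.wv["cats"][:4])  # works although "cats" never occurred in training
# w2v.wv["cats"] would raise a KeyError instead.
```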

    Unsupervised multilingual learning

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 241-254).
    For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost-language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language.
    By Benjamin Snyder, Ph.D.

    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. Two themes pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
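
    To make the reranking family concrete, here is a short sketch using the Hugging Face transformers library with one publicly available cross-encoder checkpoint; any comparable model would do, and dense retrieval (encoding queries and documents separately) is not shown.

```python
# Cross-encoder reranking: score each (query, passage) pair jointly.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

query = "what causes rainbows"
candidates = ["Rainbows are caused by refraction of light in water droplets.",
              "The stock market closed higher today."]

inputs = tokenizer([query] * len(candidates), candidates,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per pair

ranked = [c for _, c in sorted(zip(scores.tolist(), candidates), reverse=True)]
print(ranked[0])  # the refraction passage should rank first
```

    In a multi-stage architecture, a cheap first-stage retriever (e.g. BM25) would supply the candidate list, and the cross-encoder would rerank only the top few hundred hits, trading effectiveness against query latency.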