2,046 research outputs found

    Benefits of data augmentation for NMT-based text normalization of user-generated content

    One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words, which makes UGC considerably harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms and is often used as a pre-processing step for conventional NLP tasks in order to overcome the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation approach to text normalization. Training such an encoder-decoder model requires large parallel corpora of sentence pairs. However, obtaining large data sets of UGC and its normalized version is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start from a publicly available tiny parallel Dutch data set comprising three UGC genres and compare two approaches. The first is to manually normalize and add training data, a costly and time-consuming task. The second is a set of data augmentation techniques that increase the data size by converting existing resources into synthesized non-standard forms. Our results reveal that a combination of both approaches leads to the best results.
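    As a purely illustrative aside (not the techniques evaluated in the paper), the kind of augmentation described above can be sketched as a noising function that turns clean, canonical sentences into synthetic non-standard variants, yielding (noisy, clean) training pairs for an encoder-decoder normalization model. The specific perturbations and the toy abbreviation table below are assumptions made only for this sketch.

# Illustrative sketch only: synthesize (non-standard, canonical) sentence pairs from
# clean text so they can serve as extra training data for a normalization model.
# The perturbations below (vowel dropping, adjacent-character swaps, a toy Dutch
# abbreviation table) are assumptions for this sketch, not the paper's techniques.
import random

ABBREVIATIONS = {"niet": "nie", "even": "ff", "weer": "wr"}  # hypothetical shortenings

def drop_vowels(word: str) -> str:
    """Drop word-internal vowels, a common UGC shortening pattern."""
    if len(word) <= 3:
        return word
    return word[0] + "".join(c for c in word[1:-1] if c not in "aeiou") + word[-1]

def swap_chars(word: str) -> str:
    """Swap two adjacent characters to mimic a typo."""
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def synthesize_pair(sentence: str, noise_prob: float = 0.3):
    """Return (synthetic non-standard form, canonical form) for one clean sentence."""
    noisy = []
    for word in sentence.split():
        if random.random() > noise_prob:
            noisy.append(word)                                 # keep the word unchanged
        elif word in ABBREVIATIONS:
            noisy.append(ABBREVIATIONS[word])                  # lexical substitution
        else:
            noisy.append(random.choice([drop_vowels, swap_chars])(word))
    return " ".join(noisy), sentence

if __name__ == "__main__":
    random.seed(0)
    clean = "ik ben even niet thuis vanavond"
    print(synthesize_pair(clean))  # e.g. a pair like ('ik ben ff nie ths vanavond', clean)

    Pairs produced this way would then be mixed with the genuinely annotated UGC data when training the normalization model, which is the combination the abstract reports as working best.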

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language that is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult for neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known to the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding the potential of context in neural models and design augmentation models that benefit from it. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalizing from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models.
    Comment: PhD dissertation defended on November 10th, 202