
    Linguistic knowledge-based vocabularies for Neural Machine Translation

    This article has been published in a revised form in Natural Language Engineering, https://doi.org/10.1017/S1351324920000364. This version is free to view and download for private research and study only; not for re-distribution, re-sale or use in derivative works. © Cambridge University Press

    Neural networks applied to machine translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically rich areas, prevents some transfer learning approaches such as cross-lingual pretrained embeddings, and reduces their interpretability. In this work, we propose new hybrid linguistically grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and the word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches on both morphologically rich and morphologically poor languages, showing that, for the former, translation quality on out-of-domain texts improves with respect to a strong subword baseline.

    This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER, EU) and contract PCIN-2017-079 (AEI/MINECO).

    Peer Reviewed. Postprint (author's final draft).
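    As a rough illustration of the trade-off described above (not the paper's method), the following Python sketch segments a word by greedy longest-match over a toy subword vocabulary; both the vocabulary and the segmentation routine are invented for illustration.

```python
# Illustrative sketch only: greedy longest-match segmentation against a toy
# subword vocabulary, to contrast subword pieces with a single word-level token.

def segment(word, subword_vocab):
    """Split `word` into subword pieces using greedy longest-match."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest candidate first
            piece = word[start:end]
            if piece in subword_vocab or end - start == 1:  # single chars always match
                pieces.append(piece)
                start = end
                break
    return pieces

toy_vocab = {"trans", "lat", "ion", "s", "un", "translat"}

print(segment("translations", toy_vocab))  # -> ['translat', 'ion', 's']
# A word-level vocabulary would keep the single token 'translations', which is
# the unit that word-level associations such as cross-lingual pretrained
# embeddings attach to.
```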

    Modeling contextual information in neural machine translation

    Machine translation has achieved impressive translation quality for many language pairs. The improvements over the past few years are largely due to the introduction of neural networks to the field, resulting in the modern sequence-to-sequence neural machine translation (NMT) models. NMT is at the core of many large-scale industrial tools for automatic translation such as Google Translate, Microsoft Translator, Amazon Translate and many others. Current NMT models work on the sentence level, meaning they are used to translate individual sentences. However, for most practical use cases, a user is interested in translating a document. In these cases, an MT tool splits a document into individual sentences and translates them independently. As a result, any dependencies between the sentences are ignored. This is likely to result in an incoherent document translation, mainly because of inconsistent translation of ambiguous source words or wrong translation of anaphoric pronouns. For example, it is undesirable to translate “bank” as a “financial bank” in one sentence and later as a “river bank”. Furthermore, the translation of, e.g., the English third-person pronoun “it” into German depends on the grammatical gender of the English antecedent’s German translation. NMT has shown impressive modeling capabilities, but it is nevertheless unable to model discourse-level phenomena, as it lacks access to contextual information.

    In this work, we study discourse-level phenomena in context-aware NMT. To facilitate the particular studies of interest, we propose several models capable of incorporating contextual information into standard sentence-level NMT models. We focus on several discourse phenomena, namely coreference (anaphora) resolution, coherence and cohesion. We discuss these phenomena in terms of how well they can be modeled by context-aware NMT, how we can improve upon the current state of the art, and the optimal granularity at which they should be modeled. We further investigate domain as a factor in context-aware NMT. Finally, we investigate existing challenge sets for anaphora resolution evaluation and provide a robust alternative.

    We make the following contributions: i) We study the importance of coreference (anaphora) resolution and coherence for context-aware NMT by making use of oracle information specific to these phenomena. ii) We propose a method for improving performance on anaphora resolution based on curriculum learning, which is inspired by the way humans organize learning. iii) We investigate the use of contextual information for better handling of domain information, in particular when modeling multiple domains at once and when applied to zero-resource domains. iv) We present several context-aware models that enable us to examine the specific phenomena of interest. v) We study the optimal way of modeling local and global context and present a model theoretically capable of using very large document context. vi) We study the robustness of challenge sets for evaluating anaphora resolution in MT by means of adversarial attacks and provide a template test set that robustly evaluates specific steps of an idealized coreference resolution pipeline for MT.
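    The contrast between sentence-level and context-aware translation can be sketched as follows; this is a generic concatenation-style recipe for illustration, not one of the thesis' proposed models, and `translate` is a hypothetical stand-in for any NMT system.

```python
from typing import Callable, List

def translate_sentence_level(sentences: List[str],
                             translate: Callable[[str], str]) -> List[str]:
    # Each sentence is translated in isolation, so cross-sentence dependencies
    # (anaphoric pronouns, consistent word senses) are invisible to the model.
    return [translate(s) for s in sentences]

def translate_with_prev_context(sentences: List[str],
                                translate: Callable[[str], str],
                                sep: str = " <CTX> ") -> List[str]:
    # A simple context-aware recipe: prepend the previous source sentence,
    # marked off by a separator token, before translating the current one.
    out = []
    for i, sent in enumerate(sentences):
        context = sentences[i - 1] + sep if i > 0 else ""
        out.append(translate(context + sent))
    return out
```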

    Revisiting Recognizing Textual Entailment for Evaluating Natural Language Processing Systems

    Recognizing Textual Entailment (RTE) began as a unified framework to evaluate the reasoning capabilities of Natural Language Processing (NLP) models. In recent years, RTE has evolved in the NLP community into a task that researchers focus on developing models for. This thesis revisits the tradition of RTE as an evaluation framework for NLP models, especially in the era of deep learning. Chapter 2 provides an overview of different approaches to evaluating NLP systems, discusses prior RTE datasets, and argues why many of them do not serve as satisfactory tests to evaluate the reasoning capabilities of NLP systems. Chapter 3 presents a new large-scale diverse collection of RTE datasets (DNC) that tests how well NLP systems capture a range of semantic phenomena that are integral to understanding human language. Chapter 4 demonstrates how the DNC can be used to evaluate the reasoning capabilities of NLP models. Chapter 5 discusses the limits of RTE as an evaluation framework by illuminating how existing datasets contain biases that may enable crude modeling approaches to perform surprisingly well. The remaining chapters of the thesis focus on issues raised in Chapter 5. Chapter 6 addresses issues in prior RTE datasets focused on paraphrasing and presents a high-quality test set that can be used to analyze how robust RTE systems are to paraphrases. Chapter 7 demonstrates how modeling approaches that target biases, e.g. adversarial learning, can enable RTE models to overcome the biases discussed in Chapter 5. Chapter 8 applies these methods to the task of discovering emergency needs during disaster events.
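    For concreteness, RTE-style evaluation reduces to scoring a model's entailment predictions on labelled premise-hypothesis pairs, as in the minimal sketch below; the examples and the `predict` stub are invented, not taken from the DNC.

```python
# Minimal sketch of RTE-style evaluation with invented examples.
examples = [
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A person is playing an instrument.", "label": "entailed"},
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A woman is sleeping.", "label": "not-entailed"},
]

def predict(premise: str, hypothesis: str) -> str:
    # Stand-in for a system under evaluation. A crude lexical-overlap heuristic
    # like this one is exactly the kind of model that dataset biases can reward.
    overlap = set(premise.lower().split()) & set(hypothesis.lower().split())
    return "entailed" if len(overlap) >= 2 else "not-entailed"

correct = sum(predict(ex["premise"], ex["hypothesis"]) == ex["label"]
              for ex in examples)
print(f"accuracy: {correct / len(examples):.2f}")
```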

    LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training

    Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes adoption in real-world applications with limited data challenging. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seed logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.

    Comment: Work in progress.
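    A high-level sketch of the few-shot self-training loop described above might look like the following; `train`, `generate_logical_form` and `is_consistent` are hypothetical stand-ins rather than LOGEN's actual components.

```python
def self_train(seed_pairs, unlabeled_texts, train, generate_logical_form,
               is_consistent, rounds: int = 3):
    """Few-shot self-training sketch: seed_pairs are (logical form, text) pairs."""
    labeled = list(seed_pairs)
    model = train(labeled)                           # fit on the few seed pairs
    for _ in range(rounds):
        pseudo = []
        for text in unlabeled_texts:
            lf = generate_logical_form(model, text)  # sample a pseudo logical form
            if is_consistent(lf, text):              # content/structure consistency filter
                pseudo.append((lf, text))
        model = train(labeled + pseudo)              # retrain on seeds + pseudo-labeled data
    return model
```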