
    SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression

    Neural sequence-to-sequence models are currently the dominant approach in several natural language processing tasks, but require large parallel corpora. We present a sequence-to-sequence-to-sequence autoencoder (SEQ^3), consisting of two chained encoder-decoder pairs, with words used as a sequence of discrete latent variables. We apply the proposed model to unsupervised abstractive sentence compression, where the first and last sequences are the input and reconstructed sentences, respectively, while the middle sequence is the compressed sentence. Constraining the length of the latent word sequences forces the model to distill important information from the input. A pretrained language model, acting as a prior over the latent sequences, encourages the compressed sentences to be human-readable. Continuous relaxations enable us to sample from categorical distributions, allowing gradient-based optimization, unlike alternatives that rely on reinforcement learning. The proposed model does not require parallel text-summary pairs, achieving promising results in unsupervised sentence compression on benchmark datasets. (Accepted to NAACL 2019.)
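
    The "continuous relaxations" mentioned in the abstract are commonly implemented with a straight-through Gumbel-softmax over the compressor's vocabulary distribution. The sketch below illustrates that general mechanism in PyTorch; it is an assumed illustration of the idea, not the authors' exact implementation, and the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def relax_latent_words(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable 'sampling' of one latent word per compressed position.

    logits: (batch, compressed_len, vocab_size) scores from the compressor decoder.
    Returns near-one-hot vectors: discrete in the forward pass (hard=True), while
    gradients flow back to the logits through the Gumbel-softmax relaxation.
    """
    return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)

# The relaxed one-hot vectors can be matrix-multiplied with a shared embedding
# table to obtain embeddings of the compressed sentence, from which the second
# encoder-decoder reconstructs the original input, e.g.:
#   latent_embeddings = relax_latent_words(logits) @ embedding.weight
```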

    Bridging the data gap in neural machine translation

    Neural machine translation (NMT) has completely revolutionized the field, leading to many breakthroughs and significantly improving translation quality. Despite these advancements, a common limitation of existing NMT architectures is that they rely heavily on large amounts of high-quality parallel corpora. However, this requirement is met by only a few high-resource languages, whereas sufficient parallel data is scarce for most of the world's languages. This thesis proposes solutions to this challenge by exploiting two alternative data sources: monolingual data and parallel data from other (related) languages.

    The first half of the thesis explores how monolingual data can compensate for the lack of parallel data in two distinct ways. We first explore how to effectively exploit the knowledge of language models (LMs) trained on target-side monolingual data. We propose a method that uses an LM as a prior that simultaneously mitigates overfitting and distills the knowledge of the LM into the NMT model. This is achieved by adding a regularization term, which pushes the output distributions of the NMT model to be probable under the LM prior. This improves low-resource translation and outperforms related LM-fusion methods. Next, inspired by advancements in transfer learning, we study how to effectively use monolingual data by pretraining the entire NMT model. We focus on the role of different denoising autoencoding (DAE) objectives and explore noising methods that create samples resembling real sentences. Our analysis reveals that different objectives produce models that encode and use information differently, and our experiments show strong variation in unsupervised NMT, unlike in semi-supervised and supervised NMT.

    The next part of the thesis focuses on exploiting related parallel data via multilingual machine translation (MMT). Initially, we investigate how to efficiently balance the trade-off between transfer and interference in MMT. Instead of increasing model capacity, which incurs a large computational cost, or using separate language-specific parameters, which prevent cross-lingual transfer, we achieve the best of both by incorporating language-specific layers generated from a language-aware hyper-network (sketched below). Then, we unify all our previous efforts and study how to optimally combine monolingual and related parallel data in MMT. Motivated by promising and conflicting results in the literature, we systematically analyze jointly training MMT with DAE or back-translation (BT). Using a comprehensive evaluation across monolingual splits and multilingual test sets, we discover that all methods are surprisingly brittle to domain mismatches. We also analyze the role of model scale (from 90M to 1.6B parameters) and find it critical for effectively using monolingual data, capable of completely changing the ranking across models, with surprisingly strong effects on DAE.

    The goal of this thesis is to contribute both new methods and new insights. One half presents novel methods for exploiting data sources beyond the parallel corpora of a given language pair, by addressing the limitations of existing methods. The other half presents systematic analyses of how state-of-the-art methods work, using comprehensive evaluation with controlled experiments, with the aim of advancing our understanding of these methods and driving future research.
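
    The language-aware hyper-network mentioned above can be illustrated with a small sketch: a shared network maps a learned language embedding to the weights of a per-language bottleneck layer, adding language-specific capacity without storing separate parameters for every language. The module below is a minimal, assumed formulation; the dimensions, placement, and names are illustrative rather than the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Minimal language-aware hyper-network: generates the weights of a small
    bottleneck adapter from a learned language embedding."""

    def __init__(self, d_model: int, d_bottleneck: int, d_lang: int, num_langs: int):
        super().__init__()
        self.d_model, self.d_bottleneck = d_model, d_bottleneck
        self.lang_emb = nn.Embedding(num_langs, d_lang)
        # hyper-network heads: language embedding -> adapter parameters
        self.to_down = nn.Linear(d_lang, d_model * d_bottleneck)
        self.to_up = nn.Linear(d_lang, d_bottleneck * d_model)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); lang_id: (batch,) integer language ids
        z = self.lang_emb(lang_id)  # (batch, d_lang)
        w_down = self.to_down(z).view(-1, self.d_model, self.d_bottleneck)
        w_up = self.to_up(z).view(-1, self.d_bottleneck, self.d_model)
        h = torch.relu(torch.bmm(x, w_down))  # language-specific down-projection
        return x + torch.bmm(h, w_up)         # residual language-specific up-projection

# Usage sketch: apply to encoder states with per-example language ids, e.g.
#   adapter = HyperAdapter(d_model=512, d_bottleneck=64, d_lang=32, num_langs=100)
#   states = adapter(encoder_states, lang_ids)
```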

    When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale

    Multilingual machine translation (MMT), trained on a mixture of parallel and monolingual data, is key for improving translation in low-resource language pairs. However, the literature offers conflicting results on the performance of different methods of including monolingual data. To resolve this, we examine how denoising autoencoding (DAE) and back-translation (BT) impact MMT under different data conditions and model scales. Unlike prior studies, we use a realistic dataset of 100 translation directions and consider many domain combinations of monolingual and test data. We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales. BT is beneficial when the parallel, monolingual, and test data sources are similar but can be detrimental otherwise, while DAE is less effective than previously reported. Next, we analyze the impact of scale (from 90M to 1.6B parameters) and find it is important for both methods, particularly DAE. As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M to converging with BT performance at 1.6B, and even surpassing it in low-resource settings. These results offer new insights into how to best use monolingual data in MMT. (Work in progress.)
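
    Back-translation, one of the two methods analysed above, augments the parallel data with synthetic pairs obtained by translating target-side monolingual text with a reverse-direction model. The helper below sketches that loop; `reverse_translate` is a hypothetical callable standing in for whatever reverse model is available, not a specific API from the paper.

```python
from typing import Callable, List, Tuple

def back_translate(
    monolingual_tgt: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Create synthetic (source, target) pairs from target-side monolingual text.

    The synthetic source comes from a target->source model; the real target
    sentence is kept as the reference side during training.
    """
    return [(reverse_translate(tgt), tgt) for tgt in monolingual_tgt]

# Training then mixes real and synthetic pairs, e.g.:
#   train_pairs = real_parallel + back_translate(mono_tgt, reverse_model_fn)
```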

    Language Model Prior for Low-Resource Neural Machine Translation

    The scarcity of large parallel corpora is an important obstacle for neural machine translation. A common solution is to exploit the knowledge of language models (LMs) trained on abundant monolingual data. In this work, we propose a novel approach to incorporate an LM as a prior in a neural translation model (TM). Specifically, we add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior, while avoiding wrong predictions when the TM "disagrees" with the LM. This objective relates to knowledge distillation, where the LM can be viewed as teaching the TM about the target language. The proposed approach does not compromise decoding speed, because the LM is used only at training time, unlike previous work that requires it during inference. We present an analysis of the effects that different methods have on the distributions of the TM. Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
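
    The regularization term described above is naturally written as a divergence between the TM's output distribution and the LM prior, added to the usual translation loss. The sketch below shows one plausible formulation in PyTorch; the KL direction, the absence of a temperature, and the weight `lam` are assumptions for illustration, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def nmt_loss_with_lm_prior(
    tm_logits: torch.Tensor,   # (batch, seq_len, vocab) translation model scores
    lm_logits: torch.Tensor,   # (batch, seq_len, vocab) scores from a frozen LM
    targets: torch.Tensor,     # (batch, seq_len) gold target token ids
    lam: float = 0.5,
) -> torch.Tensor:
    vocab = tm_logits.size(-1)
    # standard translation cross-entropy on the gold targets
    ce = F.cross_entropy(tm_logits.view(-1, vocab), targets.view(-1))
    # regularizer: push the TM's output distribution towards the LM prior,
    # i.e. KL(p_TM || p_LM); the LM is only needed at training time
    tm_probs = F.softmax(tm_logits, dim=-1)
    lm_log_probs = F.log_softmax(lm_logits, dim=-1).detach()
    kl = F.kl_div(lm_log_probs, tm_probs, reduction="batchmean")
    return ce + lam * kl
```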