
    Achieving Human Parity on Automatic Chinese to English News Translation

    Machine translation has made rapid advances in recent years. Millions of people use it today in online translation systems and mobile applications to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft's machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state of the art, and that its translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.
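
    To make the parity criterion concrete: one common operationalization is statistical, declaring parity when human raters score machine output and professional human translations with no significant difference. The following is a minimal, hypothetical sketch of such a test via paired bootstrap resampling; the function and the example ratings are illustrative assumptions, not the paper's evaluation protocol or data.

```python
# Hypothetical sketch: "human parity" as the absence of a statistically
# significant difference between paired human adequacy ratings of machine
# and professional human translations of the same segments.
import random

def paired_bootstrap_pvalue(machine_scores, human_scores, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean difference in paired adequacy ratings."""
    assert len(machine_scores) == len(human_scores)
    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(machine_scores, human_scores)]
    observed = sum(diffs) / len(diffs)
    # Center the differences at zero to simulate the null (no difference).
    centered = [d - observed for d in diffs]
    hits = 0
    for _ in range(n_resamples):
        sample = [rng.choice(centered) for _ in centered]
        if abs(sum(sample) / len(sample)) >= abs(observed):
            hits += 1
    return hits / n_resamples

# Illustrative segment-level ratings (placeholders, not data from the paper):
machine = [78.2, 81.5, 69.0, 74.3, 80.1]
human   = [79.0, 80.2, 70.5, 73.8, 81.0]
p = paired_bootstrap_pvalue(machine, human)
print(f"p = {p:.3f}  (large p => parity hypothesis not rejected)")
```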

    Multi-Source Neural Machine Translation with Data Augmentation

    Multi-source translation systems translate from multiple languages into a single target language. By using information from these multiple sources, such systems achieve large gains in accuracy. Training them requires corpora with parallel text in multiple source languages and the target language. In practice, however, such corpora are rarely complete, due to the difficulty of providing human translations in all of the relevant languages. In this paper, we propose a data augmentation approach that fills in such incomplete parts using multi-source neural machine translation (NMT). In our experiments, results varied over different language combinations, but significant gains were observed when using a source language similar to the target language. Comment: 15th International Workshop on Spoken Language Translation, 2018.
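
    As a rough illustration of the corpus-completion idea (not the authors' implementation), the sketch below fills the missing source languages of a multi-source corpus by translating from whichever language is available for each sentence; `translate` is a placeholder for a trained NMT system.

```python
from typing import Callable, Dict, List, Optional

def fill_missing_sources(
    corpus: List[Dict[str, Optional[str]]],      # per sentence: lang -> text or None
    source_langs: List[str],
    target_lang: str,
    translate: Callable[[str, str, str], str],   # (text, from_lang, to_lang) -> text
) -> List[Dict[str, str]]:
    """Complete a multi-source corpus by translating into the missing languages."""
    completed = []
    for entry in corpus:
        filled = dict(entry)
        for lang in source_langs:
            if not filled.get(lang):
                # Translate from any language present for this sentence; the
                # paper itself uses a multi-source NMT model for this step.
                donor = next(l for l in [target_lang] + source_langs if filled.get(l))
                filled[lang] = translate(filled[donor], donor, lang)
        completed.append(filled)
    return completed
```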

    Learning to Represent Bilingual Dictionaries

    Bilingual word embeddings have been widely used to capture the similarity of lexical semantics across human languages. However, many applications, such as cross-lingual semantic search and question answering, can benefit greatly from cross-lingual correspondence between sentences and lexicons. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map literal word definitions to their cross-lingual target words, for which we explore different sentence encoding techniques. To enhance learning on limited resources, our model adopts several critical learning strategies, including multi-task learning across different language bridges and joint learning of the dictionary model with a bilingual word embedding model. The experimental evaluation focuses on two applications. The results on the cross-lingual reverse dictionary retrieval task show our model's promising ability to comprehend bilingual concepts from descriptions, and highlight the effectiveness of the proposed learning strategies in improving performance. Meanwhile, our model effectively addresses the bilingual paraphrase identification problem and significantly outperforms previous approaches. Comment: CoNLL 2019.
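
    A minimal sketch of the definition-to-word mapping idea, assuming a GRU sentence encoder and a cosine objective; the paper explores several encoders, and the dimensions, names, and objective here are illustrative assumptions, not the paper's architecture.

```python
# Sketch: encode a word definition in language A and train the encoding to
# land near the pre-trained cross-lingual embedding of the word in language B.
import torch
import torch.nn as nn

class DefinitionEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)

    def forward(self, definition_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.embed(definition_ids))
        return h[-1]  # one vector per definition in the batch

encoder = DefinitionEncoder(vocab_size=10_000)
definitions = torch.randint(0, 10_000, (8, 12))  # toy batch of token ids
target_vecs = torch.randn(8, 300)                # bilingual word embeddings
# Pull each encoded definition toward its cross-lingual target word.
loss = 1 - nn.functional.cosine_similarity(encoder(definitions), target_vecs).mean()
loss.backward()
```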

    Syntax-aware Data Augmentation for Neural Machine Translation

    Data augmentation is an effective way to improve neural machine translation (NMT) by generating additional bilingual training data. In this paper, we propose a novel data augmentation strategy for NMT. Unlike existing data augmentation methods, which choose the words to modify with the same probability across different sentences, we set a sentence-specific probability for word selection that takes each word's role in the sentence into account. We use the dependency parse tree of the input sentence as an effective clue to determine the selection probability of every word in each sentence. Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets. The results of extensive experiments show that our syntax-aware data augmentation method can effectively boost existing sentence-independent methods, yielding significant improvements in translation performance.
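
    A hedged sketch of the sentence-specific selection idea: given a dependency parse, words deeper in the tree receive a higher probability of being modified (here, replaced by a placeholder token). The exact scoring function in the paper may differ; this only illustrates how a parse can induce per-word probabilities.

```python
import random
from typing import List

def depths_from_heads(heads: List[int]) -> List[int]:
    """heads[i] is the index of token i's head; the root token has head -1."""
    def depth(i: int) -> int:
        return 0 if heads[i] == -1 else 1 + depth(heads[i])
    return [depth(i) for i in range(len(heads))]

def augment(tokens: List[str], heads: List[int], budget: float = 0.15,
            placeholder: str = "<blank>", seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    d = depths_from_heads(heads)
    total = sum(d) or 1
    # Selection probability proportional to dependency depth, scaled so that
    # roughly budget * len(tokens) words are modified per sentence on average.
    probs = [budget * len(tokens) * di / total for di in d]
    return [placeholder if rng.random() < p else t
            for t, p in zip(tokens, probs)]

# "The cat sat on the mat": root is "sat".
tokens = ["The", "cat", "sat", "on", "the", "mat"]
heads  = [1, 2, -1, 2, 5, 3]
print(augment(tokens, heads))
```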

    A Comprehensive Survey of Grammar Error Correction

    Grammar error correction (GEC) is an important application area of natural language processing. The past decade has witnessed significant progress in GEC, owing to the increasing popularity of machine learning and deep learning, especially in the late 2010s, when near human-level GEC systems became available. However, no prior work has offered a complete recapitulation of this progress. We present the first survey of GEC, a comprehensive retrospective of the literature in this area. We first introduce five public datasets, data annotation schemas, two important shared tasks, and four standard evaluation metrics. More importantly, we discuss four kinds of basic approaches (statistical machine translation based, neural machine translation based, classification based, and language model based), six commonly applied performance-boosting techniques for GEC systems, and two data augmentation methods. Since GEC is typically viewed as a sister task of machine translation, many GEC systems are based on neural machine translation (NMT) approaches, where the neural sequence-to-sequence model is applied. Similarly, some performance-boosting techniques are adapted from machine translation and have been successfully combined with GEC systems to enhance final performance. Furthermore, we analyze the basic approaches, performance-boosting techniques, and integrated GEC systems on the basis of their reported experimental results, to draw out clearer patterns and conclusions. Finally, we discuss five prospective directions for future GEC research.

    Iterative Batch Back-Translation for Neural Machine Translation: A Conceptual Model

    An effective method for generating large numbers of parallel sentences to train improved neural machine translation (NMT) systems is back-translation of target-side monolingual data. Recently, iterative back-translation has been shown to outperform standard back-translation, albeit only on some language pairs. This work proposes iterative batch back-translation, which aims to enhance standard iterative back-translation and enable the efficient utilization of more monolingual data. After each iteration, improved back-translations of new sentences are added to the parallel data that will be used to train the final forward model. The work presents a conceptual model of the proposed approach. Comment: This article was a proposal and a conceptual model, and thereby substantially overlaps with arXiv:1912.10514. This research has been substantially reworked; some of the findings are presented in arXiv:1912.10514, arXiv:2006.02876, and arXiv:2011.07403. The final work will be submitted for publication in due course.
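
    The conceptual model can be summarized in a short loop. In the sketch below, `train` and `translate` stand in for full NMT training and decoding pipelines (assumptions, not a real API); each iteration back-translates a fresh batch of monolingual text with the current models and retrains on the accumulated synthetic data.

```python
def iterative_batch_back_translation(parallel, mono_src_batches, mono_tgt_batches,
                                     train, translate):
    """Conceptual loop only; `parallel` is a list of (src, tgt) pairs."""
    forward = train("src->tgt", parallel)
    backward = train("tgt->src", [(t, s) for s, t in parallel])
    syn_forward, syn_backward = [], []           # accumulated synthetic pairs
    for src_batch, tgt_batch in zip(mono_src_batches, mono_tgt_batches):
        # Back-translate a fresh batch with the current, already-improved models.
        syn_forward += [(translate(backward, t), t) for t in tgt_batch]
        syn_backward += [(translate(forward, s), s) for s in src_batch]
        # Retrain both directions on real + accumulated synthetic data so the
        # next batch receives better back-translations.
        forward = train("src->tgt", parallel + syn_forward)
        backward = train("tgt->src", [(t, s) for s, t in parallel] + syn_backward)
    return forward   # final forward model, trained on all accumulated batches
```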

    A Survey on Low-Resource Neural Machine Translation

    Neural approaches have achieved state-of-the-art accuracy in machine translation but suffer from the high cost of collecting large-scale parallel data. Thus, a lot of research has been conducted on neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting. In this paper, we provide a survey of low-resource NMT and classify related work into three categories according to the auxiliary data used: (1) exploiting monolingual data of the source and/or target languages, (2) exploiting data from auxiliary languages, and (3) exploiting multi-modal data. We hope that our survey helps researchers better understand this field and inspires them to design better algorithms, and helps industry practitioners choose appropriate algorithms for their applications. Comment: A short version was submitted to the IJCAI 2021 Survey Track on Feb. 26, 2021, and accepted on Apr. 16, 2021. 14 pages, 4 figures.

    Towards Better Chinese-centric Neural Machine Translation for Low-resource Languages

    The last decade has witnessed enormous improvements in science and technology, stimulating a growing demand for economic and cultural exchange among countries. Building neural machine translation (NMT) systems has therefore become an urgent need, especially in low-resource settings. However, recent work tends to study low-resource NMT systems centered on English, while few works focus on low-resource NMT centered on other languages such as Chinese. To address this, the low-resource multilingual translation challenge of the 2021 iFLYTEK AI Developer Competition provides Chinese-centric multilingual low-resource NMT tasks, in which participants are required to build NMT systems from the provided low-resource samples. In this paper, we present the winning system, which leverages data enhancement based on monolingual word embeddings, bilingual curriculum learning, and contrastive re-ranking. In addition, a new Incomplete-Trust (In-trust) loss function is proposed to replace the traditional cross-entropy loss during training. The experimental results demonstrate that these ideas lead to better performance than other state-of-the-art methods. All experimental code is released at: https://github.com/WENGSYX/Low-resource-text-translation. Comment: 7 pages, 4 figures, 4 tables.
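
    The abstract does not give the In-trust formula, so the sketch below is only a hedged illustration of the general idea: interpolate cross-entropy with a term that partially trusts the model's own predictions, a common recipe for training robustly on noisy low-resource data. The weights alpha, beta, and delta are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def in_trust_like_loss(logits, labels, alpha=1.0, beta=0.8, delta=0.5):
    """Illustrative noise-robust loss; NOT the paper's exact In-trust definition."""
    ce = F.cross_entropy(logits, labels)
    p = F.softmax(logits, dim=-1)
    q = F.one_hot(labels, num_classes=logits.size(-1)).float()
    # Mix the (possibly noisy) label with the model's own prediction before
    # scoring, so hard or mislabeled examples contribute a softer gradient.
    dce = -(p * torch.log(delta * p + (1 - delta) * q + 1e-8)).sum(-1).mean()
    return alpha * ce + beta * dce

logits = torch.randn(4, 10, requires_grad=True)   # toy batch, 10-way vocabulary
labels = torch.randint(0, 10, (4,))
in_trust_like_loss(logits, labels).backward()
```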

    Language Model-Driven Unsupervised Neural Machine Translation

    Unsupervised neural machine translation (NMT) is associated with noise and errors in the synthetic data produced by vanilla back-translation. Here, we explicitly exploit a language model (LM) to drive the construction of an unsupervised NMT system. This involves two steps. First, we initialize the NMT models with synthetic data generated via temporary statistical machine translation (SMT). Second, unlike vanilla back-translation, we formulate a weight function that scores the synthetic data at each step of the subsequent iterative training; this guides the unsupervised training toward an improved outcome. We present the detailed mathematical construction of our method. Experiments on the WMT2014 English-French and WMT2016 English-German and English-Russian translation tasks show that our method outperforms the best prior systems by more than 3 BLEU points. Comment: 11 pages, 3 figures, 7 tables.
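
    A hedged sketch of the LM-as-critic step: each synthetic pair is weighted by a language-model score of its synthetic side, so noisier back-translations contribute less to training. The weight function here (length-normalized log-probability squashed into (0, 1]) is an illustrative choice, not the paper's exact formulation.

```python
import math
from typing import Callable, List, Tuple

def weight_synthetic_pairs(
    pairs: List[Tuple[str, str]],            # (synthetic_src, real_tgt)
    lm_logprob: Callable[[str], float],      # total log-probability of a sentence
) -> List[Tuple[str, str, float]]:
    """Attach an LM-derived training weight to each synthetic pair."""
    weighted = []
    for src, tgt in pairs:
        n_tokens = max(len(src.split()), 1)
        avg = lm_logprob(src) / n_tokens      # length-normalized LM score
        weight = math.exp(avg)                # in (0, 1] since log-probs <= 0
        weighted.append((src, tgt, weight))
    return weighted

# During iterative training, each example's loss would then be scaled by
# its weight, down-weighting noisy back-translations.
```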

    Enhanced back-translation for low resource neural machine translation using self-training

    Improving neural machine translation (NMT) models with back-translations of monolingual target data (synthetic parallel data) is currently the state-of-the-art approach for training improved translation systems. The quality of the backward system, which is trained on the available parallel data and used for the back-translation, has been shown in many studies to affect the performance of the final NMT model. In low-resource conditions, the available parallel data is usually not enough to train a backward model that can produce the high-quality synthetic data needed to train a standard translation model. This work proposes a self-training strategy in which the output of the backward model is used to improve the model itself through the forward translation technique. The technique was shown to improve baseline low-resource IWSLT'14 English-German and IWSLT'15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU points, respectively. The synthetic data generated by the improved English-German backward model was used to train a forward model that outperformed another forward model trained using standard back-translation by 2.7 BLEU. Comment: 17 pages, 3 figures, 5 tables. Accepted for publication at the International Conference on Information and Communication Technology and Applications (ICTA 2020).
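
    A conceptual sketch of the self-training step described above: the backward model translates target-side monolingual text, its own outputs are paired with the inputs as extra training data, and the improved backward model then produces the final back-translations. `train` and `translate` are placeholders for full NMT pipelines, not a concrete API.

```python
def self_improve_backward(parallel_tgt_src, mono_tgt, train, translate):
    """parallel_tgt_src: list of (tgt, src) pairs for the backward direction."""
    backward = train(parallel_tgt_src)                     # initial tgt -> src model
    # Forward translation with the backward model itself: pair each target
    # monolingual sentence with the model's own (synthetic) source output.
    self_pairs = [(t, translate(backward, t)) for t in mono_tgt]
    improved = train(parallel_tgt_src + self_pairs)        # self-training step
    return improved   # used to generate the final back-translations
```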