Achieving Human Parity on Automatic Chinese to English News Translation
Machine translation has made rapid advances in recent years. Millions of
people are using it today in online translation systems and mobile applications
in order to communicate across language barriers. The question naturally arises
whether such systems can approach or achieve parity with human translations. In
this paper, we first address the problem of how to define and accurately
measure human parity in translation. We then describe Microsoft's machine
translation system and measure the quality of its translations on the widely
used WMT 2017 news translation task from Chinese to English. We find that our
latest neural machine translation system has reached a new state-of-the-art,
and that the translation quality is at human parity when compared to
professional human translations. We also find that it significantly exceeds the
quality of crowd-sourced non-professional translations.
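The parity question above reduces to a statistical test: machine output is at human parity when human raters do not score it significantly lower than professional translations. A minimal sketch of such a paired significance test follows; the per-segment ratings and the 0-100 scale are illustrative assumptions, not the paper's data:

```python
import random
import statistics

def paired_randomization_test(human, machine, trials=2000, seed=0):
    """Two-sided paired randomization test on per-segment quality ratings.

    Returns an approximate p-value for the null hypothesis that human and
    machine translations receive the same ratings (i.e. parity)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(h - m for h, m in zip(human, machine)))
    hits = 0
    for _ in range(trials):
        # Randomly flip the sign of each pair's difference and recompute.
        diff = statistics.mean((h - m) if rng.random() < 0.5 else (m - h)
                               for h, m in zip(human, machine))
        if abs(diff) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-segment adequacy ratings on a 0-100 scale.
human_scores = [82, 75, 90, 68, 77, 85, 73, 88, 91, 70]
machine_scores = [80, 77, 88, 70, 76, 84, 75, 86, 90, 72]
p_value = paired_randomization_test(human_scores, machine_scores)
# A large p-value means no significant difference was detected, which is
# the operational notion of parity used in such evaluations.
```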
Multi-Source Neural Machine Translation with Data Augmentation
Multi-source translation systems translate from multiple languages to a
single target language. By using information from these multiple sources, these
systems achieve large gains in accuracy. To train these systems, it is
necessary to have corpora with parallel text in multiple sources and the target
language. However, these corpora are rarely complete in practice due to the
difficulty of providing human translations in all of the relevant languages. In
this paper, we propose a data augmentation approach to fill such incomplete
parts using multi-source neural machine translation (NMT). In our experiments,
results varied over different language combinations but significant gains were
observed when using a source language similar to the target language.
Comment: 15th International Workshop on Spoken Language Translation 201
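The filling step can be sketched with a stub standing in for a trained NMT model; `mt_translate` below is a hypothetical placeholder, not the paper's system:

```python
# Sketch of the data-augmentation idea: fill missing sides of an incomplete
# multi-parallel corpus with machine translations.
def mt_translate(text, src, tgt):
    # Placeholder: a real NMT system would return an actual translation.
    return f"<{src}->{tgt}: {text}>"

def fill_missing(corpus, languages):
    """corpus: list of dicts mapping language code -> sentence (or None)."""
    filled = []
    for row in corpus:
        row = dict(row)
        for lang in languages:
            if row.get(lang) is None:
                # Back-fill from any language that is present in this row.
                src = next(l for l in languages if row.get(l) is not None)
                row[lang] = mt_translate(row[src], src, lang)
        filled.append(row)
    return filled

corpus = [
    {"de": "Guten Morgen", "fr": "Bonjour", "en": "Good morning"},
    {"de": "Danke", "fr": None, "en": "Thank you"},  # French side missing
]
complete = fill_missing(corpus, ["de", "fr", "en"])
```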
Learning to Represent Bilingual Dictionaries
Bilingual word embeddings have been widely used to capture the similarity of
lexical semantics in different human languages. However, many applications,
such as cross-lingual semantic search and question answering, can benefit greatly from the cross-lingual correspondence between sentences and lexicons.
To bridge this gap, we propose a neural embedding model that leverages
bilingual dictionaries. The proposed model is trained to map the literal word
definitions to the cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on
limited resources, our model adopts several critical learning strategies,
including multi-task learning on different bridges of languages, and joint
learning of the dictionary model with a bilingual word embedding model.
Experimental evaluation focuses on two applications. The results of the
cross-lingual reverse dictionary retrieval task show our model's promising ability to comprehend bilingual concepts based on descriptions, and highlight the effectiveness of the proposed learning strategies in improving performance. Meanwhile, our model effectively addresses the bilingual paraphrase identification problem and significantly outperforms previous approaches.
Comment: CoNLL 201
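The reverse-dictionary retrieval task can be illustrated with a toy sketch: encode a definition by averaging word embeddings and return the nearest target-language word in a shared space. The vectors and romanized target words below are made-up illustrations, not trained embeddings:

```python
import math

def encode(tokens, emb):
    """Encode a definition as the average of its word vectors."""
    dims = len(next(iter(emb.values())))
    vec = [0.0] * dims
    for t in tokens:
        for i, x in enumerate(emb.get(t, [0.0] * dims)):
            vec[i] += x
    return [x / max(len(tokens), 1) for x in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical English word vectors and target-side words in a shared space.
en_emb = {"large": [1.0, 0.1], "animal": [0.2, 1.0], "grey": [0.5, 0.6]}
zh_words = {"da_xiang": [0.6, 0.7], "xiao_mao": [-0.9, 0.2]}

query = encode(["large", "grey", "animal"], en_emb)
best = max(zh_words, key=lambda w: cosine(query, zh_words[w]))
```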
Syntax-aware Data Augmentation for Neural Machine Translation
Data augmentation is an effective way to enhance neural machine translation (NMT) performance by generating additional bilingual data. In this paper, we propose a novel data augmentation strategy for neural machine translation. Unlike existing data augmentation methods, which simply choose words for modification with the same probability across different sentences, we set a sentence-specific probability for word selection by considering each word's role in its sentence. We use the dependency parse tree of the input sentence as an effective clue to determine the selection probability for every word in each sentence. Our proposed method is evaluated on the WMT14 English-to-German and IWSLT14 German-to-English datasets. The results of extensive experiments show that our proposed syntax-aware data augmentation method can effectively boost existing sentence-independent methods, yielding significant translation performance improvements.
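The sentence-specific selection idea can be sketched as follows; the depth-to-probability mapping is an illustrative assumption, not the paper's exact scheme:

```python
# Syntax-aware word selection: instead of one uniform per-word probability,
# weight each word by its depth in the dependency parse, so words far from
# the root (often less critical to meaning) are modified more often.
def depths(heads):
    """heads[i] is the index of token i's head; -1 marks the root."""
    out = []
    for i in range(len(heads)):
        d, j = 0, i
        while heads[j] != -1:
            j = heads[j]
            d += 1
        out.append(d)
    return out

def selection_probs(heads, base=0.1):
    # Deeper words get a higher chance of being chosen for modification.
    return [min(1.0, base * (d + 1)) for d in depths(heads)]

# "The quick fox jumped": "jumped" is the root, "fox" depends on "jumped",
# and "The"/"quick" depend on "fox".
heads = [2, 2, 3, -1]
probs = selection_probs(heads)
```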
A Comprehensive Survey of Grammar Error Correction
Grammar error correction (GEC) is an important application of natural language processing techniques. The past decade has witnessed significant progress in GEC owing to the increasing popularity of machine learning and deep learning, especially in the late 2010s, when near human-level GEC systems became available. However, no prior work has focused on a complete recapitulation of this progress. We present the first survey of GEC, offering a comprehensive retrospect of the literature in this area. We first introduce five public datasets, the data annotation schema, two important shared tasks, and four standard evaluation metrics. More importantly, we discuss four kinds of basic approaches, namely the statistical machine translation based approach, the neural machine translation based approach, the classification based approach, and the language model based approach, along with six commonly applied performance-boosting techniques for GEC systems and two data augmentation methods. Since GEC is typically viewed as a sister task of machine translation, many GEC systems are based on neural machine translation (NMT) approaches, where the neural sequence-to-sequence model is applied. Similarly, some performance-boosting techniques are adapted from machine translation and are successfully combined with GEC systems to enhance final performance. Furthermore, we analyze the basic approaches, performance-boosting techniques, and integrated GEC systems based on their experimental results to reveal clearer patterns and conclusions. Finally, we discuss five prospective directions for future GEC research.
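Among the standard evaluation metrics mentioned above, span-level F0.5 weights precision twice as heavily as recall, since proposing a wrong correction is considered worse than missing one. A minimal sketch with illustrative edit tuples:

```python
# Span-level F-beta over hypothesis and gold edit sets; beta=0.5 is the
# standard GEC setting, favoring precision over recall.
def f_beta(hyp_edits, gold_edits, beta=0.5):
    hyp, gold = set(hyp_edits), set(gold_edits)
    tp = len(hyp & gold)
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(gold) if gold else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Edits as (start, end, replacement) tuples; the values are illustrative.
gold = {(1, 2, "went"), (4, 5, "the")}
hyp = {(1, 2, "went"), (6, 7, "a")}
score = f_beta(hyp, gold)
```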
Iterative Batch Back-Translation for Neural Machine Translation: A Conceptual Model
An effective method to generate a large number of parallel sentences for
training improved neural machine translation (NMT) systems is the use of
back-translations of the target-side monolingual data. Recently, iterative
back-translation has been shown to outperform standard back-translation, albeit only on some language pairs. This work proposes iterative batch back-translation, which aims to enhance standard iterative back-translation and enable
the efficient utilization of more monolingual data. After each iteration,
improved back-translations of new sentences are added to the parallel data that
will be used to train the final forward model. The work presents a conceptual
model of the proposed approach.Comment: This article was a proposal, a conceptual model and, thereby,
substantially overlapping with arXiv:1912.10514. This research has been
substantially reworked. Some of the findings are presented in
arXiv:1912.10514, arXiv:2006.02876 and arXiv:2011.07403. The final work will
be submitted for publishing in due cours
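The conceptual model can be sketched as a loop in which each batch of monolingual data is back-translated by a model retrained on all data gathered so far; `train` and `translate` below are hypothetical stand-ins for real NMT training and decoding:

```python
def train(parallel):
    # A real implementation would fit an NMT model; here the "model" simply
    # records the data it was trained on, so the loop can be traced.
    return {"seen": list(parallel)}

def translate(model, sentences):
    return [f"bt({s})" for s in sentences]

def iterative_batch_back_translation(parallel, mono_batches):
    data = list(parallel)
    for batch in mono_batches:
        backward = train([(t, s) for s, t in data])  # target -> source
        synthetic_src = translate(backward, batch)
        # Add the new back-translations; earlier pairs stay in the pool.
        data += list(zip(synthetic_src, batch))
    return train(data)                               # final forward model

forward = iterative_batch_back_translation(
    parallel=[("hallo", "hello")],
    mono_batches=[["good morning"], ["thank you"]],
)
```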
A Survey on Low-Resource Neural Machine Translation
Neural approaches have achieved state-of-the-art accuracy on machine
translation but suffer from the high cost of collecting large-scale parallel
data. Thus, a lot of research has been conducted for neural machine translation
(NMT) with very limited parallel data, i.e., the low-resource setting. In this
paper, we provide a survey for low-resource NMT and classify related works into
three categories according to the auxiliary data they used: (1) exploiting
monolingual data of source and/or target languages, (2) exploiting data from
auxiliary languages, and (3) exploiting multi-modal data. We hope that our
survey can help researchers to better understand this field and inspire them to
design better algorithms, and help industry practitioners to choose appropriate
algorithms for their applications.
Comment: A short version was submitted to the IJCAI 2021 Survey Track on Feb. 26th, 2021, and accepted on Apr. 16th, 2021. 14 pages, 4 figures.
Towards Better Chinese-centric Neural Machine Translation for Low-resource Languages
The last decade has witnessed enormous improvements in science and technology, stimulating a growing demand for economic and cultural exchanges among countries. Building a neural machine translation (NMT) system has therefore become an urgent trend, especially in the low-resource setting. However, recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese. To address this, the low-resource multilingual translation challenge of the 2021 iFLYTEK AI Developer Competition provides Chinese-centric multilingual low-resource NMT tasks, in which participants are required to build NMT systems from the provided low-resource samples. In this paper, we present the winning system, which leverages monolingual word-embedding data enhancement, bilingual curriculum learning, and contrastive re-ranking. In addition, a new Incomplete-Trust (In-trust) loss function is proposed to replace the traditional cross-entropy loss during training. The experimental results demonstrate that these ideas lead to better performance than other state-of-the-art methods. All the experimental code is released at:
https://github.com/WENGSYX/Low-resource-text-translation
Comment: 7 pages, 4 figures, 4 tables.
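Of the three ingredients, re-ranking is the easiest to sketch: rescore an n-best list with a weighted combination of the forward model score and an auxiliary (e.g. contrastive) score. The scores and the weight below are illustrative assumptions, not the competition system's actual values:

```python
# Re-rank n-best candidate translations by a weighted sum of the forward
# model log-probability and an auxiliary log-score.
def rerank(candidates, weight=0.5):
    """candidates: list of (text, model_logprob, aux_logprob) tuples."""
    return sorted(candidates,
                  key=lambda c: c[1] + weight * c[2],
                  reverse=True)

nbest = [
    ("translation A", -1.2, -4.0),
    ("translation B", -1.5, -1.0),
    ("translation C", -2.0, -0.5),
]
best = rerank(nbest)[0][0]
# "translation A" has the best forward score, but the auxiliary score
# demotes it; the combined score prefers "translation B".
```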
Language Model-Driven Unsupervised Neural Machine Translation
Unsupervised neural machine translation (NMT) suffers from noise and errors in the synthetic data produced by vanilla back-translation. Here, we explicitly exploit a language model (LM) to drive the construction of an unsupervised NMT system. This involves two steps. First, we initialize NMT models using synthetic data generated via temporary statistical machine translation (SMT). Second, unlike vanilla back-translation, we formulate a weight function that scores the synthetic data at each step of the subsequent iterative training; this steers unsupervised training toward an improved outcome. We present a detailed mathematical construction of our method. Experiments on the WMT2014 English-French and WMT2016 English-German and English-Russian translation tasks show that our method outperforms the best prior systems by more than 3 BLEU points.
Comment: 11 pages, 3 figures, 7 tables.
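The weight-function idea can be sketched with a deliberately tiny unigram model standing in for a real LM; the counts, smoothing, and normalization below are illustrative assumptions, not the paper's formulation:

```python
import math

# Toy unigram "LM": word counts over some monolingual corpus.
counts = {"the": 50, "cat": 10, "sat": 8}
total = 100

def unigram_logprob(sentence):
    # Unseen words fall back to count 1 (a crude smoothing assumption).
    return sum(math.log(counts.get(w, 1) / total) for w in sentence.split())

def weight_synthetic(pairs):
    """Score each synthetic (source, target) pair with the LM and turn the
    scores into training weights, so noisy back-translations count less."""
    scores = [unigram_logprob(src) for src, _ in pairs]
    best = max(scores)
    # Normalized weights in (0, 1]; the most fluent pair gets weight 1.0.
    return [(pair, math.exp(s - best)) for pair, s in zip(pairs, scores)]

weighted = weight_synthetic([
    ("the cat sat", "le chat est assis"),  # fluent back-translation
    ("zzq zzq zzq", "bruit"),              # noisy back-translation
])
```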
Enhanced back-translation for low resource neural machine translation using self-training
Improving neural machine translation (NMT) models using back-translations of the monolingual target data (synthetic parallel data) is currently the state-of-the-art approach for training improved translation systems. The quality of the backward system, which is trained on the available parallel data and used for the back-translation, has been shown in many studies to affect the performance of the final NMT model. In low-resource conditions, the available parallel data is usually not enough to train a backward model that can produce the high-quality synthetic data needed to train a standard translation model. This work proposes a self-training strategy in which the output of the backward model is used to improve the model itself through the forward translation technique. The technique was shown to improve baseline low-resource IWSLT'14 English-German and IWSLT'15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU points, respectively. The synthetic data generated by the improved English-German backward model was used to train a forward model which outperformed another forward model trained using standard back-translation by 2.7 BLEU.
Comment: 17 pages, 3 figures, 5 tables; accepted for publication in the International Conference on Information and Communication Technology and Applications (ICTA 2020).
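The self-training loop can be sketched as follows: the backward model translates source-side monolingual data, its own outputs become synthetic targets, and the model is retrained on the combined data. `train` and `translate` are hypothetical stand-ins for real NMT training and decoding:

```python
def train(pairs):
    # Stand-in "model" that records its training data so the loop is traceable.
    return {"data": list(pairs)}

def translate(model, sentences):
    return [f"hyp({s})" for s in sentences]

def self_train_backward(parallel, mono_source, rounds=2):
    model = train(parallel)
    for _ in range(rounds):
        synthetic_tgt = translate(model, mono_source)
        # Forward translation: pair monolingual sources with model outputs
        # and retrain the backward model on real plus synthetic pairs.
        model = train(parallel + list(zip(mono_source, synthetic_tgt)))
    return model

backward = self_train_backward(
    parallel=[("hello", "hallo")],
    mono_source=["good night"],
)
```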