Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation
Neural Machine Translation has achieved state-of-the-art performance for
several language pairs using a combination of parallel and synthetic data.
Synthetic data is often generated by back-translating sentences randomly
sampled from monolingual data using a reverse translation model. While
back-translation has been shown to be very effective in many cases, it is not
entirely clear why. In this work, we explore different aspects of
back-translation, and show that words with high prediction loss during training
benefit most from the addition of synthetic data. We introduce several
variations of sampling strategies targeting difficult-to-predict words using
prediction losses and frequencies of words. In addition, we also target the
contexts of difficult words and sample sentences that are similar in context.
Experimental results for the WMT news translation task show that our method
improves translation quality by up to 1.7 and 1.2 BLEU points over
back-translation using random sampling for German-English and English-German,
respectively.
Comment: 11 pages, 2 figures. Accepted at EMNLP 2018
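As a rough illustration of the sampling idea (not the authors' exact procedure, which uses per-token prediction losses from a trained model), the sketch below approximates token difficulty by corpus rarity and samples monolingual sentences for back-translation in proportion to their average difficulty. All function names here are illustrative assumptions.

```python
import math
import random
from collections import Counter

def token_difficulty(freqs, total, token):
    """Proxy for how hard a token is to predict: rare tokens score higher.
    (A stand-in for the per-token prediction loss used in the paper.)"""
    count = freqs.get(token, 0) + 1  # add-one smoothing for unseen tokens
    return -math.log(count / (total + len(freqs) + 1))

def difficulty_score(sentence, freqs, total):
    """Average per-token difficulty of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(token_difficulty(freqs, total, t) for t in tokens) / len(tokens)

def sample_for_backtranslation(mono_sentences, train_sentences, k, seed=0):
    """Sample k monolingual sentences with probability proportional to their
    difficulty score, instead of sampling uniformly at random."""
    freqs = Counter(t for s in train_sentences for t in s.split())
    total = sum(freqs.values())
    weights = [difficulty_score(s, freqs, total) for s in mono_sentences]
    return random.Random(seed).choices(mono_sentences, weights=weights, k=k)

# The selected sentences would then be back-translated with a reverse
# (target-to-source) model to produce synthetic source sides.
print(sample_for_backtranslation(
    ["an idiosyncratic feline reposed", "the cat sat", "dogs bark loudly"],
    ["the cat sat on the mat", "the dog barked"],
    k=2))
```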
Comparison of Data Selection Techniques for the Translation of Video Lectures
For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and more efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures) and the Spanish MINECO Active2Trans (TIN2012-31723) research project.
Wuebker, J.; Ney, H.; Martínez-Villaronga, A.; Giménez Pastor, A.; Juan Císcar, A.; Servan, C.; Dymetman, M.... (2014). Comparison of Data Selection Techniques for the Translation of Video Lectures. Association for Machine Translation in the Americas. http://hdl.handle.net/10251/54431
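For the cross-entropy based criterion, a minimal sketch of the underlying selection rule (the cross-entropy difference between an in-domain and a general language model, in the spirit of Moore-Lewis selection) is given below; the paper's best-performing variant additionally combines translation-model cross-entropy, which is omitted here. The `CrossEntropyLM` interface and all names are illustrative assumptions, not the paper's code.

```python
from typing import Callable, List

# Assumed interface: a scorer returning the per-token cross-entropy of a
# sentence under some language model; any LM toolkit could back this.
CrossEntropyLM = Callable[[str], float]

def ce_difference(sent: str, in_domain_lm: CrossEntropyLM,
                  general_lm: CrossEntropyLM) -> float:
    """Cross-entropy difference: lower values mean the sentence looks more
    like the in-domain (lecture) data and less like the general pool."""
    return in_domain_lm(sent) - general_lm(sent)

def select_top(pool: List[str], in_domain_lm: CrossEntropyLM,
               general_lm: CrossEntropyLM, n: int) -> List[str]:
    """Keep the n pool sentences with the lowest cross-entropy difference."""
    return sorted(pool, key=lambda s: ce_difference(s, in_domain_lm, general_lm))[:n]

# Toy usage: scorers that merely pretend lecture-like text is "cheaper"
# under the in-domain model.
def in_domain(s: str) -> float:
    return 2.0 if "lecture" in s else 5.0

def general(s: str) -> float:
    return 4.0

print(select_top(["a video lecture on NMT", "random web text"], in_domain, general, n=1))
```

Infrequent n-gram recovery, the second criterion compared in the paper, would instead rank candidate sentences by how many rare in-domain n-grams they cover; the two scores can then be combined, as the paper does, to trade off BLEU stability against out-of-vocabulary coverage.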
Building a Neural Machine Translation System Using Synthetic Parallel Data
Master's thesis, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2017.
Synthetic parallel data, which can be generated automatically by a trained translation model, has recently drawn attention as an effective remedy for various issues arising in neural machine translation. Building on the utility of such data, this study constructs neural machine translation systems using only synthetic parallel data. We further propose a new type of synthetic parallel data that can serve as an effective complement to real parallel data. The proposed data differ from previously proposed synthetic parallel corpora in that real and synthesized sentences are mixed on both sides of each sentence pair. Under identical conditions, neural machine translation systems trained on the proposed data show better and more stable translation performance in both directions than systems trained on previously proposed synthetic data. Moreover, when a model trained on the new synthetic parallel data is fine-tuned on real parallel data, it achieves comparatively larger gains in translation quality than models trained on previously proposed synthetic data.
Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. We also present a novel synthetic parallel corpus as an efficient alternative to real parallel data. The proposed pseudo parallel data are distinct from those of previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus in empirical NMT applications, showing not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth parallel corpus.
Table of Contents
I. Introduction 1
II. Background: Neural Machine Translation 4
III. Related Work 9
IV. Synthetic Parallel Data as an Alternative to Real Parallel Corpus 11
4.1. Motivation 11
4.2. Limits of the Previous Approaches 11
4.3. Proposed Mixing Approach 14
V. Experiments: Effects of Mixing Real and Synthetic Examples 17
5.1. Data Preparation 18
5.2. Data Preprocessing 19
5.3. Training and Evaluation 19
5.4. Results and Analysis 20
5.4.1. A Comparison between Pivot-based Approach and Back-translation 20
5.4.2. Effects of Mixing Source- and Target-originated Synthetic Parallel Data 21
5.4.3. A Comparison with Phrase-based Statistical Machine Translation 23
VI. Experiments: Large-scale Application 25
6.1. Application Scenarios 25
6.2. Data Preparation 26
6.3. Training and Evaluation 27
6.4. Results and Analysis 31
6.4.1. A Comparison with Real Parallel Data 31
6.4.2. Results from the Pseudo Only Scenario 31
6.4.3. Results from the Real Fine-tuning Scenario 33
VII. Conclusion 35
Bibliography 36
Abstract 43
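The thesis's central idea is a pseudo-parallel corpus in which real and synthetic sentences appear on both sides of the sentence pairs. A minimal sketch of one way to assemble such a corpus, combining source-originated pairs (real source, forward-translated target) with target-originated pairs (back-translated source, real target), is shown below; the function `build_mixed_pseudo_corpus` and the two model interfaces are illustrative assumptions, not the thesis's exact recipe.

```python
from typing import Callable, List, Tuple

def build_mixed_pseudo_corpus(
    mono_src: List[str],
    mono_tgt: List[str],
    forward_mt: Callable[[str], str],   # assumed source-to-target model
    backward_mt: Callable[[str], str],  # assumed target-to-source model
) -> List[Tuple[str, str]]:
    """Assemble a pseudo-parallel corpus in which real and synthetic sentences
    appear on both sides: real sources paired with machine-translated targets,
    plus back-translated sources paired with real targets."""
    src_originated = [(s, forward_mt(s)) for s in mono_src]   # real src, synthetic tgt
    tgt_originated = [(backward_mt(t), t) for t in mono_tgt]  # synthetic src, real tgt
    return src_originated + tgt_originated
```

A model trained on such a mixture could then be fine-tuned on real parallel data, mirroring the Real Fine-tuning scenario examined in Section 6.4.3.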
Understanding and Enhancing the Use of Context for Machine Translation
To understand and infer meaning in language, neural models have to learn
complicated nuances. Discovering distinctive linguistic phenomena from data is
not an easy task. For instance, lexical ambiguity is a fundamental feature of
language which is challenging to learn. Even more prominently, inferring the
meaning of rare and unseen lexical units is difficult with neural networks.
Meaning is often determined from context. With context, languages allow meaning
to be conveyed even when the specific words used are not known by the reader.
To model this learning process, a system has to learn from a few instances in
context and be able to generalize well to unseen cases. The learning process is
hindered when training data is scarce for a task. Even with sufficient data,
learning patterns for the long tail of the lexical distribution is challenging.
In this thesis, we focus on understanding certain potentials of contexts in
neural models and design augmentation models to benefit from them. We focus on
machine translation as an important instance of the more general language
understanding problem. To translate from a source language to a target
language, a neural model has to understand the meaning of constituents in the
provided context and generate constituents with the same meanings in the target
language. This task accentuates the value of capturing nuances of language and
the necessity of generalization from few observations. The main problem we
study in this thesis is what neural machine translation models learn from data
and how we can devise more focused contexts to enhance this learning. Looking
more in-depth into the role of context and the impact of data on learning
models is essential to advance the NLP field. Moreover, it helps highlight the
vulnerabilities of current neural networks and provides insights into designing
more robust models.
Comment: PhD dissertation defended on November 10th, 2020