
    Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation

    Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data with a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this work, we explore different aspects of back-translation and show that words with high prediction loss during training benefit most from the addition of synthetic data. We introduce several variations of sampling strategies that target difficult-to-predict words using prediction losses and word frequencies. In addition, we target the contexts of difficult words and sample sentences that are similar in context. Experimental results for the WMT news translation task show that our method improves translation quality by up to 1.7 and 1.2 BLEU points over back-translation with random sampling for German-English and English-German, respectively.

    Comment: 11 pages, 2 figures. Accepted at EMNLP 2018.
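    The sampling idea above can be made concrete with a short sketch: score each monolingual sentence by how many difficult-to-predict words it contains, then sample in proportion to that score. The following is a minimal Python illustration, assuming token-level prediction losses were collected during NMT training; the function names and the +1 weight smoothing are assumptions of this sketch, not the authors' implementation.

```python
import heapq
import random
from collections import defaultdict

def difficult_vocab(token_losses, k=1000):
    """Return the k words with the highest mean prediction loss.

    `token_losses` is an iterable of (target word, cross-entropy loss)
    pairs observed while training the forward translation model.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for token, loss in token_losses:
        totals[token] += loss
        counts[token] += 1
    mean_loss = {w: totals[w] / counts[w] for w in totals}
    return set(heapq.nlargest(k, mean_loss, key=mean_loss.get))

def sample_for_back_translation(monolingual, hard_words, n_samples, seed=0):
    """Sample sentences with probability proportional to their difficult-word count."""
    rng = random.Random(seed)
    # +1 smoothing keeps sentences without any difficult word selectable.
    weights = [1 + sum(tok in hard_words for tok in sent.split())
               for sent in monolingual]
    return rng.choices(monolingual, weights=weights, k=n_samples)
```

    The sampled sentences would then be back-translated with the reverse model and the resulting synthetic pairs added to the parallel training data.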

    Comparison of Data Selection Techniques for the Translation of Video Lectures

    For the task of online translation of scientific video lectures, using very large models is not feasible. In order to obtain smaller and more efficient models, we perform data selection. In this paper, we carry out a qualitative and quantitative comparison of several data selection techniques based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation-model and language-model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words; here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures) and the Spanish MINECO Active2Trans (TIN2012-31723) research project.

    Wuebker, J.; Ney, H.; Martínez-Villaronga, A.; Giménez Pastor, A.; Juan Císcar, A.; Servan, C.; Dymetman, M.... (2014). Comparison of Data Selection Techniques for the Translation of Video Lectures. Association for Machine Translation in the Americas. http://hdl.handle.net/10251/54431
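    As a rough illustration of the two families of selection criteria compared here, the sketch below pairs a Moore-Lewis-style cross-entropy-difference selector with a greedy infrequent n-gram recovery pass. `lm_xent` is a hypothetical scoring function returning per-sentence cross-entropy under an in-domain or a general language model; the paper's additional translation-model cross-entropy term is omitted for brevity.

```python
from collections import Counter

def select_by_cross_entropy(pairs, lm_xent, budget):
    """Keep the `budget` sentence pairs whose source side looks most in-domain.

    Score = H_in(src) - H_general(src); lower is better. The paper combines
    this language-model criterion with an analogous translation-model one.
    """
    key = lambda pair: lm_xent(pair[0], "in") - lm_xent(pair[0], "general")
    return sorted(pairs, key=key)[:budget]

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def infrequent_ngram_recovery(pool, task_ngrams, n=3, threshold=1):
    """Greedily add pairs covering task n-grams still seen < `threshold` times.

    Targeting rare n-grams is what drives down the out-of-vocabulary rate.
    """
    seen, selected = Counter(), []
    for src, tgt in pool:
        grams = [g for g in ngrams(src.split(), n) if g in task_ngrams]
        if any(seen[g] < threshold for g in grams):
            selected.append((src, tgt))
            seen.update(grams)
    return selected
```

    Combining the two criteria, as the paper proposes, could be as simple as taking the union of both selections before training the smaller model.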

    Building a Neural Machine Translation System Using Synthetic Parallel Data

    Thesis (Master's) -- Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, August 2017. Advisor: Sungroh Yoon.

    Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. We also present a novel synthetic parallel corpus as an efficient alternative to real parallel data. The proposed pseudo-parallel data are distinct from those of previous works in that ground-truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translation demonstrate the efficacy of the proposed pseudo-parallel corpus in empirical NMT applications: it not only yields improved results on bidirectional translation tasks, but also brings substantial additional gains when a ground-truth parallel corpus is used for fine-tuning.

    Table of Contents:
    I. Introduction
    II. Background: Neural Machine Translation
    III. Related Work
    IV. Synthetic Parallel Data as an Alternative to a Real Parallel Corpus
        4.1. Motivation
        4.2. Limits of the Previous Approaches
        4.3. Proposed Mixing Approach
    V. Experiments: Effects of Mixing Real and Synthetic Examples
        5.1. Data Preparation
        5.2. Data Preprocessing
        5.3. Training and Evaluation
        5.4. Results and Analysis
            5.4.1. A Comparison between Pivot-based Approach and Back-translation
            5.4.2. Effects of Mixing Source- and Target-originated Synthetic Parallel Data
            5.4.3. A Comparison with Phrase-based Statistical Machine Translation
    VI. Experiments: Large-scale Application
        6.1. Application Scenarios
        6.2. Data Preparation
        6.3. Training and Evaluation
        6.4. Results and Analysis
            6.4.1. A Comparison with Real Parallel Data
            6.4.2. Results from the Pseudo Only Scenario
            6.4.3. Results from the Real Fine-tuning Scenario
    VII. Conclusion
    Bibliography
    Abstract
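    One way to read the proposed mixing approach is sketched below: source-originated pairs (real source, machine-translated target) and target-originated pairs (machine-translated source, real target) are pooled, so that synthetic text ends up on both sides of the corpus. The helpers `translate_s2t` and `translate_t2s` stand for already-trained translation models and are assumptions of this sketch, not code from the thesis.

```python
import random

def build_mixed_corpus(src_mono, tgt_mono, translate_s2t, translate_t2s, seed=0):
    """Build pseudo-parallel data with synthetic sentences on both sides."""
    corpus = []
    # Source-originated pairs: real source, synthetic target (forward translation).
    corpus.extend((s, translate_s2t(s)) for s in src_mono)
    # Target-originated pairs: synthetic source, real target (back-translation).
    corpus.extend((translate_t2s(t), t) for t in tgt_mono)
    random.Random(seed).shuffle(corpus)  # interleave the two origins
    return corpus
```

    Under the real fine-tuning scenario listed in the table of contents, a model trained on such a mixed corpus is afterwards fine-tuned on genuine parallel data.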

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language that is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding certain potentials of context in neural models and design augmentation models to benefit from them. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalization from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models.

    Comment: PhD dissertation defended on November 10th, 2020.