Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation
Neural Machine Translation has achieved state-of-the-art performance for
several language pairs using a combination of parallel and synthetic data.
Synthetic data is often generated by back-translating sentences randomly
sampled from monolingual data using a reverse translation model. While
back-translation has been shown to be very effective in many cases, it is not
entirely clear why. In this work, we explore different aspects of
back-translation, and show that words with high prediction loss during training
benefit most from the addition of synthetic data. We introduce several
variations of sampling strategies targeting difficult-to-predict words using
prediction losses and frequencies of words. In addition, we also target the
contexts of difficult words and sample sentences that are similar in context.
Experimental results for the WMT news translation task show that our method
improves translation quality by up to 1.7 and 1.2 BLEU points over
back-translation using random sampling for German-English and English-German,
respectively.
Comment: 11 pages, 2 figures. Accepted at EMNLP 2018
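As a rough illustration of the sampling idea (not the authors' exact procedure, which uses per-token prediction losses from a trained model), the sketch below approximates token difficulty by corpus rarity and samples monolingual sentences for back-translation in proportion to their average difficulty. All function names here are illustrative assumptions.

```python
import math
import random
from collections import Counter

def token_difficulty(freqs, total, token):
    """Proxy for how hard a token is to predict: rare tokens score higher.
    (A stand-in for the per-token prediction loss used in the paper.)"""
    count = freqs.get(token, 0) + 1  # add-one smoothing for unseen tokens
    return -math.log(count / (total + len(freqs) + 1))

def difficulty_score(sentence, freqs, total):
    """Average per-token difficulty of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(token_difficulty(freqs, total, t) for t in tokens) / len(tokens)

def sample_for_backtranslation(mono_sentences, train_sentences, k, seed=0):
    """Sample k monolingual sentences with probability proportional to their
    difficulty score, instead of sampling uniformly at random."""
    freqs = Counter(t for s in train_sentences for t in s.split())
    total = sum(freqs.values())
    weights = [difficulty_score(s, freqs, total) for s in mono_sentences]
    return random.Random(seed).choices(mono_sentences, weights=weights, k=k)

# The selected sentences would then be back-translated with a reverse
# (target-to-source) model to produce synthetic source sides.
print(sample_for_backtranslation(
    ["an idiosyncratic feline reposed", "the cat sat", "dogs bark loudly"],
    ["the cat sat on the mat", "the dog barked"],
    k=2))
```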
Comparison of Data Selection Techniques for the Translation of Video Lectures
For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and more efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures) and the Spanish MINECO Active2Trans (TIN2012-31723) research project.
Wuebker, J.; Ney, H.; Martínez-Villaronga, A.; Giménez Pastor, A.; Juan Císcar, A.; Servan, C.; Dymetman, M.... (2014). Comparison of Data Selection Techniques for the Translation of Video Lectures. Association for Machine Translation in the Americas. http://hdl.handle.net/10251/54431
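For the cross-entropy based criterion, a minimal sketch of the underlying selection rule (the cross-entropy difference between an in-domain and a general language model, in the spirit of Moore-Lewis selection) is given below; the paper's best-performing variant additionally combines translation-model cross-entropy, which is omitted here. The `CrossEntropyLM` interface and all names are illustrative assumptions, not the paper's code.

```python
from typing import Callable, List

# Assumed interface: a scorer returning the per-token cross-entropy of a
# sentence under some language model; any LM toolkit could back this.
CrossEntropyLM = Callable[[str], float]

def ce_difference(sent: str, in_domain_lm: CrossEntropyLM,
                  general_lm: CrossEntropyLM) -> float:
    """Cross-entropy difference: lower values mean the sentence looks more
    like the in-domain (lecture) data and less like the general pool."""
    return in_domain_lm(sent) - general_lm(sent)

def select_top(pool: List[str], in_domain_lm: CrossEntropyLM,
               general_lm: CrossEntropyLM, n: int) -> List[str]:
    """Keep the n pool sentences with the lowest cross-entropy difference."""
    return sorted(pool, key=lambda s: ce_difference(s, in_domain_lm, general_lm))[:n]

# Toy usage: scorers that merely pretend lecture-like text is "cheaper"
# under the in-domain model.
def in_domain(s: str) -> float:
    return 2.0 if "lecture" in s else 5.0

def general(s: str) -> float:
    return 4.0

print(select_top(["a video lecture on NMT", "random web text"], in_domain, general, n=1))
```

Infrequent n-gram recovery, the second criterion compared in the paper, would instead rank candidate sentences by how many rare in-domain n-grams they cover; the two scores can then be combined, as the paper does, to trade off BLEU stability against out-of-vocabulary coverage.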
Building a Neural Machine Translation System Using Synthetic Parallel Data
Master's thesis, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2017.
Synthetic parallel data, which can be generated automatically by a trained translation model, has recently drawn attention as an effective remedy for various issues arising in neural machine translation. Building on the utility of such data, this study constructs neural machine translation systems using only synthetic parallel data. We further propose a new type of synthetic parallel data that can serve as an effective complement to real parallel data. The proposed data differ from previously proposed synthetic parallel corpora in that real and synthesized sentences are mixed on both sides of each sentence pair. Under identical conditions, neural machine translation systems trained on the proposed data show better and more stable translation performance in both directions than systems trained on previously proposed synthetic data. Moreover, when a model trained on the new synthetic parallel data is fine-tuned on real parallel data, it achieves comparatively larger gains in translation quality than models trained on previously proposed synthetic data.
Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. We also present a novel synthetic parallel corpus as an efficient alternative to real parallel data. The proposed pseudo parallel data are distinct from those of previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus in empirical NMT applications, showing not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth parallel corpus.
Table of Contents
I. Introduction 1
II. Background: Neural Machine Translation 4
III. Related Work 9
IV. Synthetic Parallel Data as an Alternative to Real Parallel Corpus 11
4.1. Motivation 11
4.2. Limits of the Previous Approaches 11
4.3. Proposed Mixing Approach 14
V. Experiments: Effects of Mixing Real and Synthetic Examples 17
5.1. Data Preparation 18
5.2. Data Preprocessing 19
5.3. Training and Evaluation 19
5.4. Results and Analysis 20
5.4.1. A Comparison between Pivot-based Approach and Back-translation 20
5.4.2. Effects of Mixing Source- and Target-originated Synthetic Parallel Data 21
5.4.3. A Comparison with Phrase-based Statistical Machine Translation 23
VI. Experiments: Large-scale Application 25
6.1. Application Scenarios 25
6.2. Data Preparation 26
6.3. Training and Evaluation 27
6.4. Results and Analysis 31
6.4.1. A Comparison with Real Parallel Data 31
6.4.2. Results from the Pseudo Only Scenario 31
6.4.3. Results from the Real Fine-tuning Scenario 33
VII. Conclusion 35
Bibliography 36
Abstract 43
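The thesis's central idea is a pseudo-parallel corpus in which real and synthetic sentences appear on both sides of the sentence pairs. A minimal sketch of one way to assemble such a corpus, combining source-originated pairs (real source, forward-translated target) with target-originated pairs (back-translated source, real target), is shown below; the function `build_mixed_pseudo_corpus` and the two model interfaces are illustrative assumptions, not the thesis's exact recipe.

```python
from typing import Callable, List, Tuple

def build_mixed_pseudo_corpus(
    mono_src: List[str],
    mono_tgt: List[str],
    forward_mt: Callable[[str], str],   # assumed source-to-target model
    backward_mt: Callable[[str], str],  # assumed target-to-source model
) -> List[Tuple[str, str]]:
    """Assemble a pseudo-parallel corpus in which real and synthetic sentences
    appear on both sides: real sources paired with machine-translated targets,
    plus back-translated sources paired with real targets."""
    src_originated = [(s, forward_mt(s)) for s in mono_src]   # real src, synthetic tgt
    tgt_originated = [(backward_mt(t), t) for t in mono_tgt]  # synthetic src, real tgt
    return src_originated + tgt_originated
```

A model trained on such a mixture could then be fine-tuned on real parallel data, mirroring the Real Fine-tuning scenario examined in Section 6.4.3.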
Understanding and Enhancing the Use of Context for Machine Translation
To understand and infer meaning in language, neural models have to learn
complicated nuances. Discovering distinctive linguistic phenomena from data is
not an easy task. For instance, lexical ambiguity is a fundamental feature of
language which is challenging to learn. Even more prominently, inferring the
meaning of rare and unseen lexical units is difficult with neural networks.
Meaning is often determined from context. With context, languages allow meaning
to be conveyed even when the specific words used are not known by the reader.
To model this learning process, a system has to learn from a few instances in
context and be able to generalize well to unseen cases. The learning process is
hindered when training data is scarce for a task. Even with sufficient data,
learning patterns for the long tail of the lexical distribution is challenging.
In this thesis, we focus on understanding certain potentials of contexts in
neural models and design augmentation models to benefit from them. We focus on
machine translation as an important instance of the more general language
understanding problem. To translate from a source language to a target
language, a neural model has to understand the meaning of constituents in the
provided context and generate constituents with the same meanings in the target
language. This task accentuates the value of capturing nuances of language and
the necessity of generalization from few observations. The main problem we
study in this thesis is what neural machine translation models learn from data
and how we can devise more focused contexts to enhance this learning. Looking
more in-depth into the role of context and the impact of data on learning
models is essential to advance the NLP field. Moreover, it helps highlight the
vulnerabilities of current neural networks and provides insights into designing
more robust models.
Comment: PhD dissertation defended on November 10th, 2020