10,228 research outputs found

    Exploring different representational units in English-to-Turkish statistical machine translation

    We investigate different representational granularities for sub-lexical representation in statistical machine translation from English to Turkish. We find that (i) representing both Turkish and English at the morpheme level, but with some selective morpheme grouping on the Turkish side of the training data, (ii) augmenting the training data with "sentences" comprising only the content words of the original training data to bias root-word alignment, (iii) reranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) using model iteration all provide a non-trivial improvement over a fully word-based baseline. Despite our very limited training data, we improve from 20.22 BLEU points for our simplest model to 25.08 BLEU points, an improvement of 4.86 points or 24% relative.
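    Point (iii), reranking morpheme-level decoder output with a word-based language model, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the "+" morpheme-boundary marker, the toy add-one-smoothed unigram LM, and the interpolation weight are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical convention: suffix morphemes carry a "+" prefix,
# so "ev +ler +de" reassembles into the word "evlerde".
def morphemes_to_words(hyp: str) -> list[str]:
    words, current = [], ""
    for tok in hyp.split():
        if tok.startswith("+"):
            current += tok[1:]      # suffix morpheme: attach to current word
        else:
            if current:
                words.append(current)
            current = tok           # root morpheme starts a new word
    if current:
        words.append(current)
    return words

# Toy unigram word LM estimated from a tiny monolingual sample (assumption).
corpus = "evlerde kitap var evde kitap yok".split()
counts = Counter(corpus)
total = sum(counts.values())

def word_lm_logprob(words: list[str]) -> float:
    # Add-one smoothing so unseen words do not zero out a hypothesis.
    return sum(math.log((counts[w] + 1) / (total + len(counts) + 1)) for w in words)

def rerank(nbest: list[tuple[str, float]], lm_weight: float = 0.5):
    """Pick the hypothesis maximising decoder score + weighted word-LM score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * word_lm_logprob(morphemes_to_words(h[0])))

nbest = [("ev +ler +de kitap var", -4.2), ("ev +de kitap var", -4.5)]
print(rerank(nbest))
```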

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
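    As a rough illustration of the fuzzy-matching component, the sketch below retrieves the closest translation-memory entry for a new source sentence using the character-based similarity ratio from Python's standard difflib. SCATE's own matching metrics are more sophisticated; the toy memory, threshold, and scoring here are assumptions.

```python
from difflib import SequenceMatcher

# Toy English-Dutch translation memory (illustrative data only).
tm = [
    ("Press the start button.", "Druk op de startknop."),
    ("Press the stop button.", "Druk op de stopknop."),
    ("Close the cover before starting.", "Sluit het deksel voor het starten."),
]

def fuzzy_match(query: str, memory, threshold: float = 0.7):
    """Return the best TM entry whose similarity reaches the threshold."""
    best, best_score = None, threshold
    for src, tgt in memory:
        score = SequenceMatcher(None, query.lower(), src.lower()).ratio()
        if score >= best_score:
            best, best_score = (src, tgt), score
    return best, best_score

match, score = fuzzy_match("Press the red start button.", tm)
print(round(score, 2), match)
```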

    A Study on Context-Aware Document-Level Neural Machine Translation

    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, February 2022. Advisor: Kyomin Jung.
    Neural machine translation (NMT) has attracted great attention in recent years, as it has yielded state-of-the-art translation quality. Despite their promising results, many current NMT systems are sentence-level: they translate each sentence independently. This ignores the context available in the surrounding text and thus produces inadequate and inconsistent translations at the document level. To overcome these shortcomings, context-aware NMT (CNMT), which takes contextual sentences as additional input, has been proposed. This dissertation proposes novel methods for improving CNMT systems as well as an application of CNMT. We first tackle the efficient modeling of multiple contextual sentences in the CNMT encoder. For this purpose, we propose a hierarchical context encoder that encodes contextual sentences from the token level up to the sentence level. This architecture achieves state-of-the-art translation quality while requiring less computation time for training and translation than existing methods. Secondly, we investigate the training of CNMT models, most of which rely on a negative log-likelihood (NLL) objective that does not fully exploit contextual dependencies. To address this insufficiency, we introduce coreference-based contrastive learning for CNMT, which generates contrastive examples from coreference chains between the source and target sentences. The proposed method improves the pronoun resolution accuracy of CNMT models as well as overall translation quality. Finally, we investigate an application of CNMT to Korean honorifics, which depend on contextual information for adequate translation. For the English-Korean translation task, we propose CNMT models that capture crucial contextual information in the English source document and adopt a context-aware post-editing system that exploits context in the Korean target sentences, resulting in more consistent Korean honorific translations.
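    A minimal sketch of the hierarchical idea, token-level pooling per contextual sentence followed by sentence-level attention, is shown below in NumPy. The pooling choice, dimensions, and random stand-in encodings are illustrative assumptions, not the dissertation's actual Transformer-based encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (assumption)

def encode_tokens(sentence_len: int) -> np.ndarray:
    # Stand-in for a token-level encoder: random token states of shape (len, d).
    return rng.normal(size=(sentence_len, d))

def hierarchical_context(context_lens: list[int], query: np.ndarray) -> np.ndarray:
    # Token level -> one vector per contextual sentence (mean pooling here).
    sent_vecs = np.stack([encode_tokens(n).mean(axis=0) for n in context_lens])
    # Sentence level: attend over sentence vectors using the current-source query.
    scores = sent_vecs @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ sent_vecs  # single context vector passed on to the NMT model

query = rng.normal(size=d)  # stand-in for the current source sentence state
ctx = hierarchical_context([5, 9, 7], query)
print(ctx.shape)  # (8,)
```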
    Table of contents:
    Abstract
    Contents
    List of Tables
    List of Figures
    1 Introduction
    2 Background: Neural Machine Translation
      2.1 A Brief History
      2.2 Problem Setup
      2.3 Encoder-Decoder architectures
        2.3.1 RNN-based Architecture
        2.3.2 SAN-based Architecture
      2.4 Training
      2.5 Decoding
      2.6 Evaluation
    3 Efficient Hierarchical Architecture for Modeling Contextual Sentences
      3.1 Related works
        3.1.1 Modeling Context in NMT
        3.1.2 Hierarchical Context Modeling
        3.1.3 Evaluation of Context-aware NMT
      3.2 Model description
        3.2.1 Context-aware NMT encoders
        3.2.2 Hierarchical context encoder
      3.3 Data
        3.3.1 English-German IWSLT 2017 corpus
        3.3.2 OpenSubtitles corpus
        3.3.3 English-Korean subtitle corpus
      3.4 Experiments
        3.4.1 Hyperparameters and Training details
        3.4.2 Overall BLEU evaluation
        3.4.3 Model complexity analysis
        3.4.4 BLEU evaluation on helpful/unhelpful context
        3.4.5 EnKo pronoun resolution test suite
        3.4.6 Qualitative Analysis
      3.5 Summary of Efficient Hierarchical Architecture for Modeling Contextual Sentences
    4 Contrastive Learning for Context-aware Neural Machine Translation
      4.1 Related Works
        4.1.1 Context-aware NMT Architectures
        4.1.2 Coreference and NMT
        4.1.3 Data augmentation for NMT
        4.1.4 Contrastive Learning
      4.2 Context-aware NMT models
      4.3 Our Method: CorefCL
        4.3.1 Data Augmentation Using Coreference
        4.3.2 Contrastive Learning for Context-aware NMT
      4.4 Experiments
        4.4.1 Datasets
        4.4.2 Settings
        4.4.3 Overall BLEU Evaluation
        4.4.4 Results on English-German Contrastive Evaluation Set
        4.4.5 Analysis
      4.5 Summary of Contrastive Learning for Context-aware Neural Machine Translation
    5 Improving English-Korean Honorific Translation Using Contextual Information
      5.1 Related Works
        5.1.1 Neural Machine Translation dealing with Korean
        5.1.2 Controlling the Styles in NMT
        5.1.3 Context-Aware NMT Framework and Application
      5.2 Addressing Korean Honorifics in Context
        5.2.1 Overview of Korean Honorifics System
        5.2.2 The Role of Context on Choosing Honorifics
      5.3 Context-Aware NMT Frameworks
        5.3.1 NMT Model with Contextual Encoders
        5.3.2 Context-Aware Post Editing (CAPE)
      5.4 Our Proposed Method - Context-Aware NMT for Korean Honorifics
        5.4.1 Using CNMT methods for Honorific-Aware Translation
        5.4.2 Scope of Honorific Expressions
        5.4.3 Automatic Honorific Labeling
      5.5 Experiments
        5.5.1 Dataset and Preprocessing
        5.5.2 Model Implementation and Training Details
        5.5.3 Metrics
        5.5.4 Results
        5.5.5 Translation Examples and Analysis
      5.6 Summary of Improving English-Korean Honorific Translation Using Contextual Information
    6 Future Directions
      6.1 Document-level Datasets
      6.2 Document-level Evaluation
      6.3 Bias and Fairness of Document-level NMT
      6.4 Towards Practical Applications
    7 Conclusions
    Abstract (In Korean)
    Acknowledgment
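    The coreference-based contrastive learning summarised in the abstract above (CorefCL, Chapter 4) builds negative examples by corrupting coreferring mentions in the context. The sketch below mimics that data-augmentation step together with an InfoNCE-style loss on sentence vectors; the mention list, the masking rule, and the temperature are assumptions for illustration.

```python
import numpy as np

def corrupt_coreference(context: str, mentions: list[str], filler: str = "<mask>") -> str:
    """Build a contrastive (negative) context by masking coreferring mentions."""
    for m in mentions:
        context = context.replace(m, filler)
    return context

def contrastive_loss(src: np.ndarray, pos: np.ndarray, neg: np.ndarray, tau: float = 0.1) -> float:
    """InfoNCE-style loss: the source representation should sit closer to the
    original context (pos) than to the coreference-corrupted one (neg)."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([sim(src, pos), sim(src, neg)]) / tau
    logits -= logits.max()
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

print(corrupt_coreference("Sarah lost her keys.", ["Sarah", "her"]))  # negative example

rng = np.random.default_rng(0)
src, pos = rng.normal(size=16), rng.normal(size=16)
neg = pos + rng.normal(scale=2.0, size=16)  # corrupted context drifts away from pos
print(round(contrastive_loss(src, pos, neg), 3))
```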

    Towards a better integration of fuzzy matches in neural machine translation through data augmentation

    We identify a number of aspects that can boost the performance of Neural Fuzzy Repair (NFR), an easy-to-implement method for integrating translation memory matches into neural machine translation (NMT). We explore various ways of maximising the added value of retrieved matches within the NFR paradigm for eight language combinations, using Transformer NMT systems. In particular, we test the impact of different fuzzy matching techniques, sub-word-level segmentation methods and alignment-based features on overall translation quality. Furthermore, we propose a fuzzy match combination technique that aims to maximise the coverage of source words. This is supplemented with an analysis of how translation quality is affected by input sentence length and fuzzy match score. The results show that applying a combination of the tested modifications leads to a significant increase in estimated translation quality over all baselines for all language combinations.
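    The core NFR idea, appending the target side of retrieved fuzzy matches to the source sentence before feeding it to the NMT system, can be sketched as below. The separator token, the difflib-based retrieval metric, and the greedy coverage heuristic are simplified assumptions, not the paper's exact setup.

```python
from difflib import SequenceMatcher

SEP = "<FM>"  # hypothetical separator token marking fuzzy-match content

# Toy English-Dutch translation memory (illustrative data only).
tm = [
    ("the engine must be cooled before inspection", "de motor moet afkoelen voor inspectie"),
    ("open the valve slowly", "open de klep langzaam"),
]

def best_matches(src: str, memory, k: int = 2):
    """Rank TM entries by similarity to the input source (simplified metric)."""
    scored = [(SequenceMatcher(None, src, s).ratio(), s, t) for s, t in memory]
    return sorted(scored, reverse=True)[:k]

def augment_source(src: str, memory) -> str:
    """Greedy combination: append match targets while they cover new source words."""
    covered, parts = set(), [src]
    for score, m_src, m_tgt in best_matches(src, memory):
        new_words = set(src.split()) & set(m_src.split()) - covered
        if score > 0.4 and new_words:   # thresholds are assumptions
            covered |= new_words
            parts += [SEP, m_tgt]
    return " ".join(parts)

# The augmented string is what would be fed to the Transformer NMT system.
print(augment_source("open the valve before inspection", tm))
```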
    • โ€ฆ