5,172 research outputs found

    Transformer-based NMT: modeling, training and implementation

International trade and industrial collaborations enable countries and regions to concentrate their development on specific industries while making the most of other countries' specializations, which significantly accelerates global development. However, globalization also increases the demand for cross-region communication. Language barriers between the many languages spoken worldwide make deep collaboration between groups speaking different languages difficult, increasing the need for translation. Language technology, specifically Machine Translation (MT), holds the promise of enabling efficient, real-time communication between languages at minimal cost. Even though modern computers perform computation in parallel very fast, providing machine translation users with translations at very low latency, and although the evolution from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) driven by advanced deep learning algorithms has significantly boosted translation quality, current machine translation systems are still far from translating all input accurately. How to further improve the performance of state-of-the-art NMT models thus remains a valuable open research question that has received wide attention.

In the research presented in this thesis, we first investigate the long-distance relation modeling ability of the state-of-the-art NMT model, the Transformer. We propose to learn source phrase representations and incorporate them into the Transformer translation model, aiming to enhance its ability to capture long-distance dependencies. Second, although previous work (Bapna et al., 2018) suggests that deep Transformers have difficulty converging, we empirically find that the convergence of deep Transformers depends on the interaction between the layer normalization and residual connections employed to stabilize training. We conduct a theoretical study of how to ensure the convergence of Transformers, especially deep Transformers, and propose to ensure convergence by imposing a Lipschitz constraint on parameter initialization (a small initialization sketch follows this abstract). Finally, we investigate how to dynamically determine proper and efficient batch sizes during the training of the Transformer model. We find that the gradient direction stabilizes with increasing batch size during gradient accumulation. We therefore propose to dynamically adjust batch sizes during training by monitoring the change in gradient direction within gradient accumulation, stopping the accumulation once the gradient direction starts to fluctuate, which yields a proper and efficient batch size.
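The dynamic-batching idea lends itself to a short illustration. The following PyTorch-style sketch is one plausible reading of the criterion (stop accumulating once adding further micro-batches no longer changes the direction of the accumulated gradient), using cosine similarity and a fixed threshold of my own choosing; `model`, `loss_fn`, and `micro_batches` are hypothetical placeholders and this is not the thesis's exact implementation.

```python
import torch

def accumulate_until_stable(model, loss_fn, micro_batches, cos_threshold=0.9995):
    """Accumulate micro-batch gradients until the accumulated gradient's
    direction stops changing noticeably, then report the effective batch size.
    The cosine-similarity criterion and threshold are illustrative assumptions."""
    prev_flat = None
    n_accumulated = 0
    model.zero_grad()
    for src, tgt in micro_batches:
        loss_fn(model(src), tgt).backward()          # gradients sum into .grad
        n_accumulated += 1
        flat = torch.cat([p.grad.detach().reshape(-1)
                          for p in model.parameters() if p.grad is not None])
        if prev_flat is not None:
            cos = torch.nn.functional.cosine_similarity(flat, prev_flat, dim=0)
            if cos > cos_threshold:                  # direction has stabilized
                break                                # stop accumulating here
        prev_flat = flat.clone()
    return n_accumulated                             # batch size in micro-batches
```

After the loop one would apply the optimizer step on the accumulated gradient; loss scaling by the number of accumulated micro-batches is omitted for brevity.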
For our research in this thesis, we also implemented our own NMT toolkit, Neutron, an implementation of the Transformer and its variants. Besides providing the fundamental features on which our implementations of the approaches presented in this thesis are based, it supports many advanced features from recent cutting-edge research. Implementations of all approaches in this thesis are included and open-sourced in the toolkit.

To compare with previous approaches, we mainly conducted our experiments on the data from the WMT 14 English to German (En-De) and English to French (En-Fr) news translation tasks, except when studying the convergence of deep Transformers, where we replaced the WMT 14 En-Fr task with the WMT 15 Czech to English (Cs-En) news translation task to compare with Bapna et al. (2018). The sizes of these datasets vary from medium (WMT 14 En-De, ~4.5M sentence pairs) to very large (WMT 14 En-Fr, ~36M sentence pairs), so we expect our approaches to help improve translation quality between widely used language pairs with sufficient data.

China Scholarship Council
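Returning to the deep-Transformer convergence point above: the general flavor of depth-aware, Lipschitz-motivated initialization can be sketched as below. The 1/sqrt(2·num_layers) gain and the helper name `scaled_init_` are illustrative assumptions of mine, not the thesis's exact constraint.

```python
import math
import torch.nn as nn

def scaled_init_(linear: nn.Linear, num_layers: int) -> None:
    """Depth-aware initialization sketch: shrink weights so that stacking
    num_layers residual blocks keeps activation growth (and hence the residual
    branch's Lipschitz bound) under control. The gain choice is illustrative."""
    nn.init.xavier_uniform_(linear.weight, gain=1.0 / math.sqrt(2.0 * num_layers))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# usage sketch on a hypothetical `transformer` model:
# for layer in transformer.encoder.layers:
#     for module in layer.modules():
#         if isinstance(module, nn.Linear):
#             scaled_init_(module, num_layers=len(transformer.encoder.layers))
```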

    ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ฌธ๋งฅ ์ •๋ณด ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ์–ดํ…์…˜์„ ํ™œ์šฉํ•˜๋Š” ๊ณ„์ธต์  ๋ฌธ๋งฅ ์ธ์ฝ”๋”

Doctoral dissertation, Seoul National University, Department of Electrical and Computer Engineering, August 2022. Advisor: ์ •๊ต๋ฏผ.

Recently, the standard architecture for Natural Language Processing (NLP) has evolved from recurrent neural networks to the Transformer architecture. The Transformer consists of attention layers, which are strong at finding correlations between tokens and at incorporating the extracted information to generate proper output. While many studies leveraging the Transformer architecture report new state-of-the-art performance on various NLP tasks, these improvements also pose a new challenge to the deep learning community: exploiting additional context information. Because human intelligence perceives everyday signals together with rich contextual information (e.g., additional memory, visual information, and common sense), exploiting context information is a step toward the ultimate goal of Artificial Intelligence. In this dissertation, I propose novel methodologies and analyses to improve the context-awareness of the Transformer architecture, focusing on the attention mechanism, for various natural language processing tasks. The proposed methods utilize additionally given context information, which is not limited to the modality of natural language, alongside the given input. First, I propose the Hierarchical Memory Context Encoder (HMCE), which efficiently embeds the contextual information of preceding sentences via a hierarchical Transformer architecture and fuses the embedded context representation into the input representation via a memory attention mechanism.
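The memory-attention fusion described above can be illustrated with a small sketch. This is an assumption-laden, PyTorch-style toy module of mine (the gating, layer normalization, and hyperparameters are illustrative choices), not the exact HMCE design.

```python
import torch
import torch.nn as nn

class MemoryAttentionFusion(nn.Module):
    """Sketch of memory attention: current-sentence token states attend over a
    memory of encoded context sentences and are fused with the retrieved
    context through a learned gate plus a residual connection."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (batch, src_len, d_model)  current-sentence representations
        # memory: (batch, mem_len, d_model)  encoded preceding-sentence context
        ctx, _ = self.mem_attn(x, memory, memory)          # retrieve context per token
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
        return self.norm(x + g * ctx)                      # gated residual fusion
```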
The proposed HMCE outperforms the original Transformer, which does not leverage the additional context information, on various context-aware machine translation tasks; it also achieves the best BLEU among the baselines that use the additional context. Then, to improve the attention mechanism between the context representation and the input representation, I deeply analyze the representational similarity between the two. Based on these analyses of representational similarity inside the Transformer architecture, I propose a method for optimizing Centered Kernel Alignment (CKA) between internal representations of the Transformer (a generic CKA sketch appears after the table of contents below). The proposed CKA optimization method increases the performance of the Transformer on various machine translation and language modelling tasks. Lastly, I extend the CKA optimization method to a Modality Alignment method for multi-modal scenarios where the context information takes the modality of visual information. The Modality Alignment method enhances the cross-modal attention mechanism by maximizing the representational similarity between the visual and natural language representations, resulting in accuracy improvements larger than 3.5% on video question answering tasks.

Table of contents:
1 Introduction 1
2 Backgrounds 8
3 Context-aware Hierarchical Transformer Architecture 12
  3.1 Related Works 15
    3.1.1 Using Multiple Sentences for Context-awareness in Machine Translation 15
    3.1.2 Structured Neural Machine Translation Models for Context-awareness 16
    3.1.3 Evaluating Context-awareness with Generated Translation 16
  3.2 Proposed Approach: Context-aware Hierarchical Text Encoder with Memory Networks 16
    3.2.1 Context-aware NMT Encoders 17
    3.2.2 Hierarchical Memory Context Encoder 21
  3.3 Experiments 25
    3.3.1 Data 26
    3.3.2 Hyperparameters and Training Details 28
    3.3.3 Overall BLEU Evaluation 28
    3.3.4 Model Complexity Analysis 30
    3.3.5 BLEU Evaluation on Helpful/Unhelpful Context 31
    3.3.6 Qualitative Analysis 32
    3.3.7 Limitations and Future Directions 34
  3.4 Conclusion 35
4 Optimizing Representational Diversity of Transformer Architecture 36
  4.1 Related Works 38
    4.1.1 Analyses of Diversity in Multi-Head Attention 38
    4.1.2 Similarities between Deep Neural Representations 39
  4.2 Similarity Measures for Multi-Head Attention 40
    4.2.1 Multi-Head Attention 40
    4.2.2 Singular Vector Canonical Correlation Analysis (SVCCA) 41
    4.2.3 Centered Kernel Alignment (CKA) 43
  4.3 Proposed Approach: Controlling Inter-Head Diversity 43
    4.3.1 HSIC Regularizer 44
    4.3.2 Orthogonality Regularizer 44
    4.3.3 Drophead 45
  4.4 Inter-Head Similarity Analyses 46
    4.4.1 Experimental Details for Similarity Analysis 46
    4.4.2 Applying SVCCA and CKA 47
    4.4.3 Analyses on Inter-Model Similarity 47
    4.4.4 Does Multi-Head Strategy Diversify a Model's Representation Subspaces 49
  4.5 Experiments on Controlling Inter-Head Similarity Methods 52
    4.5.1 Experimental Details 52
    4.5.2 Analysis on Controlling Inter-Head Diversity 54
    4.5.3 Quantitative Evaluation 55
    4.5.4 Limitations and Future Directions 57
  4.6 Conclusions 60
5 Modality Alignment for Cross-modal Attention 61
  5.1 Related Works 63
    5.1.1 Representation Similarity between Modalities 63
    5.1.2 Video Question Answering 64
  5.2 Proposed Approach: Modality Align between Multi-modal Representations 65
    5.2.1 Centered Kernel Alignment Review 65
    5.2.2 Why CKA is Proper to Modality Alignment 66
    5.2.3 Proposed Method 69
  5.3 Experiments 71
    5.3.1 Cosine Similarity Learning with CKA 72
    5.3.2 Modality Align on Video Question Answering Task 75
  5.4 Conclusion 82
6 Conclusion 83
Abstract (In Korean) 97
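Both the CKA optimization and the Modality Alignment method in the abstract above center on Centered Kernel Alignment, so a small sketch of linear CKA may help. This is a generic, PyTorch-style implementation of linear CKA as commonly defined; using it as an auxiliary alignment term (the commented usage line, with a hypothetical `alignment_weight`) is an illustration, not the dissertation's exact training objective.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between two sets of representations.
    x: (n, d1), y: (n, d2) -- n paired examples, feature sizes may differ.
    Returns a scalar in [0, 1]; higher means more similar representations."""
    x = x - x.mean(dim=0, keepdim=True)        # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = y.t() @ x                          # (d2, d1) cross-covariance
    num = (cross ** 2).sum()                   # ||Y^T X||_F^2
    den = torch.norm(x.t() @ x) * torch.norm(y.t() @ y)
    return num / den

# hedged usage sketch: encourage visual and text representations to align
# loss = task_loss - alignment_weight * linear_cka(visual_feats, text_feats)
```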

    An empirical analysis of phrase-based and neural machine translation

Two popular types of machine translation (MT) systems are phrase-based and neural machine translation systems. Both types are composed of multiple complex models or layers, and each of these models and layers learns different linguistic aspects of the source language. However, for some of them it is not clear which linguistic phenomena are learned or how this information is learned. For phrase-based MT systems, it is often clear what information is learned by each model, and the question is rather how this information is learned, especially for the phrase reordering model. For neural machine translation systems, the situation is even more complex, since in many cases it is not exactly clear what information is learned and how it is learned. To shed light on which linguistic phenomena are captured by MT systems, we analyze the behavior of important models in both phrase-based and neural MT systems. We consider phrase reordering models from phrase-based MT systems to investigate which words inside a phrase have the biggest impact on defining the phrase reordering behavior. Additionally, to contribute to the interpretability of neural MT systems, we study the behavior of the attention model, which is a key component of neural MT systems and the closest model in functionality to phrase reordering models in phrase-based systems. The attention model, together with the encoder hidden state representations, forms the main component encoding source-side linguistic information in neural MT. To this end, we also analyze the information captured in the encoder hidden state representations of a neural MT system, and investigate the extent to which syntactic and lexical-semantic information from the source side is captured by the hidden state representations of different neural MT architectures.
Comment: PhD thesis, University of Amsterdam, October 2020. https://pure.uva.nl/ws/files/51388868/Thesis.pd
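One common way to quantify how much syntactic or lexical-semantic information hidden states carry is to train a lightweight diagnostic classifier on frozen representations. The sketch below is a generic probing setup of my own, not necessarily the thesis's exact methodology; `hidden_states` and `pos_tags` are hypothetical arrays of per-token encoder states and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_hidden_states(hidden_states: np.ndarray, pos_tags: np.ndarray) -> float:
    """Train a linear diagnostic classifier on frozen NMT encoder states
    (one row per token). Held-out accuracy above a majority-class baseline
    suggests the probed property (here, POS) is linearly recoverable."""
    x_tr, x_te, y_tr, y_te = train_test_split(hidden_states, pos_tags,
                                              test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)
```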