222 research outputs found

    Coreference resolution in Basque texts (Korreferentzia-ebazpena euskarazko testuetan).

    203 p. Nowadays, automatic coreference resolution can be considered key to understanding texts; consequently, it is essential in many Natural Language Processing (NLP) tasks that require a deep understanding of discourse. When two textual expressions in a text denote or refer to the same object, a coreference relation is said to hold between them. The task that aims to resolve the coreference relations among the textual expressions that can appear in a text is called coreference resolution. This thesis is situated in the field of computational linguistics and targets the automatic coreference resolution of texts written in Basque; more precisely, it aims to fill the gap in resources and tools for performing automatic coreference resolution in Basque. The thesis first presents the rule-based tool we developed to automatically identify the textual expressions that can appear in Basque texts. Next, it presents how the rule-based coreference resolution system designed for English at Stanford University was adapted to the characteristics of Basque, and how we improved it using semantic knowledge bases. Finally, it describes the work done to adapt the machine-learning-based BART coreference resolution system to Basque and to improve it.

    A Study on Context-Aware Document-Level Neural Machine Translation (문맥 인식기반의 문서 단위 신경망 기계 번역 연구)

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, 2022.2. Kyomin Jung (정교민).

    Neural machine translation (NMT) has attracted great attention in recent years, as it has yielded state-of-the-art translation quality. Despite their promising results, many current NMT systems are sentence-level: they translate each sentence independently. This ignores the context in the text and thus produces inadequate and inconsistent translations at the document level. To overcome these shortcomings, context-aware NMT (CNMT), which takes contextual sentences as input, has been proposed. This dissertation proposes novel methods for improving CNMT systems and an application of CNMT. We first tackle the efficient modeling of multiple contextual sentences in the CNMT encoder. For this purpose, we propose a hierarchical context encoder that encodes contextual sentences from the token level to the sentence level. This architecture enables the model to achieve state-of-the-art translation quality while taking less computation time for training and translation than existing methods. Secondly, we investigate the training method for CNMT models, most of which rely on the negative log-likelihood (NLL) loss, which does not fully exploit contextual dependencies. To overcome this insufficiency, we introduce coreference-based contrastive learning for CNMT, which generates contrastive examples from coreference chains between the source and target sentences. The proposed method improves the pronoun resolution accuracy of CNMT models as well as overall translation quality. Finally, we investigate an application of CNMT to Korean honorifics, which depend on contextual information for generating adequate translations. For the English-Korean translation task, we propose to use CNMT models that capture crucial contextual information in the English source document and to adopt a context-aware post-editing system that exploits context in the Korean target sentences, resulting in more consistent Korean honorific translations.

    Contents:
    1 Introduction
    2 Background: Neural Machine Translation
      2.1 A Brief History
      2.2 Problem Setup
      2.3 Encoder-Decoder architectures: 2.3.1 RNN-based Architecture; 2.3.2 SAN-based Architecture
      2.4 Training
      2.5 Decoding
      2.6 Evaluation
    3 Efficient Hierarchical Architecture for Modeling Contextual Sentences
      3.1 Related works: 3.1.1 Modeling Context in NMT; 3.1.2 Hierarchical Context Modeling; 3.1.3 Evaluation of Context-aware NMT
      3.2 Model description: 3.2.1 Context-aware NMT encoders; 3.2.2 Hierarchical context encoder
      3.3 Data: 3.3.1 English-German IWSLT 2017 corpus; 3.3.2 OpenSubtitles corpus; 3.3.3 English-Korean subtitle corpus
      3.4 Experiments: 3.4.1 Hyperparameters and Training details; 3.4.2 Overall BLEU evaluation; 3.4.3 Model complexity analysis; 3.4.4 BLEU evaluation on helpful/unhelpful context; 3.4.5 EnKo pronoun resolution test suite; 3.4.6 Qualitative Analysis
      3.5 Summary of Efficient Hierarchical Architecture for Modeling Contextual Sentences
    4 Contrastive Learning for Context-aware Neural Machine Translation
      4.1 Related Works: 4.1.1 Context-aware NMT Architectures; 4.1.2 Coreference and NMT; 4.1.3 Data augmentation for NMT; 4.1.4 Contrastive Learning
      4.2 Context-aware NMT models
      4.3 Our Method: CorefCL: 4.3.1 Data Augmentation Using Coreference; 4.3.2 Contrastive Learning for Context-aware NMT
      4.4 Experiments: 4.4.1 Datasets; 4.4.2 Settings; 4.4.3 Overall BLEU Evaluation; 4.4.4 Results on English-German Contrastive Evaluation Set; 4.4.5 Analysis
      4.5 Summary of Contrastive Learning for Context-aware Neural Machine Translation
    5 Improving English-Korean Honorific Translation Using Contextual Information
      5.1 Related Works: 5.1.1 Neural Machine Translation dealing with Korean; 5.1.2 Controlling the Styles in NMT; 5.1.3 Context-Aware NMT Framework and Application
      5.2 Addressing Korean Honorifics in Context: 5.2.1 Overview of Korean Honorifics System; 5.2.2 The Role of Context on Choosing Honorifics
      5.3 Context-Aware NMT Frameworks: 5.3.1 NMT Model with Contextual Encoders; 5.3.2 Context-Aware Post Editing (CAPE)
      5.4 Our Proposed Method - Context-Aware NMT for Korean Honorifics: 5.4.1 Using CNMT methods for Honorific-Aware Translation; 5.4.2 Scope of Honorific Expressions; 5.4.3 Automatic Honorific Labeling
      5.5 Experiments: 5.5.1 Dataset and Preprocessing; 5.5.2 Model Implementation and Training Details; 5.5.3 Metrics; 5.5.4 Results; 5.5.5 Translation Examples and Analysis
      5.6 Summary of Improving English-Korean Honorific Translation Using Contextual Information
    6 Future Directions
      6.1 Document-level Datasets
      6.2 Document-level Evaluation
      6.3 Bias and Fairness of Document-level NMT
      6.4 Towards Practical Applications
    7 Conclusions
    Front and back matter: Abstract; Contents; List of Tables; List of Figures; Abstract (In Korean); Acknowledgment
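
    The abstract above describes contrastive learning driven by coreference chains but does not give the objective itself. The following is only a minimal sketch of one way such a scheme could look, under loudly stated assumptions: the context is corrupted by masking tokens that belong to a coreference chain, and a max-margin term asks the model to score the reference translation higher with the original context than with the corrupted one. The function names `mask_mentions` and `margin_loss`, the `<mask>` token, and the margin value are all hypothetical and are not taken from the dissertation.

    ```python
    def mask_mentions(context_tokens, mention_spans, mask="<mask>"):
        """Return a corrupted copy of the context with coreferent mentions masked.

        mention_spans holds (start, end) token offsets, end exclusive.
        Both the span format and the mask token are assumptions.
        """
        corrupted = list(context_tokens)
        for start, end in mention_spans:
            for k in range(start, end):
                corrupted[k] = mask
        return corrupted


    def margin_loss(logp_original, logp_corrupted, margin=1.0):
        """Max-margin contrastive term (an assumed form, not the thesis's loss).

        The loss is zero once the log-probability of the reference translation
        under the original context exceeds the log-probability under the
        corrupted context by at least `margin`.
        """
        return max(0.0, margin - (logp_original - logp_corrupted))


    # Toy usage: the log-probabilities would come from a CNMT model in practice.
    context = ["The", "doctor", "arrived", "late", "."]
    corrupted = mask_mentions(context, [(0, 2)])
    loss = margin_loss(logp_original=-2.0, logp_corrupted=-2.5, margin=1.0)
    ```

    In a training loop, a term like this would typically be added to the usual NLL loss, so the model is still trained to translate while being penalized for ignoring the coreferent context.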

    Towards Multilingual Coreference Resolution

    The current work investigates the problems that occur when coreference resolution is considered as a multilingual task. We assess the issues that arise when a framework combining the mention-pair coreference resolution model with memory-based learning is used for the resolution process. Along the way, we revise three essential subtasks of coreference resolution: mention detection, mention head detection and feature selection. For each of these aspects we propose various multilingual solutions, including heuristic, rule-based and machine learning methods. We carry out a detailed analysis that includes eight different languages (Arabic, Catalan, Chinese, Dutch, English, German, Italian and Spanish) for which datasets were provided by the only two multilingual shared tasks on coreference resolution held so far: SemEval-2 and CoNLL-2012. Our investigation shows that, although complex, the coreference resolution task can be targeted in a multilingual and even language-independent way. We proposed machine learning methods for each of the subtasks that are affected by the transition, then evaluated them and compared them to the performance of rule-based and heuristic approaches. Our results confirmed that machine learning provides the needed flexibility for the multilingual task and that the minimal requirement for a language-independent system is a part-of-speech annotation layer provided for each of the approached languages. We also showed that the performance of the system can be improved by introducing other layers of linguistic annotation, such as syntactic parses (in the form of either constituency or dependency parses), named entity information, predicate argument structure, etc. Additionally, we discuss the problems occurring in the proposed approaches and suggest possibilities for their improvement
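
    For readers unfamiliar with the mention-pair model named in this abstract, here is a minimal sketch of the general idea: every pair of mentions is classified as coreferent or not, and positive links are merged into entity clusters. The pairwise decision below (`same_head`, matching on a lowercased head word) is a toy stand-in chosen for illustration; the work itself uses memory-based learning over richer features, and none of these names or data structures come from it.

    ```python
    from itertools import combinations


    def mention_pairs(mentions):
        """Enumerate candidate (antecedent, anaphor) pairs in document order."""
        return list(combinations(mentions, 2))


    def same_head(antecedent, anaphor):
        """Toy pairwise decision (an assumption): link mentions with equal heads."""
        return antecedent["head"].lower() == anaphor["head"].lower()


    def cluster(mentions, is_coreferent=same_head):
        """Merge positively classified pairs into clusters using union-find."""
        parent = list(range(len(mentions)))

        def find(i):
            # Follow parent links to the cluster root, compressing the path.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(len(mentions)), 2):
            if is_coreferent(mentions[i], mentions[j]):
                parent[find(j)] = find(i)

        clusters = {}
        for i, mention in enumerate(mentions):
            clusters.setdefault(find(i), []).append(mention["text"])
        return list(clusters.values())


    mentions = [
        {"text": "the bank", "head": "bank"},
        {"text": "a river", "head": "river"},
        {"text": "The old bank", "head": "bank"},
    ]
    entities = cluster(mentions)
    ```

    Swapping `same_head` for a trained classifier keeps the rest of the pipeline unchanged, which is one reason the mention-pair formulation transfers across languages with little language-specific code.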

    ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT

    We present ParCor, a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future

    Modeling contextual information in neural machine translation

    Machine translation has provided impressive translation quality for many language pairs. The improvements over the past few years are largely due to the introduction of neural networks to the field, resulting in the modern sequence-to-sequence neural machine translation (NMT) models. NMT is at the core of many large-scale industrial tools for automatic translation such as Google Translate, Microsoft Translator, Amazon Translate and many others. Current NMT models work on the sentence level, meaning they are used to translate individual sentences. However, for most practical use cases, a user is interested in translating a document. In these cases, an MT tool splits a document into individual sentences and translates them independently. As a result, any dependencies between the sentences are ignored. This is likely to result in an incoherent document translation, mainly because of inconsistent translation of ambiguous source words or wrong translation of anaphoric pronouns. For example, it is undesirable to translate "bank" as a "financial bank" in one sentence and then later as a "river bank". Furthermore, the translation of, e.g., the English third person pronoun "it" into German depends on the grammatical gender of the English antecedent's German translation. NMT has shown that it has impressive modeling capabilities, but is nevertheless unable to model discourse-level phenomena, as it needs access to contextual information. In this work, we study discourse-level phenomena in context-aware NMT. To facilitate the particular studies of interest, we propose several models capable of incorporating contextual information into standard sentence-level NMT models. We direct our focus on several discourse phenomena, namely, coreference (anaphora) resolution, coherence and cohesion. We discuss these phenomena in terms of how well they can be modeled by context-aware NMT, how we can improve upon the current state of the art, and the optimal granularity at which these phenomena should be modeled. We further investigate domain as a factor in context-aware NMT. Finally, we investigate existing challenge sets for anaphora resolution evaluation and provide a robust alternative. We make the following contributions: i) We study the importance of coreference (anaphora) resolution and coherence for context-aware NMT by making use of oracle information specific to these phenomena. ii) We propose a method for improving performance on anaphora resolution based on curriculum learning, which is inspired by the way humans organize learning. iii) We investigate the use of contextual information for better handling of domain information, in particular in the case of modeling multiple domains at once and when applied to zero-resource domains. iv) We present several context-aware models to enable us to examine the specific phenomena of interest we already mentioned. v) We study the optimal way of modeling local and global context and present a model theoretically capable of using very large document context. vi) We study the robustness of challenge sets for evaluation of anaphora resolution in MT by means of adversarial attacks and provide a template test set that robustly evaluates specific steps of an idealized coreference resolution pipeline for MT

    Mining semantics for culturomics: towards a knowledge-based approach

    The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily and older texts being digitized in cultural heritage projects grows at an accelerating rate. These volumes of text being available in digital form have grown far beyond the capacity of human readers, leaving automated semantic processing of the texts as the only realistic option for accessing and using the information contained in them. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of Big Swedish text and focus on the theoretical and methodological advancement of the state of the art in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods

    Computational modelling of coreference and bridging resolution


    Neural Graph Transfer Learning in Natural Language Processing Tasks

    Natural language is essential in our daily lives as we rely on languages to communicate and exchange information. A fundamental goal of natural language processing (NLP) is to let the machine understand natural language in order to help or replace human experts in mining knowledge and completing tasks. Many NLP tasks deal with sequential data. For example, a sentence is considered as a sequence of words. Very recently, deep learning-based language models (e.g., BERT; Devlin et al., 2018) achieved significant improvement in many existing tasks, including text classification and natural language inference. However, not all tasks can be formulated using sequence models. Specifically, graph-structured data is also fundamental in NLP, including entity linking, entity classification, relation extraction, abstract meaning representation, and knowledge graphs (Santoro et al., 2017; Hamilton et al., 2017; Kipf and Welling, 2016). In this scenario, BERT-based pretrained models may not be suitable. The Graph Convolutional Network (GCN; Kipf and Welling, 2016) is a deep neural network model designed for graphs. It has shown great potential in text classification, link prediction, question answering and so on. This dissertation presents novel graph models for NLP tasks, including text classification, prerequisite chain learning, and coreference resolution. We focus on different perspectives of graph convolutional network modeling: for text classification, a novel graph construction method is proposed which allows interpretability for the prediction; for prerequisite chain learning, we propose multiple aggregation functions that utilize neighbors for better information exchange; for coreference resolution, we study how graph pretraining can help when labeled data is limited. Moreover, an important branch is to apply pretrained language models to the mentioned tasks. This dissertation therefore also focuses on transfer learning methods that generalize pretrained models to other domains, including medical, cross-lingual, and web data. Finally, we propose a new task called unsupervised cross-domain prerequisite chain learning, and study novel graph-based methods to transfer knowledge over graphs