599 research outputs found

    Neural Coreference Resolution for Turkish

    Coreference resolution deals with resolving mentions of the same underlying entity in a given text. This challenging task is an indispensable aspect of text understanding and has important applications in various language processing systems such as question answering and machine translation. Although a significant number of studies are devoted to coreference resolution, research on Turkish is scarce and mostly limited to pronoun resolution. To the best of our knowledge, this article presents the first neural Turkish coreference resolution study, in which two learning-based models are explored. Both models follow the mention-ranking approach while forming clusters of mentions. The first model uses a set of hand-crafted features, whereas the second relies on embeddings learned from large-scale pre-trained language models to capture similarities between a mention and its candidate antecedents. Several language models trained specifically for Turkish are used to obtain mention representations, and their effectiveness is compared experimentally using automatic metrics. We argue that the results of this study shed light on the possible contributions of neural architectures to Turkish coreference resolution.
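The mention-ranking idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the cosine scoring function, the fixed `threshold` (standing in for a "no antecedent" option), and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_and_cluster(embeddings, threshold=0.5):
    """Link each mention to its best-scoring earlier mention.

    A mention whose best antecedent score does not exceed `threshold`
    (a hypothetical stand-in for a learned "no antecedent" score)
    starts a new cluster. Returns one cluster id per mention.
    """
    cluster_of = []
    next_id = 0
    for i, emb in enumerate(embeddings):
        best_j, best_score = None, threshold
        # Candidate antecedents are all mentions that precede mention i.
        for j in range(i):
            score = cosine(emb, embeddings[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            cluster_of.append(next_id)
            next_id += 1
        else:
            cluster_of.append(cluster_of[best_j])
    return cluster_of


# Toy mention embeddings (hypothetical values).
mentions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = rank_and_cluster(mentions)
```

In a learned system the score would come from a trained model over mention pairs rather than raw cosine similarity; the clustering step by best-ranked antecedent is the part this sketch is meant to show.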

    Conditional Random Field Autoencoders for Unsupervised Structured Prediction

    We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is (re)generated, conditional on the latent structure, using models for which maximum likelihood estimation has a closed-form solution. Our autoencoder formulation enables efficient learning without making unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. We show competitive results with instantiations of the model for two canonical NLP tasks, part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.

    Extended Multilingual Protest News Detection -- Shared Task 1, CASE 2021 and 2022

    We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension expands the test data with more data in the previously available languages, namely English, Hindi, Portuguese, and Spanish, and adds new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese, and Spanish were utilized. Therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop accepts reports on systems developed for predicting test data of CASE 2021 as well. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in a zero-shot setting. The winning approaches are mainly ensembling models and merging data in multiple languages. The best two submissions on CASE 2021 data outperform submissions from last year for Subtask 1 and Subtask 2 in all languages. Only the following scenarios were not outperformed by new submissions on CASE 2021: Subtask 3 Portuguese and Subtask 4 English. Comment: To appear in CASE 2022 @ EMNLP 202
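The F1-macro figures quoted above are unweighted averages of per-class F1 scores. The following is a minimal sketch of that computation, not the shared task's official scorer; the function name `macro_f1` and the toy labels are assumptions for illustration.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy document-classification labels (1 = protest news, 0 = other).
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = macro_f1(gold, pred)
```

Because each class contributes equally regardless of its frequency, macro averaging penalizes systems that do well only on the majority class, which matters for imbalanced document classification data.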

    A Study on Context-Aware Document-Level Neural Machine Translation

    Doctoral thesis -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2022.2. 정교민 (Kyomin Jung).

    Neural machine translation (NMT) has attracted great attention in recent years, as it has yielded state-of-the-art translation quality. Despite these promising results, many current NMT systems operate at the sentence level, translating each sentence independently. This ignores the context of the text and thus produces inadequate and inconsistent translations at the document level. To overcome these shortcomings, context-aware NMT (CNMT), which takes contextual sentences as input, has been proposed. This dissertation proposes novel methods for improving CNMT systems and an application of CNMT. We first tackle the efficient modeling of multiple contextual sentences in the CNMT encoder. For this purpose, we propose a hierarchical context encoder that encodes contextual sentences from the token level to the sentence level. This novel architecture enables the model to achieve state-of-the-art translation quality while taking less computation time for training and translation than existing methods. Secondly, we investigate the training method for CNMT models, most of which rely on a negative log-likelihood (NLL) loss that does not fully exploit contextual dependencies. To overcome this insufficiency, we introduce coreference-based contrastive learning for CNMT, which generates contrastive examples from coreference chains between the source and target sentences. The proposed method improves the pronoun resolution accuracy of CNMT models as well as overall translation quality. Finally, we investigate an application of CNMT to Korean honorifics, which depend on contextual information for generating adequate translations. For the English-Korean translation task, we propose to use CNMT models that capture crucial contextual information in the English source document and adopt a context-aware post-editing system that exploits context in the Korean target sentences, resulting in more consistent Korean honorific translations.

    Contents:
    1 Introduction
    2 Background: Neural Machine Translation
        2.1 A Brief History
        2.2 Problem Setup
        2.3 Encoder-Decoder architectures
            2.3.1 RNN-based Architecture
            2.3.2 SAN-based Architecture
        2.4 Training
        2.5 Decoding
        2.6 Evaluation
    3 Efficient Hierarchical Architecture for Modeling Contextual Sentences
        3.1 Related works
            3.1.1 Modeling Context in NMT
            3.1.2 Hierarchical Context Modeling
            3.1.3 Evaluation of Context-aware NMT
        3.2 Model description
            3.2.1 Context-aware NMT encoders
            3.2.2 Hierarchical context encoder
        3.3 Data
            3.3.1 English-German IWSLT 2017 corpus
            3.3.2 OpenSubtitles corpus
            3.3.3 English-Korean subtitle corpus
        3.4 Experiments
            3.4.1 Hyperparameters and Training details
            3.4.2 Overall BLEU evaluation
            3.4.3 Model complexity analysis
            3.4.4 BLEU evaluation on helpful/unhelpful context
            3.4.5 EnKo pronoun resolution test suite
            3.4.6 Qualitative Analysis
        3.5 Summary of Efficient Hierarchical Architecture for Modeling Contextual Sentences
    4 Contrastive Learning for Context-aware Neural Machine Translation
        4.1 Related Works
            4.1.1 Context-aware NMT Architectures
            4.1.2 Coreference and NMT
            4.1.3 Data augmentation for NMT
            4.1.4 Contrastive Learning
        4.2 Context-aware NMT models
        4.3 Our Method: CorefCL
            4.3.1 Data Augmentation Using Coreference
            4.3.2 Contrastive Learning for Context-aware NMT
        4.4 Experiments
            4.4.1 Datasets
            4.4.2 Settings
            4.4.3 Overall BLEU Evaluation
            4.4.4 Results on English-German Contrastive Evaluation Set
            4.4.5 Analysis
        4.5 Summary of Contrastive Learning for Context-aware Neural Machine Translation
    5 Improving English-Korean Honorific Translation Using Contextual Information
        5.1 Related Works
            5.1.1 Neural Machine Translation dealing with Korean
            5.1.2 Controlling the Styles in NMT
            5.1.3 Context-Aware NMT Framework and Application
        5.2 Addressing Korean Honorifics in Context
            5.2.1 Overview of Korean Honorifics System
            5.2.2 The Role of Context on Choosing Honorifics
        5.3 Context-Aware NMT Frameworks
            5.3.1 NMT Model with Contextual Encoders
            5.3.2 Context-Aware Post Editing (CAPE)
        5.4 Our Proposed Method - Context-Aware NMT for Korean Honorifics
            5.4.1 Using CNMT methods for Honorific-Aware Translation
            5.4.2 Scope of Honorific Expressions
            5.4.3 Automatic Honorific Labeling
        5.5 Experiments
            5.5.1 Dataset and Preprocessing
            5.5.2 Model Implementation and Training Details
            5.5.3 Metrics
            5.5.4 Results
            5.5.5 Translation Examples and Analysis
        5.6 Summary of Improving English-Korean Honorific Translation Using Contextual Information
    6 Future Directions
        6.1 Document-level Datasets
        6.2 Document-level Evaluation
        6.3 Bias and Fairness of Document-level NMT
        6.4 Towards Practical Applications
    7 Conclusions

    The annotation scheme of the Turkish Discourse Bank and an evaluation of inconsistent annotations

    In this paper, we report on the annotation procedures we developed for annotating the Turkish Discourse Bank (TDB), an effort that extends the Penn Discourse Tree Bank (PDTB) annotation style by using it for annotating Turkish discourse. After a brief introduction to the TDB, we describe the annotation cycle and the annotation scheme we developed, defining which parts of the scheme are an extension of the PDTB and which parts are different. We provide inter-coder reliability calculations on the first and second arguments of some connectives and discuss the most important sources of disagreement among annotators.