    A corpus-based study on synesthesia in Korean ordinary language

    Combining textual features with sentence embeddings

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, August 2021. Author: 박수지.

    The goal of this thesis is to develop language models that predict the quality of Korean news articles. The need for news quality prediction has grown with the recent flood of fake news, yet the latest natural language processing techniques have not been applied to the task. To overcome this limitation, the thesis develops an SBERT model that represents sentence meaning and examines whether linguistic features of articles can improve quality classification. Both the machine learning model using textual features such as readability and cohesion and the transfer learning model using contextual features extracted automatically from SBERT outperformed the deep learning results of previous work. Performance improved further when the SBERT training data were expanded and refined, and when textual and contextual features were used together. The thesis concludes that linguistic features play an important role in news quality and that SBERT, a state-of-the-art natural language processing technique, contributes substantially to extracting and using such features.

    Contents:
    1 Introduction
    2 Literature Review: 2.1 Background (2.1.1 Text Classification: Initial Studies, News Classification; 2.1.2 Text Quality Assessment); 2.2 News Quality Prediction Task (2.2.1 News Data: Online vs. Offline, Expert-rated vs. User-rated; 2.2.2 Prediction Methods: Manually Engineered Features vs. Automatically Extracted Features, Machine Learning vs. Deep Learning); 2.3 Instruments and Techniques (2.3.1 Sentence and Document Embeddings: Static Embeddings, Contextual Embeddings; 2.3.2 Fusion Models); 2.4 Summary
    3 Methods: 3.1 Data from Choi, Shin, and Kang (2021) (3.1.1 News Corpus; 3.1.2 Quality Levels; 3.1.3 Journalism Values); 3.2 Linguistic Features (3.2.1 Justification of Using Linguistic Features Only; 3.2.2 Two Types of Linguistic Features: Textual Features, Contextual Features); 3.3 Summary
    4 Ordinal Logistic Regression Models with Textual Features: 4.1 Textual Features (4.1.1 Coh-Metrix; 4.1.2 KOSAC Lexicon; 4.1.3 K-LIWC; 4.1.4 Others); 4.2 Ordinal Logistic Regression; 4.3 Results (4.3.1 Feature Selection; 4.3.2 Impacts on Quality Evaluation); 4.4 Discussion (4.4.1 Effect of Cosine Similarity by Issue; 4.4.2 Effect of Quantitative Evidence; 4.4.3 Effect of Sentiment); 4.5 Summary
    5 Deep Transfer Learning Models with Contextual Features: 5.1 Contextual Features from SentenceBERT (5.1.1 Necessity of Sentence Embeddings; 5.1.2 KR-SBERT); 5.2 Deep Transfer Learning; 5.3 Results (5.3.1 Measures of Multiclass Classification; 5.3.2 Performances of News Quality Prediction Models); 5.4 Discussion (5.4.1 Effect of Data Size; 5.4.2 Effect of Data Augmentation; 5.4.3 Effect of Data Refinement); 5.5 Summary
    6 Fusion Models Combining Textual Features with Contextual Sentence Embeddings: 6.1 Model Fusion (6.1.1 Feature-level Fusion: Concatenation; 6.1.2 Logit-level Fusion: Interpolation); 6.2 Results (6.2.1 Optimization of the Presentational Attribute Model; 6.2.2 Performances of News Quality Prediction Models); 6.3 Discussion (6.3.1 Effects of Fusion; 6.3.2 Comparison with Choi et al. (2021)); 6.4 Summary
    7 Conclusion
    References
    Appendices: A List of Words Used for Textual Feature Extraction (A.1 Coh-Metrix Features; A.2 Predicate Type Features); B Codes Used in Chapter 4 (B.1 Python Code for Textual Feature Extraction); C Results of VIF Test and Brant Test (C.1 VIF Test in R; C.2 Brant Test in R); D Codes Used in Chapter 6 (D.1 Python Code for Feature-Level Fusion; D.2 Python Code for Logit-Level Fusion)
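The outline above names two fusion strategies: feature-level fusion by concatenation and logit-level fusion by interpolation. The following is a minimal sketch of the general idea only, not the code from the thesis appendices. All function names, the vector sizes, the interpolation weight, and the numeric values are hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_level_fusion(textual, contextual):
    """Concatenate two feature vectors into one input for a single classifier."""
    return list(textual) + list(contextual)


def logit_level_fusion(textual_logits, contextual_logits, weight):
    """Interpolate the logits of two separately trained models.

    `weight` is the share given to the textual-feature model (assumed to be
    in [0, 1]); the contextual model receives 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(textual_logits) != len(contextual_logits):
        raise ValueError("both models must score the same set of classes")
    return [
        weight * t + (1.0 - weight) * c
        for t, c in zip(textual_logits, contextual_logits)
    ]


# Hypothetical inputs: 3 hand-crafted textual features, 4 embedding dimensions.
textual = [0.42, 0.10, 0.77]
contextual = [0.05, -0.31, 0.88, 0.12]
fused_features = feature_level_fusion(textual, contextual)

# Hypothetical per-class logits from two models over three quality levels.
textual_logits = [1.0, 0.0, -1.0]
contextual_logits = [0.0, 2.0, 0.0]
fused_logits = logit_level_fusion(textual_logits, contextual_logits, weight=0.5)

probs = softmax(fused_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch, feature-level fusion feeds one classifier a longer input vector, whereas logit-level fusion keeps the two models separate and mixes only their outputs, so the interpolation weight can be tuned without retraining either model.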

    Machine translation and Korean language

    This bachelor thesis deals with machine translation in relation to the Korean language. The first, theoretical part presents the main approaches to machine translation in chronological order, together with the advantages and disadvantages of each model, illustrated with examples from Korean. The second, practical part introduces two modern global machine translators, Naver Papago and Google Translate. Both translators are given Korean text samples from journalism, commerce, and spoken language. This part also includes a correct English translation of each sample for comparison with the machine output. The purpose of these translations is to give insight into typical translation errors, which are analyzed from the viewpoint of the theoretical part, linking theory with practice. The two translators are also compared with each other and serve as an example of how the same machine translation models can produce different translations and errors. The last part contains a brief summary of the current state of machine translation and its prospects for the future.

    Department of Sinology, Faculty of Arts

    Statistical models for case ambiguity resolution in Korean

    The written British National Corpus 2014: design, compilation and analysis

    The ESRC-funded Centre for Corpus Approaches to Social Science at Lancaster University (CASS) and the English Language Teaching Group at Cambridge University Press (CUP) have collaborated to compile a new, publicly accessible corpus of contemporary written British English, known as the Written British National Corpus 2014 (Written BNC2014). The Written BNC2014 is an updated version of the Written British National Corpus (Written BNC1994), which was created in the 1990s. The Written BNC1994 is often used as a proxy for present-day British English, so the Written BNC2014 has been created both to allow comparisons between the two corpora and to allow research on British English to be carried out using a state-of-the-art contemporary dataset. The Written BNC2014 contains approximately 90 million words of written British English, published between 2010 and 2018, from a wide variety of genres. The corpus will be publicly released in 2019. This thesis presents a detailed account of the design and compilation of the corpus, focusing on the many challenges that had to be overcome to create it and on the solutions devised for them. It also demonstrates the utility of the corpus by presenting a diachronic comparison of academic writing in the 1990s and 2010s, with a focus on the theory of colloquialisation. This thesis, whilst not a Written BNC2014 user guide, presents all of the decisions made in the design and creation of the corpus and, as such, will help to make the corpus as useful to as many people, for as many purposes, as possible.