2,644 research outputs found

    Investigating an Effective Character-level Embedding in Korean Sentence Classification

    Get PDF

    A Survey on Awesome Korean NLP Datasets

    Full text link
    English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, a researcher still needs to train and validate models on Korean datasets to produce a technology or product suitable for Korean language processing. This paper introduces 15 popular Korean NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide high-resolution instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid overview. Comment: 11 pages, 1 horizontal page for a large table

    음성언어 이해에서의 중의성 해소 (Ambiguity Resolution in Spoken Language Understanding)

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: Nam Soo Kim. Ambiguity in language is inevitable: although language is a means of communication, the exact concept one person has in mind can never be conveyed to another in a perfectly identical form. While unavoidable, ambiguity in language understanding often leads to the breakdown or failure of communication. Ambiguity exists at various levels of language, but not all of it needs to be resolved. Each domain and task exhibits different aspects of ambiguity, and it is crucial to recognize which kinds of ambiguity can be well defined and resolved, and then to draw the boundary between the ambiguous cases. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve them. Although the phenomenon occurs in various languages, its degree and aspects depend on the language investigated. We focus on cases where the ambiguity comes from the gap between the amount of information carried by spoken language and by its written form. Here, we study Korean, in which sentence form and intention often differ depending on prosody. In Korean, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, and similar phenomena. Given that such utterances can be problematic for intention understanding, we first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences. In constructing the corpus, we consider the directivity and rhetoricalness of each sentence; together they form the criterion for classifying the intention of spoken language into statement, question, command, rhetorical question, and rhetorical command. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show that colloquial-corpus-based language models are effective in classifying ambiguous text when only textual data is given, and we qualitatively analyze the characteristics of the task. We do not handle ambiguity at the text level only: to find out whether actual disambiguation is possible given a speech input, we design an artificial spoken-language corpus composed solely of textually ambiguous utterances and resolve the ambiguity with various attention-based neural network architectures.
In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys attention information to the text module in a multi-hop manner. Finally, assuming that the ambiguity of intention understanding has been resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level. By integrating a text-based ambiguity detection module with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with a dialogue manager to make up a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we aim to show that ambiguity resolution for intention understanding in a prosody-sensitive language is achievable and can be exploited at the industry or research level. We hope that this study helps tackle chronic ambiguity issues in other languages and domains, linking linguistic science and engineering approaches. The resources, results, and code used in this research are shared to contribute to further progress in the field.
Table of Contents:
1 Introduction: 1.1 Motivation; 1.2 Research Goal; 1.3 Outline of the Dissertation
2 Related Work: 2.1 Spoken Language Understanding; 2.2 Speech Act and Intention (2.2.1 Performatives and statements; 2.2.2 Illocutionary act and speech act; 2.2.3 Formal semantic approaches); 2.3 Ambiguity of Intention Understanding in Korean (2.3.1 Ambiguities in language; 2.3.2 Speech act and intention understanding in Korean)
3 Ambiguity in Intention Understanding of Spoken Language: 3.1 Intention Understanding and Ambiguity; 3.2 Annotation Protocol (3.2.1 Fragments; 3.2.2 Clear-cut cases; 3.2.3 Intonation-dependent utterances); 3.3 Data Construction (3.3.1 Source scripts; 3.3.2 Agreement; 3.3.3 Augmentation; 3.3.4 Train split); 3.4 Experiments and Results (3.4.1 Models; 3.4.2 Implementation; 3.4.3 Results); 3.5 Findings and Summary (3.5.1 Findings; 3.5.2 Summary)
4 Disambiguation of Speech Intention: 4.1 Ambiguity Resolution (4.1.1 Prosody and syntax; 4.1.2 Disambiguation with prosody; 4.1.3 Approaches in SLU); 4.2 Dataset Construction (4.2.1 Script generation; 4.2.2 Label tagging; 4.2.3 Recording); 4.3 Experiments and Results (4.3.1 Models; 4.3.2 Results); 4.4 Summary
5 System Integration and Application: 5.1 System Integration for Intention Identification (5.1.1 Proof of concept; 5.1.2 Preliminary study); 5.2 Application to Spoken Dialogue System (5.2.1 What is 'Free-running'; 5.2.2 Omakase chatbot); 5.3 Beyond Monolingual Approaches (5.3.1 Spoken language translation; 5.3.2 Dataset; 5.3.3 Analysis; 5.3.4 Discussion); 5.4 Summary
6 Conclusion and Future Work
Bibliography; Abstract (In Korean); Acknowledgment
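The multi-hop co-attention behavior described in the abstract above can be pictured with a minimal sketch. This is a simplified stand-in rather than the dissertation's actual architecture, and all module, dimension, and class names are hypothetical:

```python
# Minimal sketch of audio-text co-attention for intention disambiguation.
# Hypothetical names and sizes; not the dissertation's released code.
import torch
import torch.nn as nn

class MultiHopCoAttention(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, hidden=256, num_classes=5):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # Hop 1: audio frames attend over text tokens.
        self.audio_to_text = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Hop 2: text tokens, conditioned on the hop-1 summary, attend over audio frames.
        self.text_to_audio = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats: (batch, text_len, text_dim); audio_feats: (batch, audio_len, audio_dim)
        t = self.text_proj(text_feats)
        a = self.audio_proj(audio_feats)
        a_ctx, _ = self.audio_to_text(query=a, key=t, value=t)
        audio_summary = a_ctx.mean(dim=1, keepdim=True)          # (batch, 1, hidden)
        # The audio side hands its attention summary over to the text queries.
        t_ctx, _ = self.text_to_audio(query=t + audio_summary, key=a, value=a)
        pooled = torch.cat([t_ctx.mean(dim=1), a_ctx.mean(dim=1)], dim=-1)
        return self.classifier(pooled)   # logits over the 5 intention classes
```

Here the audio branch first summarizes what it attends to in the text, and that summary conditions the text branch's attention over the audio frames, which is one way to read "conveying attention information to the text module in a multi-hop manner".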

    한국어 사전학습모델 구축과 확장 연구: 감정분석을 중심으로 (A Study on Constructing and Extending Korean Pre-trained Language Models: Focusing on Sentiment Analysis)

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Humanities, Department of Linguistics, February 2021. Advisor: Hyopil Shin. Recently, as interest in the Bidirectional Encoder Representations from Transformers (BERT) model has increased, many studies based on the model have been actively conducted in Natural Language Processing. Such sentence-level contextualized embedding models are generally known to capture and model lexical, syntactic, and semantic information in sentences during training. Therefore, such models, including ELMo, GPT, and BERT, function as universal models that can perform a wide range of NLP tasks impressively well. This study proposes a monolingual BERT model trained on Korean texts. The first released BERT model that could handle Korean was Google Research's multilingual BERT (M-BERT), which was constructed with training data and a vocabulary covering 104 languages, including Korean and English, and can handle text in any of those languages within a single model. However, despite the advantages of multilingualism, this model does not fully reflect the characteristics of each language, so its text processing performance in each language is lower than that of a monolingual model. To mitigate those shortcomings, we built monolingual models using training data and a vocabulary organized to better capture the linguistic knowledge in Korean texts. In this study, a model named KR-BERT was built on training data composed of Korean Wikipedia text and news articles, and was released through GitHub so that it can be used for processing Korean texts. Additionally, we trained a KR-BERT-MEDIUM model on expanded data created by adding comments and legal texts to the training data of KR-BERT. Each model used as its vocabulary a list of tokens composed mainly of Hangul characters, organized with the WordPiece algorithm on the corresponding training data. These models reported competent performances on various Korean NLP tasks such as Named Entity Recognition, Question Answering, Semantic Textual Similarity, and Sentiment Analysis. In addition, we added sentiment features to the BERT model to specialize it for sentiment analysis. We constructed a sentiment-combined model whose features consist of polarity and intensity values assigned to each token in the training data, following the annotation scheme of the Korean Sentiment Analysis Corpus (KOSAC). The sentiment features assigned to each token compose polarity and intensity embeddings, which are infused into the basic BERT input embeddings, and the sentiment-combined model is constructed by training the BERT model with these embeddings. We trained a model named KR-BERT-KOSAC that contains sentiment features while maintaining the same training data, vocabulary, and model configuration as KR-BERT, and distributed it through GitHub. We then analyzed the effect of the sentiment features in comparison with KR-BERT by observing performance in language modeling during training and in sentiment analysis tasks. Additionally, we determined how much each of the polarity and intensity features contributes to improving model performance by separately organizing models that use each feature on its own. Using both sentiment features yielded some increase in language modeling and sentiment analysis performance compared to other models with different feature compositions.
Here, the sentiment analysis tasks included binary positivity classification of movie reviews and hate speech detection on offensive comments. On the other hand, training such embedding models requires a great deal of training time and hardware resources. Therefore, this study also proposes a simple model fusing method that requires relatively little time. We trained a smaller-scale sentiment-combined model, consisting of fewer encoder layers, fewer attention heads, and smaller hidden sizes, for only a few steps, and combined it with an existing pre-trained BERT model. Since the pre-trained model is expected to function universally across various NLP problems thanks to good language modeling, this combination allows two models with different advantages to interact and yields better text processing capability. Experiments on sentiment analysis problems confirmed that combining the two models is efficient in training time and hardware resource usage, while producing more accurate predictions than single models that do not include sentiment features.
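The sentiment-combined model described above adds KOSAC-derived polarity and intensity values to each token and infuses them into the BERT input embeddings. A minimal sketch of that idea follows; the vocabulary size, label counts, and module names are hypothetical and this is not the released KR-BERT-KOSAC code:

```python
# Minimal sketch of infusing token-level sentiment features into BERT-style
# input embeddings. Hypothetical names and sizes; not the thesis implementation.
import torch
import torch.nn as nn

class SentimentInfusedEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_len=512,
                 num_polarity=5, num_intensity=5):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(2, hidden)
        # One prior sentiment label per token, e.g. derived from KOSAC-style annotation.
        self.polarity = nn.Embedding(num_polarity, hidden)
        self.intensity = nn.Embedding(num_intensity, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, segment_ids, polarity_ids, intensity_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        embeddings = (self.token(input_ids)
                      + self.position(positions)[None, :, :]
                      + self.segment(segment_ids)
                      + self.polarity(polarity_ids)    # sentiment features are simply
                      + self.intensity(intensity_ids)) # summed into the input embeddings
        return self.dropout(self.norm(embeddings))
```

The design choice here mirrors how BERT already sums token, position, and segment embeddings, so the sentiment signals ride along without changing the encoder itself.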
๊ฐ์ • ์ž์งˆ์„ ํฌํ•จํ•˜์—ฌ ๋ณ„๋„์˜ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผฐ๋Š”๋ฐ, ์ด๋•Œ ๊ฐ์ • ์ž์งˆ์€ ๋ฌธ์žฅ ๋‚ด์˜ ๊ฐ ํ† ํฐ์— ํ•œ๊ตญ์–ด ๊ฐ์ • ๋ถ„์„ ์ฝ”ํผ์Šค (KOSAC)์— ๋Œ€์‘ํ•˜๋Š” ๊ฐ์ • ๊ทน์„ฑ(polarity)๊ณผ ๊ฐ•๋„(intensity) ๊ฐ’์„ ๋ถ€์—ฌํ•œ ๊ฒƒ์ด๋‹ค. ๊ฐ ํ† ํฐ์— ๋ถ€์—ฌ๋œ ์ž์งˆ์€ ๊ทธ ์ž์ฒด๋กœ ๊ทน์„ฑ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ•๋„ ์ž„๋ฒ ๋”ฉ์„ ๊ตฌ์„ฑํ•˜๊ณ , BERT๊ฐ€ ๊ธฐ๋ณธ์œผ๋กœ ํ•˜๋Š” ํ† ํฐ ์ž„๋ฒ ๋”ฉ์— ๋”ํ•ด์ง„๋‹ค. ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ์ž„๋ฒ ๋”ฉ์„ ํ•™์Šตํ•œ ๊ฒƒ์ด ๊ฐ์ • ์ž์งˆ ๋ชจ๋ธ(sentiment-combined model)์ด ๋œ๋‹ค. KR-BERT์™€ ๊ฐ™์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ ๊ตฌ์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ฐ์ • ์ž์งˆ์„ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ์ธ KR-BERT-KOSAC๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ , ์ด๋ฅผ GitHub์„ ํ†ตํ•ด ๋ฐฐํฌํ•˜์˜€๋‹ค. ๋˜ํ•œ ๊ทธ๋กœ๋ถ€ํ„ฐ ํ•™์Šต ๊ณผ์ • ๋‚ด ์–ธ์–ด ๋ชจ๋ธ๋ง๊ณผ ๊ฐ์ • ๋ถ„์„ ๊ณผ์ œ์—์„œ์˜ ์„ฑ๋Šฅ์„ ์–ป์€ ๋’ค KR-BERT์™€ ๋น„๊ตํ•˜์—ฌ ๊ฐ์ • ์ž์งˆ ์ถ”๊ฐ€์˜ ํšจ๊ณผ๋ฅผ ์‚ดํŽด๋ณด์•˜๋‹ค. ๋˜ํ•œ ๊ฐ์ • ์ž์งˆ ์ค‘ ๊ทน์„ฑ๊ณผ ๊ฐ•๋„ ๊ฐ’์„ ๊ฐ๊ฐ ์ ์šฉํ•œ ๋ชจ๋ธ์„ ๋ณ„๋„ ๊ตฌ์„ฑํ•˜์—ฌ ๊ฐ ์ž์งˆ์ด ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์— ์–ผ๋งˆ๋‚˜ ๊ธฐ์—ฌํ•˜๋Š”์ง€๋„ ํ™•์ธํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๊ฐ์ • ์ž์งˆ์„ ๋ชจ๋‘ ์ถ”๊ฐ€ํ•œ ๊ฒฝ์šฐ์—, ๊ทธ๋ ‡์ง€ ์•Š์€ ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์— ๋น„ํ•˜์—ฌ ์–ธ์–ด ๋ชจ๋ธ๋ง์ด๋‚˜ ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ์—์„œ ์„ฑ๋Šฅ์ด ์–ด๋Š ์ •๋„ ํ–ฅ์ƒ๋˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋•Œ ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ๋กœ๋Š” ์˜ํ™”ํ‰์˜ ๊ธ๋ถ€์ • ์—ฌ๋ถ€ ๋ถ„๋ฅ˜์™€ ๋Œ“๊ธ€์˜ ์•…ํ”Œ ์—ฌ๋ถ€ ๋ถ„๋ฅ˜๋ฅผ ํฌํ•จํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์œ„์™€ ๊ฐ™์€ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ์‚ฌ์ „ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ๋งŽ์€ ์‹œ๊ฐ„๊ณผ ํ•˜๋“œ์›จ์–ด ๋“ฑ์˜ ์ž์›์„ ์š”๊ตฌํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋น„๊ต์  ์ ์€ ์‹œ๊ฐ„๊ณผ ์ž์›์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ ๊ฒฐํ•ฉ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์ ์€ ์ˆ˜์˜ ์ธ์ฝ”๋” ๋ ˆ์ด์–ด, ์–ดํ…์…˜ ํ—ค๋“œ, ์ ์€ ์ž„๋ฒ ๋”ฉ ์ฐจ์› ์ˆ˜๋กœ ๊ตฌ์„ฑํ•œ ๊ฐ์ • ์ž์งˆ ๋ชจ๋ธ์„ ์ ์€ ์Šคํ… ์ˆ˜๊นŒ์ง€๋งŒ ํ•™์Šตํ•˜๊ณ , ์ด๋ฅผ ๊ธฐ์กด์— ํฐ ๊ทœ๋ชจ๋กœ ์‚ฌ์ „ํ•™์Šต๋˜์–ด ์žˆ๋Š” ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ๊ณผ ๊ฒฐํ•ฉํ•œ๋‹ค. ๊ธฐ์กด์˜ ์‚ฌ์ „ํ•™์Šต๋ชจ๋ธ์—๋Š” ์ถฉ๋ถ„ํ•œ ์–ธ์–ด ๋ชจ๋ธ๋ง์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ์–ธ์–ด ์ฒ˜๋ฆฌ ๋ฌธ์ œ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๋ณดํŽธ์ ์ธ ๊ธฐ๋Šฅ์ด ๊ธฐ๋Œ€๋˜๋ฏ€๋กœ, ์ด๋Ÿฌํ•œ ๊ฒฐํ•ฉ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์žฅ์ ์„ ๊ฐ–๋Š” ๋‘ ๋ชจ๋ธ์ด ์ƒํ˜ธ์ž‘์šฉํ•˜์—ฌ ๋” ์šฐ์ˆ˜ํ•œ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ์„ ๊ฐ–๋„๋ก ํ•  ๊ฒƒ์ด๋‹ค. 
๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ๋“ค์— ๋Œ€ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋ชจ๋ธ์˜ ๊ฒฐํ•ฉ์ด ํ•™์Šต ์‹œ๊ฐ„์— ์žˆ์–ด ํšจ์œจ์ ์ด๋ฉด์„œ๋„, ๊ฐ์ • ์ž์งˆ์„ ๋”ํ•˜์ง€ ์•Š์€ ๋ชจ๋ธ๋ณด๋‹ค ๋” ์ •ํ™•ํ•œ ์˜ˆ์ธก์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.1 Introduction 1 1.1 Objectives 3 1.2 Contribution 9 1.3 Dissertation Structure 10 2 Related Work 13 2.1 Language Modeling and the Attention Mechanism 13 2.2 BERT-based Models 16 2.2.1 BERT and Variation Models 16 2.2.2 Korean-Specific BERT Models 19 2.2.3 Task-Specific BERT Models 22 2.3 Sentiment Analysis 24 2.4 Chapter Summary 30 3 BERT Architecture and Evaluations 33 3.1 Bidirectional Encoder Representations from Transformers (BERT) 33 3.1.1 Transformers and the Multi-Head Self-Attention Mechanism 34 3.1.2 Tokenization and Embeddings of BERT 39 3.1.3 Training and Fine-Tuning BERT 42 3.2 Evaluation of BERT 47 3.2.1 NLP Tasks 47 3.2.2 Metrics 50 3.3 Chapter Summary 52 4 Pre-Training of Korean BERT-based Model 55 4.1 The Need for a Korean Monolingual Model 55 4.2 Pre-Training Korean-specific BERT Model 58 4.3 Chapter Summary 70 5 Performances of Korean-Specific BERT Models 71 5.1 Task Datasets 71 5.1.1 Named Entity Recognition 71 5.1.2 Question Answering 73 5.1.3 Natural Language Inference 74 5.1.4 Semantic Textual Similarity 78 5.1.5 Sentiment Analysis 80 5.2 Experiments 81 5.2.1 Experiment Details 81 5.2.2 Task Results 83 5.3 Chapter Summary 89 6 An Extended Study to Sentiment Analysis 91 6.1 Sentiment Features 91 6.1.1 Sources of Sentiment Features 91 6.1.2 Assigning Prior Sentiment Values 94 6.2 Composition of Sentiment Embeddings 103 6.3 Training the Sentiment-Combined Model 109 6.4 Effect of Sentiment Features 113 6.5 Chapter Summary 121 7 Combining Two BERT Models 123 7.1 External Fusing Method 123 7.2 Experiments and Results 130 7.3 Chapter Summary 135 8 Conclusion 137 8.1 Summary of Contribution and Results 138 8.1.1 Construction of Korean Pre-trained BERT Models 138 8.1.2 Construction of a Sentiment-Combined Model 138 8.1.3 External Fusing of Two Pre-Trained Models to Gain Performance and Cost Advantages 139 8.2 Future Directions and Open Problems 140 8.2.1 More Training of KR-BERT-MEDIUM for Convergence of Performance 140 8.2.2 Observation of Changes Depending on the Domain of Training Data 141 8.2.3 Overlap of Sentiment Features with Linguistic Knowledge that BERT Learns 142 8.2.4 The Specific Process of Sentiment Features Helping the Language Modeling of BERT is Unknown 143 Bibliography 145 Appendices 157 A. Python Sources 157 A.1 Construction of Polarity and Intensity Embeddings 157 A.2 External Fusing of Different Pre-Trained Models 158 B. Examples of Experiment Outputs 162 C. Model Releases through GitHub 165Docto
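The external fusing method mentioned in the abstract (Chapter 7 in the outline above) pairs a large pre-trained encoder with a small, briefly trained sentiment-combined encoder under a shared classifier. A rough sketch of one way this could look, with hypothetical module names rather than the thesis code:

```python
# Minimal sketch of externally fusing a large pre-trained encoder with a small
# sentiment-aware encoder. Hypothetical names; not the thesis implementation.
import torch
import torch.nn as nn

class FusedSentimentClassifier(nn.Module):
    def __init__(self, big_encoder, small_encoder, big_dim=768, small_dim=256, num_classes=2):
        super().__init__()
        self.big = big_encoder      # e.g. a fully pre-trained Korean BERT-style model
        self.small = small_encoder  # a small sentiment-combined model trained for a few steps
        self.classifier = nn.Linear(big_dim + small_dim, num_classes)

    def forward(self, big_inputs, small_inputs):
        # Assumes each encoder returns hidden states of shape (batch, seq_len, dim);
        # the first ([CLS]-style) vector from each encoder serves as the sentence summary.
        h_big = self.big(**big_inputs)[:, 0, :]
        h_small = self.small(**small_inputs)[:, 0, :]
        fused = torch.cat([h_big, h_small], dim=-1)
        return self.classifier(fused)   # e.g. positive/negative or offensive/clean logits
```

The appeal of this arrangement, as the abstract argues, is that only the small encoder and the fused classifier need training, which keeps the time and hardware cost low while still adding sentiment information to a strong general-purpose model.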

    Natural Language Processing: Emerging Neural Approaches and Applications

    Get PDF
    This Special Issue highlights the most recent research being carried out in the NLP field and discusses relevant open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data, in cognitive and neural systems, and on their potential or real applications in different domains.