Ontology-Based Conceptual Structures and Lexical Description
In this paper, I suggest a way of mapping the Mikrokosmos ontology, under development at the Computing Research Lab (CRL) of New Mexico State University, onto lexical items. Previous extensions of Korean lexical meaning classifications have relied largely on noun classifications and conceptual considerations. Those approaches, however, have proven insufficient for Korean, because its lexical meaning classifications rarely align with conceptual structures. Along the same lines, simple meaning classification by itself contributes to neither theoretical linguistics nor natural language processing. To resolve this problem, I introduce a language-independent conceptual structure, the Mikrokosmos ontology, which contains about 5,000 concepts, each encapsulating various lexical information in a frame, and I show how to map each individual sense of a word onto the corresponding conceptual structure. Applying this approach to Korean is an even greater challenge because Korean is a highly polysemous language. The ontology-based conceptual structure and its mapping to lexical items allow us to disambiguate polysemous words and arrive at a unified way of building lexicons.
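A minimal sketch of the sense-to-concept mapping idea described above, in Python. The concept names, slots, and sense entries are illustrative assumptions; the real Mikrokosmos frames are far richer.

```python
# Each ontology concept is a frame: a named node with typed slots.
# These two toy frames are made-up stand-ins for Mikrokosmos concepts.
ONTOLOGY = {
    "INGEST": {"is-a": "PHYSICAL-EVENT", "agent": "ANIMAL", "theme": "FOOD"},
    "LEARN":  {"is-a": "MENTAL-EVENT",   "agent": "HUMAN",  "theme": "INFORMATION"},
}

# A polysemous Korean verb mapped sense-by-sense onto concepts.
SENSE_MAP = {
    "먹다": [
        {"sense": "eat",                 "concept": "INGEST"},
        {"sense": "take in (a lesson)",  "concept": "LEARN"},
    ],
}

def candidate_frames(word):
    """Return (sense, frame) pairs for every sense of `word`."""
    return [(s["sense"], ONTOLOGY[s["concept"]]) for s in SENSE_MAP.get(word, [])]

# Disambiguation then amounts to choosing the frame whose slot constraints
# best match the context.
for sense, frame in candidate_frames("먹다"):
    print(sense, "->", frame)
```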
Some considerations on the analysis of linguistic data based on statistics
Much work has been done on statistical text analysis, but in many cases statistical methods have been applied without any consideration of the distributional characteristics of texts. The asymptotic normality assumptions underlying some statistical methods, for example, have proven inappropriate for corpus-based work, especially when rare events make up a large fraction of the data. This paper deals with the statistical foundations of quantitative text analysis and suggests that statistical methods be chosen according to the characteristics of the text, and that linguistic interpretation of the statistical results is still required to compensate for statistical limitations. The paper also describes some widely used statistical methods, such as the t-test, chi-square, likelihood ratio, and mutual information, and points out the characteristics of each.
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (Ministry of Education and Human Resources Development) (2002-074-AM1534).
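A minimal worked example of the association measures the paper discusses, applied to a single bigram. The counts are made-up toy numbers; the point is how the same 2x2 contingency table feeds chi-square, the likelihood ratio, and pointwise mutual information, and why sparse counts matter.

```python
import math

N   = 1_000_000   # corpus size (toy value)
c1  = 3_000       # count of w1
c2  = 2_000       # count of w2
c12 = 150         # count of the bigram (w1, w2)

# 2x2 contingency table of observed counts.
obs = [[c12,      c1 - c12],
       [c2 - c12, N - c1 - c2 + c12]]
row = [sum(r) for r in obs]
col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
exp = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

# Pearson's chi-square: unreliable when expected counts are tiny,
# which is exactly the rare-event situation the paper warns about.
chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))

# Log-likelihood ratio (Dunning 1993): better behaved on sparse data.
llr = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
              for i in range(2) for j in range(2) if obs[i][j] > 0)

# Pointwise mutual information of the bigram.
pmi = math.log2(c12 * N / (c1 * c2))

print(f"chi2={chi2:.1f}  llr={llr:.1f}  pmi={pmi:.2f}")
```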
Automatic Product Review Helpfulness Estimation based on Review Information Types
The sheer volume of online product reviews makes it nearly impossible for consumers to locate the helpful ones. This study lays the groundwork for automatically evaluating the helpfulness of online product reviews. We propose Review Information Types, a categorization of review sentences by the target of the information they describe, and a method that scores each review's helpfulness according to the information it provides: within each information type, sentences are converted to topic vectors and clustered to extract the specific information each sentence conveys, which also reduces the semantic space of review sentences. The underlying assumption is that prospective consumers find information about the product itself or the reviewer's experience more helpful than peripheral information such as shipping or customer service. Review-ranking experiments show that the proposed method is more effective than comparable previous approaches.
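A minimal sketch of the two-stage pipeline described above, with TF-IDF vectors standing in for the paper's topic-based sentence representation. The information-type labels, sentences, and type weighting are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy review sentences, pre-labeled with an information type.
sentences = [
    ("PRODUCT",    "The battery easily lasts two days."),
    ("PRODUCT",    "The screen is sharp and bright."),
    ("EXPERIENCE", "I have used it daily for a month."),
    ("SHIPPING",   "Delivery took only two days."),
    ("SHIPPING",   "The box arrived dented."),
]

# Step 1: group sentences by information type.
by_type = {}
for info_type, sent in sentences:
    by_type.setdefault(info_type, []).append(sent)

# Step 2: within each type, embed the sentences and cluster them to find
# what specific information each one provides.
for info_type, sents in by_type.items():
    vecs = TfidfVectorizer().fit_transform(sents)
    k = min(2, len(sents))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs)
    print(info_type, list(zip(labels, sents)))

# Step 3 (not shown): rank reviews by the weight their sentence types carry,
# e.g. PRODUCT and EXPERIENCE above SHIPPING, per the paper's assumption.
```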
Modality-based Sentiment Analysis through the Utilization of the Korean Sentiment Analysis Corpus
This study develops a practical application of language resources from the Korean Sentiment Analysis Corpus (KOSAC) for sentiment analysis research. Based on the sentiment properties and probabilistic factors of the annotated expressions in KOSAC, we extracted those expressions and refined them into a resource for sentiment analysis research. The study attempts to break away from the simple calculation methods, dependent on the distribution of lexical polarity items, seen in previous research. To perform more sophisticated sentiment analysis, we also introduce pragmatic information, including modality: we cataloged expressions carrying pragmatic information about the speaker's attitude, based on their relative probability in KOSAC. We then demonstrate a practical application of this new language resource to subjectivity analysis, where it yields an accuracy improvement of around 6%. This shows clearly that, in addition to polarity items, a variety of other aspects and lexical information need to be taken into account in this type of research. Moreover, extracting sentiment expressions according to their semantic and pragmatic properties not only shows an additional use of KOSAC but also establishes a new resource for the field of sentiment analysis.
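A minimal sketch of refining annotated expressions into a lexicon by their relative probability, as described above. The simplified (expression, label, count) summary and the threshold are assumptions; the real KOSAC annotation scheme is much richer.

```python
from collections import defaultdict

# Toy per-expression annotation tallies (expression, label, count).
annotations = [
    ("좋다",     "POS",         45), ("좋다",     "NEG",  5),
    ("-것 같다", "SPECULATION", 30), ("-것 같다", "NONE", 10),
]

totals = defaultdict(int)
for expr, _, n in annotations:
    totals[expr] += n

THRESHOLD = 0.8  # keep an expression only if one label clearly dominates
lexicon = {expr: label
           for expr, label, n in annotations
           if label != "NONE" and n / totals[expr] >= THRESHOLD}
print(lexicon)  # {'좋다': 'POS'} -- '-것 같다' at 0.75 falls below the cut
```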
Stance Classification of Online Debate Texts based on Discourse Relations
Recently, demand has been increasing for the analysis of mass opinion in online text data. In particular, many studies have focused on automatically recognizing the main idea of subjective, argumentative writing, and automating this task is fast becoming indispensable. This study constructed a dataset of Korean debate texts on certain political issues and attempted to identify the stance each text takes on a given topic. We collected words that support one stance over the other and used them as features for a machine learning classifier, together with a dictionary of sentiment words annotated for polarity. We then weighted each sub-unit of the text according to the relevant discourse relations. The classifier showed a slight improvement when using the defined weights.
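A minimal sketch of weighting text sub-units by discourse relation before scoring stance features. The relations, weights, and stance words below are illustrative assumptions, not the paper's actual values.

```python
# Hypothetical discourse-relation weights: segments introduced by contrastive
# connectives ("그러나/하지만") often carry the writer's real stance.
RELATION_WEIGHTS = {"CONTRAST": 1.5, "CAUSE": 1.2, "ELABORATION": 1.0}

# Hypothetical polarity-annotated stance words.
STANCE_WORDS = {"찬성": +1, "반대": -1, "필요하다": +1, "위험하다": -1}

def stance_score(segments):
    """segments: list of (discourse_relation, text) pairs for one debate post."""
    score = 0.0
    for relation, text in segments:
        w = RELATION_WEIGHTS.get(relation, 1.0)
        for token in text.split():
            score += w * STANCE_WORDS.get(token, 0)
    return score  # > 0: pro, < 0: con

post = [("ELABORATION", "규제가 필요하다 는 주장이 있다"),
        ("CONTRAST",    "그러나 규제는 위험하다")]
print("PRO" if stance_score(post) > 0 else "CON")  # CON: 1.0 - 1.5 < 0
```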
The Analysis of the Impact of Tokenization of Korean Pre-trained Model on Sentence Embedding
The pre-trained models leading the field of natural language processing these days perform tokenization that does not consider linguistic units, such as Byte-Pair Encoding, WordPiece, or SentencePiece. While these methods alleviate the out-of-vocabulary (OOV) problem, they generate many tokens that have lost their lexical meaning by splitting words into smaller units. This paper analyzes how these tokens affect sentence embedding and points out the resulting limitations of the pre-trained models' tokenization. To this end, the study conducts an experiment to determine how tokens interact with sentence embedding depending on whether they preserve their semantics. The interaction between tokens and sentence embedding is measured by the Self-Similarity and Intra-Similarity measures proposed by Ethayarajh (2019). The study finds that tokens without semantics show both low Self-Similarity and low Intra-Similarity, while the other tokens reach a high level on both indicators. Through analysis of the word embedding layer and the self-attention layer, the study concludes that such meaningless tokens introduce bias into sentence embedding, a problem the pre-trained models inevitably suffer from as long as they retain their existing tokenization.
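A minimal sketch of the two measures from Ethayarajh (2019) used above. Self-Similarity is the mean pairwise cosine between the contextual embeddings of one token type across its occurrences; Intra-Similarity is the mean cosine between each token embedding in a sentence and the sentence embedding. Random vectors stand in for a model layer's hidden states.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def self_similarity(reps):
    """Mean pairwise cosine between one token type's contextual embeddings
    across its occurrences (rows of `reps`)."""
    n = len(reps)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(reps[i], reps[j]) for i, j in pairs) / len(pairs)

def intra_similarity(sent_reps):
    """Mean cosine between each token embedding in one sentence and the
    sentence embedding (here, the mean of the token embeddings)."""
    sent_vec = sent_reps.mean(axis=0)
    return sum(cos(t, sent_vec) for t in sent_reps) / len(sent_reps)

# Toy inputs: in the paper's setting these come from a pre-trained model.
rng = np.random.default_rng(0)
occurrences = rng.normal(size=(5, 768))   # one token type, 5 contexts
sentence    = rng.normal(size=(12, 768))  # one sentence, 12 tokens
print(self_similarity(occurrences), intra_similarity(sentence))
```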
Analysis on the Elements of the Decision Boundaries of Linguistic Acceptability in Language Model using Affinity Prober
Recently, many studies have endeavored to reveal the intrinsic nature of BERT, since transformer-based language models have achieved the state of the art in many natural language understanding tasks. However, it is still hard to probe BERT's universal linguistic properties, since different probing methods lead to fluctuating results across tasks and models. This paper therefore proposes Affinity Prober, a flexible task- and model-agnostic probing method for investigating the decision boundaries of transformer-based language models when they process linguistic phenomena. Affinity Prober is designed to uncover potential linguistic knowledge in the self-attention mechanism. Using Affinity Prober, the study examines whether a bert-base-cased model has explainable decision boundaries for sentence acceptability in terms of lexical category in English. The results show that for syntactic phenomena the affinity relationship between function words is reinforced in the upper layers, while for semantic phenomena the affinity relationship between content words plays the key role. The study concludes that Affinity Prober is helpful for analyzing a model's decision boundaries with respect to lexical categories in specific linguistic phenomena.
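A minimal sketch of reading per-layer self-attention between two token positions, the raw signal a probe of this kind builds on. The head-averaging and symmetrization used to form an "affinity" score here are simplifying assumptions, not the paper's actual procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_attentions=True)
model.eval()

sent = "The cat that the dog chased ran away."
inputs = tok(sent, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
i, j = 2, 5  # positions of two tokens of interest (after tokenization)
for layer, att in enumerate(out.attentions):
    a = att[0].mean(dim=0)                     # average over heads
    affinity = 0.5 * (a[i, j] + a[j, i]).item()  # symmetrize both directions
    print(f"layer {layer:2d}: affinity({i},{j}) = {affinity:.4f}")
```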
Contract Eligibility Verification Enhanced by Keyword and Contextual Embeddings
Recently, automated AI-based analysis has been in demand for processing legal documents, including contracts, in large volumes, quickly and accurately. A contract's eligibility can be verified by checking whether it contains all the essential clauses and whether any clause is unfavorable to one party. The clauses that make up a contract tend to be highly formulaic and repetitive regardless of the contract type. Exploiting this property, we built a clause-level classification model: we constructed keyword embeddings based on the customary requirements of contracts and combined them with BERT embeddings, where the BERT model is a Korean pre-trained model fine-tuned on legal-domain documents. The clause classification results are strong, with accuracies of 90.57 and 90.64 and F1 scores of 93.27 and 93.26, and by predicting which essential clause each clause of a contract corresponds to, the model can verify the contract's eligibility.
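A minimal sketch of combining keyword features with contextual embeddings for clause classification. The model name (klue/bert-base, a public Korean BERT) and the keyword list are placeholders; the paper's model is a Korean BERT fine-tuned on legal-domain documents.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("klue/bert-base")
bert = AutoModel.from_pretrained("klue/bert-base")
bert.eval()

# Hypothetical cues drawn from customary contract requirements.
KEYWORDS = ["손해배상", "계약기간", "해지", "비밀유지"]

def clause_vector(clause: str) -> torch.Tensor:
    """Concatenate the [CLS] embedding with binary keyword-hit features."""
    inputs = tok(clause, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls = bert(**inputs).last_hidden_state[0, 0]          # [CLS] embedding
    kw = torch.tensor([float(k in clause) for k in KEYWORDS])  # keyword hits
    return torch.cat([cls, kw])  # fed to a clause-type classifier head

vec = clause_vector("계약기간은 계약 체결일로부터 1년으로 한다.")
print(vec.shape)  # hidden_size + len(KEYWORDS)
```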
A Small-Scale Korean-Specific BERT Language Model
Recent models for sentence embedding rely on huge corpora and parameter counts; they require massive data and large hardware, and pre-training them takes extensive time. This tendency raises the need for a model that achieves comparable performance while using training data economically, even at a smaller scale. In this study, we built syllable-level and sub-character-level (jamo) Korean dictionaries and introduced sub-character-level training and a BidirectionalWordPiece tokenizer, proposing the Korean-specific model KR-BERT. As a result, using one-tenth the training data of existing models and an appropriately sized vocabulary, KR-BERT performs comparably to, and sometimes better than, other existing pre-trained models with fewer parameters and less computation. This demonstrates that when building a model for a morphologically complex, low-resource language with its own writing system, such as Korean, language-specific linguistic phenomena must be reflected: the sub-character level and the BidirectionalWordPiece tokenizer capture phenomena that the Multilingual BERT model missed.
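A minimal sketch of the sub-character (jamo) decomposition that underlies a sub-character-level Korean vocabulary, using standard Unicode arithmetic for precomposed Hangul syllables; the BidirectionalWordPiece tokenizer itself is not reproduced here.

```python
CHO  = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medial vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none

def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                    # a precomposed syllable
            out.append(CHO[code // 588])         # 588 = 21 vowels * 28 finals
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28])
        else:
            out.append(ch)                       # pass through everything else
    return "".join(out)

print(to_jamo("한국어"))  # ㅎㅏㄴㄱㅜㄱㅇㅓ (final-less syllables add "")
```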
