thesis

A Mathematical Measurement For Korean Text Mining and Its Application

Abstract

Department of Mathematical Sciences

In modern society we are buried beneath an overwhelming amount of text data on the internet, and we are less inclined to simply surf the web and pass the time. To address this problem, and especially to grasp the essential part of the text we are presented with, there have been numerous studies on the relationship between text data and the ease of perceiving the text's meaning. However, most of these studies focused on English text. Because they did not take the linguistic characteristics of other languages into account, the same methods are not suitable for Korean text. A dedicated method that exploits the characteristics of Korean is required to analyze Korean text data. We therefore propose a new framework for Korean text mining, applicable to various kinds of text, based on proper mathematical measurements. The framework consists of three parts: 1) text summarization, 2) text clustering, and 3) relational text learning.

Text summarization is the task of extracting the essential sentences from a text. As measures of importance, we propose specific formulas that focus on the characteristics of Korean. These formulas provide the input features for a fuzzy summarization system. However, this method has a significant defect for large data sets: the number of summarized sentences increases with the word count of the text. To solve this, we propose using text clustering. This field has been studied for a long time and involves a tradeoff between accuracy and speed. Considering the syllable features of Asian languages, we designed the "Syllable Vector" as a new measurement. Applied to text clustering, it has shown remarkable performance, achieving both high accuracy and high speed by effectively reducing dimensionality.

Thirdly, we considered the relational features of text data. The concepts above deal with each document by itself. That is, they treat documents as independent of one another, ignoring the relations between them.
To handle these relations, we designed a new architecture for text learning using neural networks (NN). The most remarkable recent work in natural language processing (NLP) is "word2vec", which is built on artificial neural networks. Our proposed model has a bipartite-layer learning structure that uses meta-information linking text data, with a focus on citation relationships. This structure reflects the latent topic of a text by using the cited information. It can address the shortcomings of conventional systems based on the term-document matrix.
