22 research outputs found

    Combining textual features with sentence embeddings

    Doctoral thesis -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, 2021.8. 박수지.

    The goal of this thesis is to develop a language model for predicting the quality of Korean news articles. The need for article quality prediction has grown with the recent flood of fake news, yet the latest natural language processing techniques have not been applied to the task. To overcome this limitation, the thesis develops an SBERT model that represents sentence meaning and examines whether the linguistic features of articles can improve quality classification. As a result, both a machine learning model using textual features such as readability and cohesion and a transfer learning model using contextual features extracted automatically from SBERT outperformed the deep learning results of previous studies. Specifically, performance improved further when the SBERT training data was expanded and refined, and when textual and contextual features were used together. These findings support the conclusion that linguistic features play an important role in article quality and that SBERT, a recent natural language processing technique, can contribute substantially to extracting and exploiting linguistic features.

    Table of contents:
    1 Introduction
    2 Literature Review
      2.1 Background
        2.1.1 Text Classification
          2.1.1.1 Initial Studies
          2.1.1.2 News Classification
        2.1.2 Text Quality Assessment
      2.2 News Quality Prediction Task
        2.2.1 News Data
          2.2.1.1 Online vs. Offline
          2.2.1.2 Expert-rated vs. User-rated
        2.2.2 Prediction Methods
          2.2.2.1 Manually Engineered Features vs. Automatically Extracted Features
          2.2.2.2 Machine Learning vs. Deep Learning
      2.3 Instruments and Techniques
        2.3.1 Sentence and Document Embeddings
          2.3.1.1 Static Embeddings
          2.3.1.2 Contextual Embeddings
        2.3.2 Fusion Models
      2.4 Summary
    3 Methods
      3.1 Data from Choi, Shin, and Kang (2021)
        3.1.1 News Corpus
        3.1.2 Quality Levels
        3.1.3 Journalism Values
      3.2 Linguistic Features
        3.2.1 Justification of Using Linguistic Features Only
        3.2.2 Two Types of Linguistic Features
          3.2.2.1 Textual Features
          3.2.2.2 Contextual Features
      3.3 Summary
    4 Ordinal Logistic Regression Models with Textual Features
      4.1 Textual Features
        4.1.1 Coh-Metrix
        4.1.2 KOSAC Lexicon
        4.1.3 K-LIWC
        4.1.4 Others
      4.2 Ordinal Logistic Regression
      4.3 Results
        4.3.1 Feature Selection
        4.3.2 Impacts on Quality Evaluation
      4.4 Discussion
        4.4.1 Effect of Cosine Similarity by Issue
        4.4.2 Effect of Quantitative Evidence
        4.4.3 Effect of Sentiment
      4.5 Summary
    5 Deep Transfer Learning Models with Contextual Features
      5.1 Contextual Features from SentenceBERT
        5.1.1 Necessity of Sentence Embeddings
        5.1.2 KR-SBERT
      5.2 Deep Transfer Learning
      5.3 Results
        5.3.1 Measures of Multiclass Classification
        5.3.2 Performances of News Quality Prediction Models
      5.4 Discussion
        5.4.1 Effect of Data Size
        5.4.2 Effect of Data Augmentation
        5.4.3 Effect of Data Refinement
      5.5 Summary
    6 Fusion Models Combining Textual Features with Contextual Sentence Embeddings
      6.1 Model Fusion
        6.1.1 Feature-level Fusion: Concatenation
        6.1.2 Logit-level Fusion: Interpolation
      6.2 Results
        6.2.1 Optimization of the Presentational Attribute Model
        6.2.2 Performances of News Quality Prediction Models
      6.3 Discussion
        6.3.1 Effects of Fusion
        6.3.2 Comparison with Choi et al. (2021)
      6.4 Summary
    7 Conclusion
    References
    A List of Words Used for Textual Feature Extraction
      A.1 Coh-Metrix Features
      A.2 Predicate Type Features
    B Codes Used in Chapter 4
      B.1 Python Code for Textual Feature Extraction
    C Results of VIF test and Brant test
      C.1 VIF Test in R
      C.2 Brant Test in R
    D Codes Used in Chapter 6
      D.1 Python Code for Feature-Level Fusion
      D.2 Python Code for Logit-Level Fusion
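
    The table of contents above names two fusion strategies: feature-level fusion by concatenation and logit-level fusion by interpolation. The thesis's own code (its Appendix D) is not reproduced in this record, so the following is only a minimal sketch of what those two terms commonly mean. The function names, the weight `alpha`, and the reading of "interpolation" as a weighted average of logits are all assumptions made for illustration.

```python
import math


def concat_features(textual, embedding):
    """Feature-level fusion: join both feature vectors into one classifier input."""
    return list(textual) + list(embedding)


def interpolate_logits(logits_a, logits_b, alpha=0.5):
    """Logit-level fusion: weighted average of two models' class logits.

    alpha is a hypothetical mixing weight: 1.0 trusts model A only, 0.0 model B only.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same classes")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(logits_a, logits_b)]


def softmax(logits):
    """Turn logits into class probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy numbers only: two handcrafted features plus a three-dimensional embedding.
fused_input = concat_features([0.7, 0.2], [0.1, -0.4, 0.9])
fused_logits = interpolate_logits([2.0, 0.0], [0.0, 2.0], alpha=0.25)
probabilities = softmax(fused_logits)
```

    Concatenation lets a single classifier weigh both feature types jointly, while interpolation keeps the two models separate and only mixes their outputs, so `alpha` can be tuned on held-out data without retraining either model.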

    Detection of Utterance Serialization and Discourse Segmentation across Twitter Posts and Its Application to Retrieving Tweets That Do Not Contain the Query Term (트위터 게시물간 발화 연쇄·담화 분절 탐지 및 질의어 비포함 트윗 검색에의 활용)

    Master's thesis -- Seoul National University Graduate School: Department of Linguistics, 2014. 8. 신효필.

    This thesis describes a phenomenon in which multiple tweets constitute a single discourse segment, and builds two rule-based models to detect whether two consecutive tweets by the same author convey a single message. Given the length limit of 140 characters, a tweet should be interpreted as an element of a larger unit rather than as an individual document. Treating such a larger unit as a discourse segment and a tweet as an utterance, this study makes the following assumptions based on Centering Theory: (a) A tweet has at most one topic. (b) In non-initial tweets of a discourse segment, a topic word is realized as an anaphor, in particular a zero form in Korean. (c) Coherence between two tweets written by the same author is considered only if there is no tweet between them. (d) In two consecutive tweets, a topic is preferably continued.

    To predict tweet serialization and discourse segmentation, two criteria are used: temporal proximity and discourse markers. Temporal proximity indicates whether the time interval between two tweets is below a threshold, which can be a constant or a user-specific value. Discourse markers are classified into continuation markers and shift markers. Continuation markers include web-specific ones such as '>>', '(continued)', and numbers, and linguistic ones such as conjunctions and referring expressions. Shift markers include web-specific ones such as 'RT' and URLs, and linguistic ones such as interjections and temporal adverbs. The two models treat these factors differently. The Strict Serialization (SS) model regards two tweets as serialized only if their interval is extremely short or they have a continuation marker. In contrast, the Serialization Plus Discourse Segmentation (SPDS) model, following assumption (d) that continuation is preferred to shifting, considers two tweets serialized if their interval is not too long, and terminates a discourse segment only if the current tweet has a shift marker.

    To verify whether the proposed models are useful, an information retrieval task is implemented. Assumption (b) predicts, and the data confirm, that topic words are implicit in some tweets of discourse segments consisting of multiple tweets. The current search system cannot retrieve such tweets and thus fails to satisfy users' need to find diverse opinions on Twitter. Using the discourse segments compiled by the proposed models, the system can retrieve tweets that belong to the same discourse segment as an explicitly relevant one, without retrieving too many irrelevant tweets. Consequently, the proposed models achieve higher mean precision than the Query Matching model and the TF-IDF Weighting model. Furthermore, since the SPDS model outperforms the SS model, the principle that topic continuation is unmarked appears to hold for social media as well. Lastly, this thesis finds that linguistic markers such as interjections, which have typically been treated as stopwords in information retrieval, are useful for detecting discourse segments.

    Table of contents:
    List of Tables
    List of Figures
    1 Introduction
      1.1 Subject
      1.2 Purposes
        1.2.1 Detection of discourse segments in Twitter data
        1.2.2 Retrieval of tweets without an explicit query term
      1.3 Structure
    2 Previous Work
      2.1 General NLP studies
        2.1.1 On social media data
        2.1.2 Using discourse knowledge
      2.2 Task-specific studies
        2.2.1 Finding a proper unit for unstructured short texts
        2.2.2 Discourse markers in Twitter data
        2.2.3 Classification of tweets without an overt topic word
        2.2.4 Summary
    3 Centering Theory
      3.1 Overview
      3.2 Major concepts used in this thesis
        3.2.1 Uniqueness of the backward-looking center
        3.2.2 Highest rank of zero pronouns as centers
        3.2.3 Locality of coherence
        3.2.4 Preference of center continuation
      3.3 Summary
    4 Tweet Serialization and Discourse Segmentation
      4.1 Tweet Serialization
        4.1.1 Phenomenon
        4.1.2 Constraints
      4.2 Discourse segmentation of serialized tweets
        4.2.1 Two strategies for discourse segments detection
        4.2.2 Discourse markers in tweets
        4.2.3 Algorithm of the SPDS model
    5 Retrieval of Implicitly Relevant Tweets
      5.1 Overview: Current search system in Twitter
      5.2 Data
        5.2.1 Description
      5.3 Baselines
        5.3.1 Query matching model
        5.3.2 Tf-idf Weighting model
      5.4 Proposed models
        5.4.1 Strict serialization Model
        5.4.2 Serialization Plus Discourse Segmentation Model
      5.5 Evaluation
        5.5.1 Measures
        5.5.2 Results
      5.6 Discussion
        5.6.1 Proper formalization of temporal proximity
        5.6.2 Effects of discourse markers
    6 Conclusion
    Bibliography
    Abstract in Korean
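
    The two rule-based models described in this abstract can be sketched roughly as follows. This is not the thesis's algorithm: the thresholds (`short_gap`, `max_gap`), the marker checks, and every function name are hypothetical, only a small subset of the web-specific markers is covered, and the linguistic markers (conjunctions, interjections, and so on) are left out. The input is assumed to be one author's tweets in chronological order.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    timestamp: float  # posting time in seconds


def has_continuation_marker(text):
    # Hypothetical subset of the web-specific continuation markers.
    return text.startswith(">>") or "(continued)" in text


def has_shift_marker(text):
    # Hypothetical subset of the web-specific shift markers (retweets, URLs).
    return text.startswith("RT ") or "http://" in text or "https://" in text


def ss_linked(prev, curr, short_gap=60.0):
    """Strict Serialization: link only on a very short gap or a continuation marker."""
    gap = curr.timestamp - prev.timestamp
    return gap <= short_gap or has_continuation_marker(curr.text)


def spds_linked(prev, curr, max_gap=600.0):
    """SPDS: continuation is the default; break on a long gap or a shift marker."""
    gap = curr.timestamp - prev.timestamp
    return gap <= max_gap and not has_shift_marker(curr.text)


def segment(tweets, linked):
    """Group consecutive tweets into discourse segments using a linking rule."""
    segments = []
    for tweet in tweets:
        if segments and linked(segments[-1][-1], tweet):
            segments[-1].append(tweet)
        else:
            segments.append([tweet])
    return segments


timeline = [
    Tweet("Watching the debate tonight.", 0.0),
    Tweet(">> The second speaker was better.", 30.0),
    Tweet("Still thinking about it.", 300.0),
    Tweet("RT @someone: unrelated news", 320.0),
]
print([len(s) for s in segment(timeline, ss_linked)])    # [2, 2]
print([len(s) for s in segment(timeline, spds_linked)])  # [3, 1]
```

    On this toy timeline the SS rule splits at the 270-second gap, while the SPDS rule keeps the third tweet in the segment and splits only at the retweet, which mirrors the preference for continuation stated in assumption (d).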

    A Study on the Conversion of Ginsenosides by Lactic Acid Bacteria (LAB) Cells (Lactic acid bacteria(LAB)세포에 의한 진세노사이드의 전환 연구)

    Master's thesis -- Seoul National University Graduate School: Department of Food and Nutrition, 2012. 2. 지근억.

    Abstract (translated from Korean): Among the components of ginseng, the ones that show medical and biological activity are the ginseng-specific saponins called ginsenosides. The aim of this study was to produce specific deglycosylated ginsenosides by reacting ginsenosides with enzymes of food microorganisms and to characterize them. Major ginsenosides were reacted with enzymes from cells of Leuconostoc and Lactobacilli, which are representative food microorganisms, in an attempt to convert them into minor ginsenosides, and the optimal conversion conditions were sought. Among the strains tested, Leuconostoc mesenteroides KFRI 690, Leuconostoc paramesenteroides KFRI 159, and Lactobacillus delbrueckii KCCM 35486 were confirmed to produce compound K through biotransformation. In particular, Leu. mesenteroides KFRI 690 showed the highest activity in converting ginsenosides to compound K in aglycone form. Optimization experiments showed that the production efficiency of compound K was highest when the enzyme obtained after culturing with 2% sucrose was used and the conversion was carried out at pH 7.0 and 37 °C. This study is significant as the first to carry out the biotransformation with an enzyme solution from undisrupted lactic acid bacteria, and it appears to have potential economic value for the food industry.

    Abstract (English): Various strains of Lactobacillus and Leuconostoc species were evaluated to select the most promising strain for transforming major ginsenosides into minor ginsenosides. Among the experimental lactic acid bacteria (LAB), Leuconostoc mesenteroides KFRI 690, Leuconostoc paramesenteroides KFRI 159, and Lactobacillus delbrueckii KCCM 35486 produced compound K from major ginsenoside precursors (Rb1, Rc, Rd and F2). KFRI 690 showed the best transforming activity among them. Furthermore, these LABs could biotransform ginsenosides without the cells being disrupted to release enzyme activity. The production yield of compound K using KFRI 690 was enhanced by adding 2% sucrose to the culture medium and incubating at pH 7.0 and 37 °C for 96 h during the transformation reaction. This is the first report on the production of compound K using whole cells of Leuconostoc mesenteroides, Leuconostoc paramesenteroides, and Lactobacillus delbrueckii, which are food-grade lactic acid bacteria.

    X-ray Imaging of wetting ridge on a soft solid


    Interface Dynamics of Soft Systems Studied by X-ray Imaging

    Doctoral thesis.

    Abstract (English): A "soft system" is a system that consists of "soft materials", which are intermediate between crystalline solids and pure liquids. Soft materials are ubiquitous in daily life and in industry: they include biological materials such as cells, blood, and soft tissues, and a wide range of industrial products such as polymer gels, foams, and granular materials. Because soft materials are mechanically weak and easily deformed, their mechanical responses to external and internal stimuli have attracted the attention of many scientists and engineers. In particular, at an interface in a soft system, distinctive static and dynamic behaviors that cannot be understood by the classical physics of ordinary solids and liquids have been observed or expected. However, interface phenomena, and especially interface dynamics, are not fully understood in many soft systems, mostly because of experimental restrictions on observation with conventional optical imaging techniques. In this thesis, the interface dynamics of two soft systems is studied: a droplet on a soft solid and cells in a multicellular organ. X-ray imaging is adapted to directly visualize the interfaces among soft solids, liquid, and vapor, as well as plant cells, in static and dynamic situations.

    Chapter I presents the motivation of this thesis and current issues in interface dynamics research, particularly for soft material systems. To comprehend various interface dynamics, direct visualization of the interface region itself is necessary. The limitations of conventional techniques and the advantages of x-ray imaging techniques are also discussed. Chapter II explains the general contrast mechanisms in x-ray imaging, i.e. absorption-based and phase-contrast-based imaging, and then presents the 3D tomography technique and its reconstruction principle. Chapters III and IV present the main results of this thesis on wetting on soft viscoelastic solids. For over half a century, the central issue in this research has been the accurate geometry of the "wetting ridge", a microscopic protrusion formed by the vertical component of the liquid surface tension at the three-phase contact line. Here, for the first time, direct visualization of the wetting ridge is achieved using a transmission x-ray microscope. In Chapter III, based on accurate measurement of the geometry of the ridge tip, a universal wetting principle is revealed that can describe general static wetting phenomena on various materials over a wide range of elasticity. In Chapter IV, using the high temporal resolution of transmission x-ray microscopy, wetting ridge dynamics in three spreading behaviors is investigated as a pioneering work. The study reveals the existence of two ridge growth stages and then, from real-time movies recorded during spreading, investigates a pinning enhancement mechanism that depends on the ridge geometry and the growth stage. In Chapter V, as a promising application of x-ray imaging to biology, a possible quantitative analysis of plant cell growth in a multicellular organ is suggested using a fast 3D x-ray tomography technique. In particular, the expansion of individual cells in different cell layers of the root is clearly demonstrated. Interestingly, the individual cells show different anisotropic growth patterns that may generate internal tissue tensions or control the growth pattern of the organ. Finally, Chapter VI summarizes the contents of this thesis.

    Abstract (translated from Korean): Soft systems, which include polymers, fluids, and biological soft tissues, have a mechanical strength between that of solids and liquids, and can therefore be deformed in many ways. Understanding the dynamic phenomena that appear at their interfaces is important for physical interpretation and its applications. However, experimental approaches based on visible light made direct observation very difficult because of optical effects such as refraction and reflection at the interface, so an accurate understanding of interfacial phenomena was lacking. X-rays, in contrast, are refracted only very weakly, and the characteristically small refraction at an interface yields phase contrast, which makes them well suited to imaging interfaces. This thesis applies x-ray imaging, which overcomes the limits of visible light, to soft systems in order to improve the understanding of dynamic interfacial phenomena.

    First, wetting on soft solids was studied. Wetting on a soft solid cannot be understood by directly applying Young's law or Neumann's law, which hold for rigid solids and for liquids, because the soft surface is raised at the contact line by the interfacial tension of the liquid and forms a distinctive microstructure called a wetting ridge. Using x-ray imaging, this thesis reveals the basic principle of static wetting on soft solids through the first directly imaged wetting ridge, and further explains the underlying mechanisms of three spreading behaviors on the basis of real-time observation of spreading on soft solids.

    Second, the work aimed to image the three-dimensional growth of plant cells inside a multicellular plant organ. A plant organ is a collection of many cells, and the growth of each cell regulates the growth and movement of the organ. In particular, correctly understanding the three-dimensional behavior of plants requires an understanding of three-dimensional growth patterns at the level of the cell. Visible light, however, penetrates too poorly to image a whole organ and offers too little temporal resolution to observe a growing organ in real time, which has limited the understanding of organ growth at the level of cell growth. In this thesis, the growth of the Arabidopsis root was imaged in three dimensions in real time, and quantitative analysis of the growth of individual cells in each cell layer inside the root confirmed the feasibility and importance of studying patterns of cell volume expansion.

    Modality-based Sentiment Analysis through the Utilization of the Korean Sentiment Analysis Corpus

    This study develops a practical application of language resources from the Korean Sentiment Analysis Corpus (KOSAC) for sentiment analysis research. Based on the sentiment properties and the probabilistic factors of annotated expressions from KOSAC, we extracted annotated expressions and refined them into a sentiment analysis research resource. This study attempted to break away from the simple calculation methods of previous research, which depend on the distribution of lexical polarity items. Additionally, in order to perform more sophisticated sentiment analysis, we attempted to introduce pragmatic information, including modality. To achieve this, we cataloged expressions that carry pragmatic information related to the speaker's attitude, based on their relative probability in KOSAC. This study then shows a practical application of the new language resource to subjectivity analysis. Using this resource improves accuracy by around 6%. This demonstrates that, in addition to polarity items, this type of research needs to include a variety of aspects and lexical information. Moreover, this extraction of sentiment expressions according to their semantic and pragmatic properties not only shows an additional use of KOSAC, but also establishes a new resource in the field of sentiment analysis.

    A Small-Scale Korean-Specific BERT Language Model

    Abstract (translated from Korean): Recent sentence-level embedding models in natural language processing use huge corpora and parameter counts, so they demand large hardware and large amounts of data and take a long time to train. This raises the need for a model that, even without being large, uses training data economically while achieving comparable performance. This study builds Korean vocabularies at the syllable level and at the grapheme (sub-character) level, and introduces sub-character-level training and a bidirectional WordPiece tokenizer. As a result, we were able to implement the KR-BERT model, which uses training data one-tenth the size of that of existing models and a suitably sized vocabulary, and which reduces computation with fewer parameters while achieving similar performance. This confirms that when building a model for a language such as Korean, which has its own writing system, is morphologically complex, and has few resources, linguistic phenomena specific to that language should be reflected.

    Abstract (English): Recent models for sentence embedding use huge corpora and parameter counts. They require massive data and large hardware, and pre-training takes a long time. This tendency raises the need for a model with comparable performance that uses training data economically. In this study, we propose a Korean-specific model, KR-BERT, using Korean vocabularies at the sub-character and character levels and the BidirectionalWordPiece Tokenizer. As a result, our KR-BERT model performs comparably to, and in some cases better than, other existing pre-trained models while using one-tenth the training data of the existing models. This demonstrates that in a morphologically complex, low-resource language, using the sub-character level and the BidirectionalWordPiece Tokenizer captures language-specific linguistic phenomena that the Multilingual BERT model misses.
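
    To make the "sub-character level" in this abstract concrete: a precomposed Hangul syllable can be split into its component letters (jamo) with standard Unicode normalization. The sketch below only illustrates that decomposition with Python's standard `unicodedata` module. Whether KR-BERT preprocesses text in exactly this way is an assumption, the helper names are hypothetical, and the BidirectionalWordPiece tokenizer itself is not shown.

```python
import unicodedata


def to_subcharacters(text):
    """Decompose precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_syllables(text):
    """Recompose conjoining jamo into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


word = "한국어"                    # three syllable characters
jamo = to_subcharacters(word)      # 한 -> 3 jamo, 국 -> 3 jamo, 어 -> 2 jamo
print(len(word), len(jamo))        # 3 8
print(to_syllables(jamo) == word)  # True
```

    A sub-character vocabulary built over such jamo sequences needs far fewer base symbols than one built over the thousands of possible syllable blocks, which is one plausible reason the abstract ties this level to a smaller model.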