60 research outputs found

    ํ•œ๊ตญ์–ด ์‚ฌ์ „ํ•™์Šต๋ชจ๋ธ ๊ตฌ์ถ•๊ณผ ํ™•์žฅ ์—ฐ๊ตฌ: ๊ฐ์ •๋ถ„์„์„ ์ค‘์‹ฌ์œผ๋กœ

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ธ๋ฌธ๋Œ€ํ•™ ์–ธ์–ดํ•™๊ณผ, 2021. 2. ์‹ ํšจํ•„.Recently, as interest in the Bidirectional Encoder Representations from Transformers (BERT) model has increased, many studies have also been actively conducted in Natural Language Processing based on the model. Such sentence-level contextualized embedding models are generally known to capture and model lexical, syntactic, and semantic information in sentences during training. Therefore, such models, including ELMo, GPT, and BERT, function as a universal model that can impressively perform a wide range of NLP tasks. This study proposes a monolingual BERT model trained based on Korean texts. The first released BERT model that can handle the Korean language was Google Researchโ€™s multilingual BERT (M-BERT), which was constructed with training data and a vocabulary composed of 104 languages, including Korean and English, and can handle the text of any language contained in the single model. However, despite the advantages of multilingualism, this model does not fully reflect each languageโ€™s characteristics, so that its text processing performance in each language is lower than that of a monolingual model. While mitigating those shortcomings, we built monolingual models using the training data and a vocabulary organized to better capture Korean textsโ€™ linguistic knowledge. Therefore, in this study, a model named KR-BERT was built using training data composed of Korean Wikipedia text and news articles, and was released through GitHub so that it could be used for processing Korean texts. Additionally, we trained a KR-BERT-MEDIUM model based on expanded data by adding comments and legal texts to the training data of KR-BERT. Each model used a list of tokens composed mainly of Hangul characters as its vocabulary, organized using WordPiece algorithms based on the corresponding training data. 
    These models achieved competitive performance on various Korean NLP tasks such as Named Entity Recognition, Question Answering, Semantic Textual Similarity, and Sentiment Analysis. In addition, we added sentiment features to the BERT model to specialize it for sentiment analysis. We constructed a sentiment-combined model in which the features consist of polarity and intensity values assigned to each token in the training data according to the Korean Sentiment Analysis Corpus (KOSAC). The sentiment features assigned to each token form polarity and intensity embeddings, which are infused into the basic BERT input embeddings, and the sentiment-combined model is obtained by training the BERT model with these embeddings. We trained a model named KR-BERT-KOSAC that contains sentiment features while maintaining the same training data, vocabulary, and model configuration as KR-BERT, and distributed it through GitHub. We then analyzed the effect of the sentiment features by comparing KR-BERT-KOSAC with KR-BERT on language modeling during training and on sentiment analysis tasks. Additionally, we determined how much each of the polarity and intensity features contributes to improving model performance by training a separate model that uses each feature alone. Using both sentiment features yielded some improvement in language modeling and sentiment analysis performance compared to models with other feature compositions. As sentiment analysis tasks, we included binary positivity classification of movie reviews and hate speech detection on offensive comments. On the other hand, training such embedding models requires substantial training time and hardware resources, so this study also proposes a simple model-fusing method that requires relatively little time.
    We trained a smaller-scale sentiment-combined model, consisting of fewer encoder layers, fewer attention heads, and smaller hidden sizes, for a few steps, and combined it with an existing pre-trained BERT model. Since pre-trained models are expected to function universally across various NLP problems on the strength of good language modeling, this combination allows two models with different advantages to interact and yields better text-processing capabilities. Experiments on sentiment analysis problems confirmed that combining the two models is efficient in training time and hardware usage, while producing more accurate predictions than single models that do not include sentiment features.
๋ณธ ์—ฐ๊ตฌ๋Š” ๊ทธ๋Ÿฌํ•œ ๋‹จ์ ๋“ค์„ ์™„ํ™”ํ•˜๋ฉด์„œ ํ…์ŠคํŠธ์— ํฌํ•จ๋˜์–ด ์žˆ๋Š” ์–ธ์–ด ์ •๋ณด๋ฅผ ๋ณด๋‹ค ์ž˜ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ์™€ ์–ดํœ˜ ๋ชฉ๋ก์„ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํ•œ๊ตญ์–ด Wikipedia ํ…์ŠคํŠธ์™€ ๋‰ด์Šค ๊ธฐ์‚ฌ๋กœ ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ KR-BERT ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ•˜๊ณ , ์ด๋ฅผ GitHub์„ ํ†ตํ•ด ๊ณต๊ฐœํ•˜์—ฌ ํ•œ๊ตญ์–ด ์ •๋ณด์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ ํ•ด๋‹น ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ“๊ธ€ ๋ฐ์ดํ„ฐ์™€ ๋ฒ•์กฐ๋ฌธ๊ณผ ํŒ๊ฒฐ๋ฌธ์„ ๋ง๋ถ™์—ฌ ํ™•์žฅํ•œ ํ…์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•ด์„œ ๋‹ค์‹œ KR-BERT-MEDIUM ๋ชจ๋ธ์„ ํ•™์Šตํ•˜์˜€๋‹ค. ์ด ๋ชจ๋ธ์€ ํ•ด๋‹น ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ WordPiece ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•ด ๊ตฌ์„ฑํ•œ ํ•œ๊ธ€ ์ค‘์‹ฌ์˜ ํ† ํฐ ๋ชฉ๋ก์„ ์‚ฌ์ „์œผ๋กœ ์ด์šฉํ•˜์˜€๋‹ค. ์ด๋“ค ๋ชจ๋ธ์€ ๊ฐœ์ฒด๋ช… ์ธ์‹, ์งˆ์˜์‘๋‹ต, ๋ฌธ์žฅ ์œ ์‚ฌ๋„ ํŒ๋‹จ, ๊ฐ์ • ๋ถ„์„ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ฌธ์ œ์— ์ ์šฉ๋˜์–ด ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด๊ณ ํ–ˆ๋‹ค. ๋˜ํ•œ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” BERT ๋ชจ๋ธ์— ๊ฐ์ • ์ž์งˆ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๊ทธ๊ฒƒ์ด ๊ฐ์ • ๋ถ„์„์— ํŠนํ™”๋œ ๋ชจ๋ธ๋กœ์„œ ํ™•์žฅ๋œ ๊ธฐ๋Šฅ์„ ํ•˜๋„๋ก ํ•˜์˜€๋‹ค. ๊ฐ์ • ์ž์งˆ์„ ํฌํ•จํ•˜์—ฌ ๋ณ„๋„์˜ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผฐ๋Š”๋ฐ, ์ด๋•Œ ๊ฐ์ • ์ž์งˆ์€ ๋ฌธ์žฅ ๋‚ด์˜ ๊ฐ ํ† ํฐ์— ํ•œ๊ตญ์–ด ๊ฐ์ • ๋ถ„์„ ์ฝ”ํผ์Šค (KOSAC)์— ๋Œ€์‘ํ•˜๋Š” ๊ฐ์ • ๊ทน์„ฑ(polarity)๊ณผ ๊ฐ•๋„(intensity) ๊ฐ’์„ ๋ถ€์—ฌํ•œ ๊ฒƒ์ด๋‹ค. ๊ฐ ํ† ํฐ์— ๋ถ€์—ฌ๋œ ์ž์งˆ์€ ๊ทธ ์ž์ฒด๋กœ ๊ทน์„ฑ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ•๋„ ์ž„๋ฒ ๋”ฉ์„ ๊ตฌ์„ฑํ•˜๊ณ , BERT๊ฐ€ ๊ธฐ๋ณธ์œผ๋กœ ํ•˜๋Š” ํ† ํฐ ์ž„๋ฒ ๋”ฉ์— ๋”ํ•ด์ง„๋‹ค. ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ์ž„๋ฒ ๋”ฉ์„ ํ•™์Šตํ•œ ๊ฒƒ์ด ๊ฐ์ • ์ž์งˆ ๋ชจ๋ธ(sentiment-combined model)์ด ๋œ๋‹ค. KR-BERT์™€ ๊ฐ™์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ ๊ตฌ์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ฐ์ • ์ž์งˆ์„ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ์ธ KR-BERT-KOSAC๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ , ์ด๋ฅผ GitHub์„ ํ†ตํ•ด ๋ฐฐํฌํ•˜์˜€๋‹ค. 
๋˜ํ•œ ๊ทธ๋กœ๋ถ€ํ„ฐ ํ•™์Šต ๊ณผ์ • ๋‚ด ์–ธ์–ด ๋ชจ๋ธ๋ง๊ณผ ๊ฐ์ • ๋ถ„์„ ๊ณผ์ œ์—์„œ์˜ ์„ฑ๋Šฅ์„ ์–ป์€ ๋’ค KR-BERT์™€ ๋น„๊ตํ•˜์—ฌ ๊ฐ์ • ์ž์งˆ ์ถ”๊ฐ€์˜ ํšจ๊ณผ๋ฅผ ์‚ดํŽด๋ณด์•˜๋‹ค. ๋˜ํ•œ ๊ฐ์ • ์ž์งˆ ์ค‘ ๊ทน์„ฑ๊ณผ ๊ฐ•๋„ ๊ฐ’์„ ๊ฐ๊ฐ ์ ์šฉํ•œ ๋ชจ๋ธ์„ ๋ณ„๋„ ๊ตฌ์„ฑํ•˜์—ฌ ๊ฐ ์ž์งˆ์ด ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์— ์–ผ๋งˆ๋‚˜ ๊ธฐ์—ฌํ•˜๋Š”์ง€๋„ ํ™•์ธํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๊ฐ์ • ์ž์งˆ์„ ๋ชจ๋‘ ์ถ”๊ฐ€ํ•œ ๊ฒฝ์šฐ์—, ๊ทธ๋ ‡์ง€ ์•Š์€ ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์— ๋น„ํ•˜์—ฌ ์–ธ์–ด ๋ชจ๋ธ๋ง์ด๋‚˜ ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ์—์„œ ์„ฑ๋Šฅ์ด ์–ด๋Š ์ •๋„ ํ–ฅ์ƒ๋˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋•Œ ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ๋กœ๋Š” ์˜ํ™”ํ‰์˜ ๊ธ๋ถ€์ • ์—ฌ๋ถ€ ๋ถ„๋ฅ˜์™€ ๋Œ“๊ธ€์˜ ์•…ํ”Œ ์—ฌ๋ถ€ ๋ถ„๋ฅ˜๋ฅผ ํฌํ•จํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์œ„์™€ ๊ฐ™์€ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ์‚ฌ์ „ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ๋งŽ์€ ์‹œ๊ฐ„๊ณผ ํ•˜๋“œ์›จ์–ด ๋“ฑ์˜ ์ž์›์„ ์š”๊ตฌํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋น„๊ต์  ์ ์€ ์‹œ๊ฐ„๊ณผ ์ž์›์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ ๊ฒฐํ•ฉ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์ ์€ ์ˆ˜์˜ ์ธ์ฝ”๋” ๋ ˆ์ด์–ด, ์–ดํ…์…˜ ํ—ค๋“œ, ์ ์€ ์ž„๋ฒ ๋”ฉ ์ฐจ์› ์ˆ˜๋กœ ๊ตฌ์„ฑํ•œ ๊ฐ์ • ์ž์งˆ ๋ชจ๋ธ์„ ์ ์€ ์Šคํ… ์ˆ˜๊นŒ์ง€๋งŒ ํ•™์Šตํ•˜๊ณ , ์ด๋ฅผ ๊ธฐ์กด์— ํฐ ๊ทœ๋ชจ๋กœ ์‚ฌ์ „ํ•™์Šต๋˜์–ด ์žˆ๋Š” ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ๊ณผ ๊ฒฐํ•ฉํ•œ๋‹ค. ๊ธฐ์กด์˜ ์‚ฌ์ „ํ•™์Šต๋ชจ๋ธ์—๋Š” ์ถฉ๋ถ„ํ•œ ์–ธ์–ด ๋ชจ๋ธ๋ง์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ์–ธ์–ด ์ฒ˜๋ฆฌ ๋ฌธ์ œ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๋ณดํŽธ์ ์ธ ๊ธฐ๋Šฅ์ด ๊ธฐ๋Œ€๋˜๋ฏ€๋กœ, ์ด๋Ÿฌํ•œ ๊ฒฐํ•ฉ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์žฅ์ ์„ ๊ฐ–๋Š” ๋‘ ๋ชจ๋ธ์ด ์ƒํ˜ธ์ž‘์šฉํ•˜์—ฌ ๋” ์šฐ์ˆ˜ํ•œ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ์„ ๊ฐ–๋„๋ก ํ•  ๊ฒƒ์ด๋‹ค. 
๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ฐ์ • ๋ถ„์„ ๋ฌธ์ œ๋“ค์— ๋Œ€ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋ชจ๋ธ์˜ ๊ฒฐํ•ฉ์ด ํ•™์Šต ์‹œ๊ฐ„์— ์žˆ์–ด ํšจ์œจ์ ์ด๋ฉด์„œ๋„, ๊ฐ์ • ์ž์งˆ์„ ๋”ํ•˜์ง€ ์•Š์€ ๋ชจ๋ธ๋ณด๋‹ค ๋” ์ •ํ™•ํ•œ ์˜ˆ์ธก์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.1 Introduction 1 1.1 Objectives 3 1.2 Contribution 9 1.3 Dissertation Structure 10 2 Related Work 13 2.1 Language Modeling and the Attention Mechanism 13 2.2 BERT-based Models 16 2.2.1 BERT and Variation Models 16 2.2.2 Korean-Specific BERT Models 19 2.2.3 Task-Specific BERT Models 22 2.3 Sentiment Analysis 24 2.4 Chapter Summary 30 3 BERT Architecture and Evaluations 33 3.1 Bidirectional Encoder Representations from Transformers (BERT) 33 3.1.1 Transformers and the Multi-Head Self-Attention Mechanism 34 3.1.2 Tokenization and Embeddings of BERT 39 3.1.3 Training and Fine-Tuning BERT 42 3.2 Evaluation of BERT 47 3.2.1 NLP Tasks 47 3.2.2 Metrics 50 3.3 Chapter Summary 52 4 Pre-Training of Korean BERT-based Model 55 4.1 The Need for a Korean Monolingual Model 55 4.2 Pre-Training Korean-specific BERT Model 58 4.3 Chapter Summary 70 5 Performances of Korean-Specific BERT Models 71 5.1 Task Datasets 71 5.1.1 Named Entity Recognition 71 5.1.2 Question Answering 73 5.1.3 Natural Language Inference 74 5.1.4 Semantic Textual Similarity 78 5.1.5 Sentiment Analysis 80 5.2 Experiments 81 5.2.1 Experiment Details 81 5.2.2 Task Results 83 5.3 Chapter Summary 89 6 An Extended Study to Sentiment Analysis 91 6.1 Sentiment Features 91 6.1.1 Sources of Sentiment Features 91 6.1.2 Assigning Prior Sentiment Values 94 6.2 Composition of Sentiment Embeddings 103 6.3 Training the Sentiment-Combined Model 109 6.4 Effect of Sentiment Features 113 6.5 Chapter Summary 121 7 Combining Two BERT Models 123 7.1 External Fusing Method 123 7.2 Experiments and Results 130 7.3 Chapter Summary 135 8 Conclusion 137 8.1 Summary of Contribution and Results 138 8.1.1 Construction of Korean Pre-trained BERT Models 138 8.1.2 Construction of a 
Sentiment-Combined Model 138 8.1.3 External Fusing of Two Pre-Trained Models to Gain Performance and Cost Advantages 139 8.2 Future Directions and Open Problems 140 8.2.1 More Training of KR-BERT-MEDIUM for Convergence of Performance 140 8.2.2 Observation of Changes Depending on the Domain of Training Data 141 8.2.3 Overlap of Sentiment Features with Linguistic Knowledge that BERT Learns 142 8.2.4 The Specific Process of Sentiment Features Helping the Language Modeling of BERT is Unknown 143 Bibliography 145 Appendices 157 A. Python Sources 157 A.1 Construction of Polarity and Intensity Embeddings 157 A.2 External Fusing of Different Pre-Trained Models 158 B. Examples of Experiment Outputs 162 C. Model Releases through GitHub 165Docto
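
The polarity- and intensity-embedding infusion described in the abstract can be sketched in a few lines; the dimensions, label-set sizes, and random initialization below are illustrative assumptions, not the released KR-BERT-KOSAC configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN = 100, 16          # toy sizes, far smaller than a real BERT
N_POLARITY, N_INTENSITY = 5, 4   # hypothetical sizes of the KOSAC label sets

token_emb = rng.normal(size=(VOCAB, HIDDEN))
polarity_emb = rng.normal(size=(N_POLARITY, HIDDEN))
intensity_emb = rng.normal(size=(N_INTENSITY, HIDDEN))

def input_embeddings(token_ids, polarity_ids, intensity_ids):
    """Sum token, polarity, and intensity embeddings per position,
    mirroring how BERT already sums token, segment, and position embeddings."""
    return (token_emb[token_ids]
            + polarity_emb[polarity_ids]
            + intensity_emb[intensity_ids])

x = input_embeddings([1, 2, 3], [0, 4, 0], [0, 3, 0])
print(x.shape)  # (3, 16)
```

In the dissertation's setting, such summed embeddings would then be fed to the Transformer encoder and trained end to end.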

    A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units

    We address the design of a unified multilingual system for handwriting recognition. Most multilingual systems rest on specialized models that are trained on a single language, one of which is selected at test time. While some recognition systems are based on a unified optical model, dealing with a unified language model remains a major issue, as traditional language models are generally trained on corpora composed of large word lexicons per language. Here, we bring a solution by considering language models based on sub-lexical units, called multigrams. Dealing with multigrams strongly reduces the lexicon size and thus decreases the language model complexity. This makes possible the design of an end-to-end unified multilingual recognition system where both a single optical model and a single language model are trained on all the languages. We discuss the impact of the language unification on each model and show that our system reaches the performance of state-of-the-art methods with a strong reduction in complexity. Comment: preprint
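
The multigram idea (replacing per-language word lexicons with one shared inventory of variable-length sub-lexical units) can be illustrated with a toy greedy segmenter; the unit inventory and example words below are invented, and real multigram models are estimated statistically rather than fixed by hand.

```python
def segment(word, units, max_len=4):
    """Greedy left-to-right segmentation into known multigrams,
    falling back to single characters for uncovered spans."""
    out, i = [], 0
    while i < len(word):
        for n in range(max_len, 0, -1):
            piece = word[i:i + n]
            if n == 1 or piece in units:
                out.append(piece)
                i += n
                break
    return out

# A single shared unit inventory serves words from several languages.
units = {"re", "co", "gni", "tion", "nais", "ken", "nung"}
for word in ["recognition", "reconnaissance", "erkennung"]:
    print(word, "->", segment(word, units))
```

Because the same small unit set covers words across languages, a single language model over these units can be trained on all languages at once.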

    Script Effects as the Hidden Drive of the Mind, Cognition, and Culture

    This open access volume reveals the hidden power of the script we read in and how it shapes and drives our minds, ways of thinking, and cultures. Expanding on the Linguistic Relativity Hypothesis (i.e., the idea that language affects the way we think), this volume proposes the โ€œScript Relativity Hypothesisโ€ (i.e., the idea that the script in which we read affects the way we think) by offering a unique perspective on the effect of script (alphabets, morphosyllabaries, or multi-scripts) on our attention, perception, and problem-solving. Once we become literate, fundamental changes occur in our brain circuitry to accommodate the new demand for resources. The powerful effects of literacy have been demonstrated by research on literate versus illiterate individuals, as well as cross-scriptal transfer, indicating that literate brain networks function differently, depending on the script being read. This book identifies the locus of differences between the Chinese, Japanese, and Koreans, and between the East and the West, as the neural underpinnings of literacy. To support the โ€œScript Relativity Hypothesisโ€, it reviews a vast corpus of empirical studies, including anthropological accounts of human civilization, social psychology, cognitive psychology, neuropsychology, applied linguistics, second language studies, and cross-cultural communication. It also discusses the impact of reading from screens in the digital age, as well as the impact of bi-script or multi-script use, which is a growing trend around the globe. As a result, our minds, ways of thinking, and cultures are now growing closer together, not farther apart. 
    Examines the origin, emergence, and co-evolution of written language, the human mind, and culture within the purview of script effects; investigates how the scripts we read over time shape our cognition, mind, and thought patterns; provides a new outlook on the four representative writing systems of the world; discusses the consequences of literacy for the functioning of the mind.

    Investigating Multilingual, Multi-script Support in Lucene/Solr Library Applications

    Yale has developed over many years a highly structured, high-quality multilingual catalog of bibliographic data. Almost 50% of the collection represents non-English materials in over 650 languages, and includes many different non-Roman scripts. Faculty, students, researchers, and staff would like to make full use of this original script content for resource discovery. While the underlying textual data are in place, effective indexing, retrieval, and display functionality for the non-Roman script content is not available within our bibliographic discovery applications, Orbis and Yufind. Opportunities now exist in the Unicode and Lucene/Solr computing environment to bridge the functionality gap and achieve internationalization of the Yale Library catalog. While most parts of this study focus on the Yale environment, in the absence of other such studies it is hoped that the findings will be of interest to a much larger community. Funder: Arcadia Foundation.
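
Bridging a functionality gap of this kind typically starts with applying identical Unicode normalization to indexed and queried text; the sketch below uses Python's standard library as a generic illustration, not the actual Lucene/Solr analysis chain the report evaluates.

```python
import unicodedata

def normalize_for_index(text: str) -> str:
    """Canonically compose (NFC) and case-fold text so that visually
    identical strings match regardless of how they were encoded."""
    return unicodedata.normalize("NFC", text).casefold()

# Decomposed conjoining jamo vs. a precomposed Hangul syllable:
# the same rendered syllable, two different code-point sequences.
decomposed = "\u1112\u1161\u11AB"
precomposed = "\uD55C"
assert normalize_for_index(decomposed) == normalize_for_index(precomposed)
print("match after NFC")
```

Applying the same normalization at both index and query time is what lets non-Roman script searches retrieve records regardless of input method.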

    No flexibility in letter position coding in Korean

    Substantial research across Indo-European languages suggests that readers display a degree of uncertainty in letter position coding. For example, readers perceive transposed-letter stimuli, such as jugde, as similar to their base words (e.g., judge). However, tolerance to disruptions of letter order is not apparent in all languages, suggesting that critical aspects of the writing system may shape the nature of position coding. We investigated readers' tolerance to these disruptions in Korean, a writing system characterized by a high degree of orthographic confusability. Results of three Korean masked priming experiments revealed robust identity priming effects, but no indication of priming due to shared letters or syllables in different positions. Two further masked priming experiments revealed where the Korean findings deviate from English. These results support the claim that the nature of the writing system influences the precision of orthographic representations used in reading.
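
Transposed-letter stimuli like jugde are produced mechanically by swapping two adjacent letters; this toy generator is illustrative and is not the authors' stimulus-preparation code.

```python
def transpose(word: str, i: int) -> str:
    """Return a transposed-letter version of `word`, swapping the
    characters at positions i and i+1 (0-based)."""
    if not 0 <= i < len(word) - 1:
        raise ValueError("transposition position out of range")
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(transpose("judge", 2))  # -> jugde
```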

    ์Œ์„ฑ์–ธ์–ด ์ดํ•ด์—์„œ์˜ ์ค‘์˜์„ฑ ํ•ด์†Œ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2022. 8. ๊น€๋‚จ์ˆ˜.์–ธ์–ด์˜ ์ค‘์˜์„ฑ์€ ํ•„์—ฐ์ ์ด๋‹ค. ๊ทธ๊ฒƒ์€ ์–ธ์–ด๊ฐ€ ์˜์‚ฌ ์†Œํ†ต์˜ ์ˆ˜๋‹จ์ด์ง€๋งŒ, ๋ชจ๋“  ์‚ฌ๋žŒ์ด ์ƒ๊ฐํ•˜๋Š” ์–ด๋–ค ๊ฐœ๋…์ด ์™„๋ฒฝํžˆ ๋™์ผํ•˜๊ฒŒ ์ „๋‹ฌ๋  ์ˆ˜ ์—†๋Š” ๊ฒƒ์— ๊ธฐ์ธํ•œ๋‹ค. ์ด๋Š” ํ•„์—ฐ์ ์ธ ์š”์†Œ์ด๊ธฐ๋„ ํ•˜์ง€๋งŒ, ์–ธ์–ด ์ดํ•ด์—์„œ ์ค‘์˜์„ฑ์€ ์ข…์ข… ์˜์‚ฌ ์†Œํ†ต์˜ ๋‹จ์ ˆ์ด๋‚˜ ์‹คํŒจ๋ฅผ ๊ฐ€์ ธ์˜ค๊ธฐ๋„ ํ•œ๋‹ค. ์–ธ์–ด์˜ ์ค‘์˜์„ฑ์—๋Š” ๋‹ค์–‘ํ•œ ์ธต์œ„๊ฐ€ ์กด์žฌํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ๋ชจ๋“  ์ƒํ™ฉ์—์„œ ์ค‘์˜์„ฑ์ด ํ•ด์†Œ๋  ํ•„์š”๋Š” ์—†๋‹ค. ํƒœ์Šคํฌ๋งˆ๋‹ค, ๋„๋ฉ”์ธ๋งˆ๋‹ค ๋‹ค๋ฅธ ์–‘์ƒ์˜ ์ค‘์˜์„ฑ์ด ์กด์žฌํ•˜๋ฉฐ, ์ด๋ฅผ ์ž˜ ์ •์˜ํ•˜๊ณ  ํ•ด์†Œ๋  ์ˆ˜ ์žˆ๋Š” ์ค‘์˜์„ฑ์ž„์„ ํŒŒ์•…ํ•œ ํ›„ ์ค‘์˜์ ์ธ ๋ถ€๋ถ„ ๊ฐ„์˜ ๊ฒฝ๊ณ„๋ฅผ ์ž˜ ์ •ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ๋ณธ๊ณ ์—์„œ๋Š” ์Œ์„ฑ ์–ธ์–ด ์ฒ˜๋ฆฌ, ํŠนํžˆ ์˜๋„ ์ดํ•ด์— ์žˆ์–ด ์–ด๋–ค ์–‘์ƒ์˜ ์ค‘์˜์„ฑ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์•Œ์•„๋ณด๊ณ , ์ด๋ฅผ ํ•ด์†Œํ•˜๊ธฐ ์œ„ํ•œ ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ ๋‹ค์–‘ํ•œ ์–ธ์–ด์—์„œ ๋ฐœ์ƒํ•˜์ง€๋งŒ, ๊ทธ ์ •๋„ ๋ฐ ์–‘์ƒ์€ ์–ธ์–ด์— ๋”ฐ๋ผ์„œ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ์šฐ๋ฆฌ์˜ ์—ฐ๊ตฌ์—์„œ ์ฃผ๋ชฉํ•˜๋Š” ๋ถ€๋ถ„์€, ์Œ์„ฑ ์–ธ์–ด์— ๋‹ด๊ธด ์ •๋ณด๋Ÿ‰๊ณผ ๋ฌธ์ž ์–ธ์–ด์˜ ์ •๋ณด๋Ÿ‰ ์ฐจ์ด๋กœ ์ธํ•ด ์ค‘์˜์„ฑ์ด ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์šฐ๋“ค์ด๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ์šด์œจ(prosody)์— ๋”ฐ๋ผ ๋ฌธ์žฅ ํ˜•์‹ ๋ฐ ์˜๋„๊ฐ€ ๋‹ค๋ฅด๊ฒŒ ํ‘œํ˜„๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์€ ํ•œ๊ตญ์–ด๋ฅผ ๋Œ€์ƒ์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค. ํ•œ๊ตญ์–ด์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๋Šฅ์ด ์žˆ๋Š”(multi-functionalํ•œ) ์ข…๊ฒฐ์–ด๋ฏธ(sentence ender), ๋นˆ๋ฒˆํ•œ ํƒˆ๋ฝ ํ˜„์ƒ(pro-drop), ์˜๋ฌธ์‚ฌ ๊ฐ„์„ญ(wh-intervention) ๋“ฑ์œผ๋กœ ์ธํ•ด, ๊ฐ™์€ ํ…์ŠคํŠธ๊ฐ€ ์—ฌ๋Ÿฌ ์˜๋„๋กœ ์ฝํžˆ๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๊ณค ํ•œ๋‹ค. ์ด๊ฒƒ์ด ์˜๋„ ์ดํ•ด์— ํ˜ผ์„ ์„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ๋ฐ์— ์ฐฉ์•ˆํ•˜์—ฌ, ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ค‘์˜์„ฑ์„ ๋จผ์ € ์ •์˜ํ•˜๊ณ , ์ค‘์˜์ ์ธ ๋ฌธ์žฅ๋“ค์„ ๊ฐ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ง๋ญ‰์น˜๋ฅผ ๊ตฌ์ถ•ํ•œ๋‹ค. 
์˜๋„ ์ดํ•ด๋ฅผ ์œ„ํ•œ ๋ง๋ญ‰์น˜๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ๊ณผ์ •์—์„œ ๋ฌธ์žฅ์˜ ์ง€ํ–ฅ์„ฑ(directivity)๊ณผ ์ˆ˜์‚ฌ์„ฑ(rhetoricalness)์ด ๊ณ ๋ ค๋œ๋‹ค. ์ด๊ฒƒ์€ ์Œ์„ฑ ์–ธ์–ด์˜ ์˜๋„๋ฅผ ์„œ์ˆ , ์งˆ๋ฌธ, ๋ช…๋ น, ์ˆ˜์‚ฌ์˜๋ฌธ๋ฌธ, ๊ทธ๋ฆฌ๊ณ  ์ˆ˜์‚ฌ๋ช…๋ น๋ฌธ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜๊ฒŒ ํ•˜๋Š” ๊ธฐ์ค€์ด ๋œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ธฐ๋ก๋œ ์Œ์„ฑ ์–ธ์–ด(spoken language)๋ฅผ ์ถฉ๋ถ„ํžˆ ๋†’์€ ์ผ์น˜๋„(kappa = 0.85)๋กœ ์ฃผ์„ํ•œ ๋ง๋ญ‰์น˜๋ฅผ ์ด์šฉํ•ด, ์Œ์„ฑ์ด ์ฃผ์–ด์ง€์ง€ ์•Š์€ ์ƒํ™ฉ์—์„œ ์ค‘์˜์ ์ธ ํ…์ŠคํŠธ๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ์— ์–ด๋–ค ์ „๋žต ํ˜น์€ ์–ธ์–ด ๋ชจ๋ธ์ด ํšจ๊ณผ์ ์ธ๊ฐ€๋ฅผ ๋ณด์ด๊ณ , ํ•ด๋‹น ํƒœ์Šคํฌ์˜ ํŠน์ง•์„ ์ •์„ฑ์ ์œผ๋กœ ๋ถ„์„ํ•œ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” ํ…์ŠคํŠธ ์ธต์œ„์—์„œ๋งŒ ์ค‘์˜์„ฑ์— ์ ‘๊ทผํ•˜์ง€ ์•Š๊ณ , ์‹ค์ œ๋กœ ์Œ์„ฑ์ด ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ์ค‘์˜์„ฑ ํ•ด์†Œ(disambiguation)๊ฐ€ ๊ฐ€๋Šฅํ•œ์ง€๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด, ํ…์ŠคํŠธ๊ฐ€ ์ค‘์˜์ ์ธ ๋ฐœํ™”๋“ค๋งŒ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ธ๊ณต์ ์ธ ์Œ์„ฑ ๋ง๋ญ‰์น˜๋ฅผ ์„ค๊ณ„ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์ง‘์ค‘(attention) ๊ธฐ๋ฐ˜ ์‹ ๊ฒฝ๋ง(neural network) ๋ชจ๋ธ๋“ค์„ ์ด์šฉํ•ด ์ค‘์˜์„ฑ์„ ํ•ด์†Œํ•œ๋‹ค. ์ด ๊ณผ์ •์—์„œ ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ํ†ต์‚ฌ์ /์˜๋ฏธ์  ์ค‘์˜์„ฑ ํ•ด์†Œ๊ฐ€ ์–ด๋– ํ•œ ๊ฒฝ์šฐ์— ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ์ง€ ๊ด€์ฐฐํ•˜๊ณ , ์ธ๊ฐ„์˜ ์–ธ์–ด ์ฒ˜๋ฆฌ์™€ ์–ด๋–ค ์—ฐ๊ด€์ด ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ด€์ ์„ ์ œ์‹œํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋งˆ์ง€๋ง‰์œผ๋กœ, ์œ„์™€ ๊ฐ™์€ ์ ˆ์ฐจ๋กœ ์˜๋„ ์ดํ•ด ๊ณผ์ •์—์„œ์˜ ์ค‘์˜์„ฑ์ด ํ•ด์†Œ๋˜์—ˆ์„ ๊ฒฝ์šฐ, ์ด๋ฅผ ์–ด๋–ป๊ฒŒ ์‚ฐ์—…๊ณ„ ํ˜น์€ ์—ฐ๊ตฌ ๋‹จ์—์„œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€์— ๋Œ€ํ•œ ๊ฐ„๋žตํ•œ ๋กœ๋“œ๋งต์„ ์ œ์‹œํ•œ๋‹ค. ํ…์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ ์ค‘์˜์„ฑ ํŒŒ์•…๊ณผ ์Œ์„ฑ ๊ธฐ๋ฐ˜์˜ ์˜๋„ ์ดํ•ด ๋ชจ๋“ˆ์„ ํ†ตํ•ฉํ•œ๋‹ค๋ฉด, ์˜ค๋ฅ˜์˜ ์ „ํŒŒ๋ฅผ ์ค„์ด๋ฉด์„œ๋„ ํšจ์œจ์ ์œผ๋กœ ์ค‘์˜์„ฑ์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” ์‹œ์Šคํ…œ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. 
์ด๋Ÿฌํ•œ ์‹œ์Šคํ…œ์€ ๋Œ€ํ™” ๋งค๋‹ˆ์ €(dialogue manager)์™€ ํ†ตํ•ฉ๋˜์–ด ๊ฐ„๋‹จํ•œ ๋Œ€ํ™”(chit-chat)๊ฐ€ ๊ฐ€๋Šฅํ•œ ๋ชฉ์  ์ง€ํ–ฅ ๋Œ€ํ™” ์‹œ์Šคํ…œ(task-oriented dialogue system)์„ ๊ตฌ์ถ•ํ•  ์ˆ˜๋„ ์žˆ๊ณ , ๋‹จ์ผ ์–ธ์–ด ์กฐ๊ฑด(monolingual condition)์„ ๋„˜์–ด ์Œ์„ฑ ๋ฒˆ์—ญ์—์„œ์˜ ์—๋Ÿฌ๋ฅผ ์ค„์ด๋Š” ๋ฐ์— ํ™œ์šฉ๋  ์ˆ˜๋„ ์žˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋ณธ๊ณ ๋ฅผ ํ†ตํ•ด, ์šด์œจ์— ๋ฏผ๊ฐํ•œ(prosody-sensitive) ์–ธ์–ด์—์„œ ์˜๋„ ์ดํ•ด๋ฅผ ์œ„ํ•œ ์ค‘์˜์„ฑ ํ•ด์†Œ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ์ด๋ฅผ ์‚ฐ์—… ๋ฐ ์—ฐ๊ตฌ ๋‹จ์—์„œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ด๊ณ ์ž ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๊ฐ€ ๋‹ค๋ฅธ ์–ธ์–ด ๋ฐ ๋„๋ฉ”์ธ์—์„œ๋„ ๊ณ ์งˆ์ ์ธ ์ค‘์˜์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด์†Œํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋˜๊ธธ ๋ฐ”๋ผ๋ฉฐ, ์ด๋ฅผ ์œ„ํ•ด ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๋ฐ์— ํ™œ์šฉ๋œ ๋ฆฌ์†Œ์Šค, ๊ฒฐ๊ณผ๋ฌผ ๋ฐ ์ฝ”๋“œ๋“ค์„ ๊ณต์œ ํ•จ์œผ๋กœ์จ ํ•™๊ณ„์˜ ๋ฐœ์ „์— ์ด๋ฐ”์ง€ํ•˜๊ณ ์ž ํ•œ๋‹ค.Ambiguity in the language is inevitable. It is because, albeit language is a means of communication, a particular concept that everyone thinks of cannot be conveyed in a perfectly identical manner. As this is an inevitable factor, ambiguity in language understanding often leads to breakdown or failure of communication. There are various hierarchies of language ambiguity. However, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to define the boundary after recognizing the ambiguity that can be well-defined and resolved. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve it. Although this phenomenon occurs in various languages, its degree and aspect depend on the language investigated. The factor we focus on is cases where the ambiguity comes from the gap between the amount of information in the spoken language and the text. 
    Here, we study Korean, a language in which sentence structure and intention often depend on prosody. In Korean, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, and similar phenomena. We first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences, given that such utterances can be problematic for intention understanding. In constructing the corpus, we consider the directivity and rhetoricalness of a sentence, which together form a criterion for classifying the intention of spoken language into statement, question, command, rhetorical question, and rhetorical command. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show that colloquial-corpus-based language models are effective in classifying ambiguous text given only textual data, and qualitatively analyze the characteristics of the task. We do not handle ambiguity only at the text level: to find out whether actual disambiguation is possible given a speech input, we design an artificial spoken-language corpus composed only of ambiguous sentences and resolve the ambiguity with various attention-based neural network architectures. In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio-processing module conveys attention information to the text module in a multi-hop manner. Finally, assuming that the ambiguity of intention understanding is resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level. By integrating a text-based ambiguity detection module with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation.
    Such a system can be integrated with a dialogue manager to build a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we want to show that ambiguity resolution for intention understanding in a prosody-sensitive language is achievable and can be utilized at the industry or research level. We hope that this study helps tackle chronic ambiguity issues in other languages or other domains, linking linguistic science and engineering approaches.
    Table of Contents
    1 Introduction: 1.1 Motivation; 1.2 Research Goal; 1.3 Outline of the Dissertation
    2 Related Work: 2.1 Spoken Language Understanding; 2.2 Speech Act and Intention (2.2.1 Performatives and statements; 2.2.2 Illocutionary act and speech act; 2.2.3 Formal semantic approaches); 2.3 Ambiguity of Intention Understanding in Korean (2.3.1 Ambiguities in language; 2.3.2 Speech act and intention understanding in Korean)
    3 Ambiguity in Intention Understanding of Spoken Language: 3.1 Intention Understanding and Ambiguity; 3.2 Annotation Protocol (3.2.1 Fragments; 3.2.2 Clear-cut cases; 3.2.3 Intonation-dependent utterances); 3.3 Data Construction (3.3.1 Source scripts; 3.3.2 Agreement; 3.3.3 Augmentation; 3.3.4 Train split); 3.4 Experiments and Results (3.4.1 Models; 3.4.2 Implementation; 3.4.3 Results); 3.5 Findings and Summary (3.5.1 Findings; 3.5.2 Summary)
    4 Disambiguation of Speech Intention: 4.1 Ambiguity Resolution (4.1.1 Prosody and syntax; 4.1.2 Disambiguation with prosody; 4.1.3 Approaches in SLU); 4.2 Dataset Construction (4.2.1 Script generation; 4.2.2 Label tagging; 4.2.3 Recording); 4.3 Experiments and Results (4.3.1 Models; 4.3.2 Results); 4.4 Summary
    5 System Integration and Application: 5.1 System Integration for Intention Identification (5.1.1 Proof of concept; 5.1.2 Preliminary study); 5.2 Application to Spoken Dialogue System (5.2.1 What is 'Free-running'; 5.2.2 Omakase chatbot); 5.3 Beyond Monolingual Approaches (5.3.1 Spoken language translation; 5.3.2 Dataset; 5.3.3 Analysis; 5.3.4 Discussion); 5.4 Summary
    6 Conclusion and Future Work
    Bibliography; Abstract (In Korean); Acknowledgment
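
The co-attention between textual and acoustic features that the abstract describes can be sketched as scaled dot-product cross-attention; the shapes and random features below are placeholders, and the dissertation's actual multi-hop architecture is more involved.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: one modality queries the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(1)
text = rng.normal(size=(5, 8))    # 5 token features (toy dimensions)
audio = rng.normal(size=(20, 8))  # 20 acoustic-frame features

# One "hop" of audio-informed text representation; stacking further hops
# would pass attention information back and forth between the modules.
text_with_audio = cross_attend(text, audio, audio)
print(text_with_audio.shape)  # (5, 8)
```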

    Exploring language contact and use among globally mobile populations: a qualitative study of English-speaking short-stay academic sojourners in the Republic of Korea

    This study explores the language contact and use of English-speaking sojourners in the Republic of Korea who had no knowledge of Korean language or culture prior to arriving in the country. The study focuses on mobile-technology-assisted language use. Study participants responded to an online survey about their experiences using the Korean language when interacting with Korean speakers, their free-time activities, and the types of digital and mobile technologies they used. The survey responses informed questions for later discussion groups, in which participants discussed challenges and solutions when encountering new linguistic and social scenarios with Korean speakers. Semi-structured interviews were employed to examine the linguistic, social, and technological dimensions of the study participants' brief sojourn in Korea in more depth. The interviews revealed a link between language contact, language use, and a mobile instant messaging application. In the second phase of the study, online surveys focused on the language and technology link discovered in the first phase. Throughout Phase Two, the researcher observed the study participants in a series of social contexts, such as informal English practice and university events. Phase Two concluded with semi-structured interviews that demonstrated language contact and use within mobile instant messaging chat rooms on participants' handheld smart devices. The two phases revealed three key factors influencing the language contact and use between the study participants and Korean speakers. Firstly, a mutual perspicacity for mobile technologies and digital communication supported their mediated, screen-to-screen and blended direct and mediated face-to-screen interactions. Secondly, Korea's advanced digital environment comprised handheld smart devices, smart-device applications, and ubiquitous high-speed Wi-Fi their Korean-speaking hosts to self-reliance.
    Thirdly, language use between the study participants and Korean speakers incorporated a range of sociolinguistic resources, including the exchange of symbols, small expressive images, photographs, video, and audio recordings, along with or in place of typed text. Using these resources also helped the study participants learn and take part in social and cultural practices, such as gifting digitally, within mobile instant messaging chat rooms. The findings of the study are drawn together in a new conceptual model, called sociolinguistic digital acuity, highlighting the optimal conditions for language contact and use during a brief sojourn in a country with an unfamiliar language and culture.

    Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO)

    Scientific studies of language behavior need to grapple with a large diversity of languages in the world and, for reading, a further variability in writing systems. Yet, the ability to form meaningful theories of reading is contingent on the availability of cross-linguistic behavioral data. This paper offers new insights into aspects of reading behavior that are shared and those that vary systematically across languages through an investigation of eye-tracking data from 13 languages recorded during text reading. We begin with reporting a bibliometric analysis of eye-tracking studies showing that the current empirical base is insufficient for cross-linguistic comparisons. We respond to this empirical lacuna by presenting the Multilingual Eye-Movement Corpus (MECO), the product of an international multi-lab collaboration. We examine which behavioral indices differentiate between reading in written languages, and which measures are stable across languages. One of the findings is that readers of different languages vary considerably in their skipping rate (i.e., the likelihood of not fixating on a word even once) and that this variability is explained by cross-linguistic differences in word length distributions. In contrast, if readers do not skip a word, they tend to spend a similar average time viewing it. We outline the implications of these findings for theories of reading. We also describe prospective uses of the publicly available MECO data, and its further development plans.
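
The skipping-rate measure (the likelihood of not fixating a word even once) and its relation to word length can be illustrated with toy data; the fixation records below are invented for the sketch and are not drawn from MECO.

```python
import numpy as np

# Toy single-reader record: which words received at least one fixation.
words   = ["the", "intergovernmental", "cat", "notwithstanding", "sat", "on"]
fixated = [False, True, False, True, True, False]

skip_rate = 1 - np.mean(fixated)          # share of words never fixated
lengths = np.array([len(w) for w in words])

# Longer words tend to be fixated; short words are more often skipped.
r = np.corrcoef(lengths, np.array(fixated, dtype=float))[0, 1]
print(f"skip rate = {skip_rate:.2f}, length/fixation correlation = {r:.2f}")
```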