8,260 research outputs found

    Pronunciation Variation Analysis and CycleGAN-Based Feedback Generation for CAPT

    Doctoral dissertation, Seoul National University Graduate School, College of Humanities, Interdisciplinary Program in Cognitive Science, February 2020. Minhwa Chung (정민화).
    Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which makes it difficult to incorporate such knowledge into an automatic system. Moreover, most existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they depend on finding the right features for the task and on the accuracy of feature extraction. This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for the corrective feedback generation task are retained. Investigations of non-native Korean speech characteristics in contrast with those of native speakers, and of their correlation with accentedness judgments, show that both segmental and prosodic variations are important factors in a Korean CAPT system. The thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers from 27 mother-tongue backgrounds. 
The features are learned automatically, in an unsupervised way, in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to the distribution of native speech. To inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types defined in the first half of the thesis. The proposed approach generates a corrected version of the speech in the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.
    Interest in teaching Korean as a foreign language has risen and the number of Korean learners has grown substantially, and research on computer-assisted pronunciation training (CAPT) applications that apply spoken language processing technology is also being actively pursued. Nevertheless, existing Korean speaking-education systems do not make sufficient use of the linguistic characteristics of Korean as spoken by foreigners, nor do they apply the latest language processing technology. Possible causes are that phenomena in Korean spoken by foreigners have not been analyzed sufficiently, and that even where relevant research exists, more advanced research is needed to reflect it in an automated system. Moreover, CAPT technology in general depends on feature extraction such as signal processing, prosodic analysis, and natural language processing techniques, so finding suitable features and extracting them accurately takes a great deal of time and effort. This suggests that there is considerable room to improve this process by using the latest deep-learning-based language processing technology. Accordingly, this study first analyzed pronunciation variation patterns and linguistic correlations for CAPT system development. 
Variation patterns in read speech by non-native speakers were contrasted with those of native Korean speakers, the salient variations were identified, and correlation analysis was then used to determine how strongly each affects communication. The results confirmed that coda deletion, confusion of the three-way contrast, and suprasegmental errors need to be given priority in feedback generation when they occur. Automatically generating corrected feedback is one of the important tasks of a CAPT system. This study took the view that the task can be interpreted as a problem of changing the style of an utterance, and proposed modeling it within a cycle-consistent generative adversarial network (CycleGAN) architecture. The generator of the GAN learns the mapping between the distribution of non-native speech and that of native speech, and the cycle consistency loss preserves the overall structure of the utterance while preventing excessive correction. The required features are learned on their own, in an unsupervised manner, within the CycleGAN framework and without a separate feature extraction step, which makes the method easy to extend to other languages. The study proposed modeling the priorities among the salient variations revealed by the linguistic analysis within an Auxiliary Classifier CycleGAN architecture. This method incorporates knowledge into the existing CycleGAN, so that it generates feedback speech while simultaneously classifying which type of error the feedback addresses. Its significance lies in the fact that domain knowledge is retained through to the corrective feedback generation stage and remains controllable. 
To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 meaningful-word utterances from 217 speakers with 27 native languages, and a perceptual evaluation of whether and how much the speech improved was conducted. With the proposed method, speech can be converted to corrected pronunciation while keeping the learner's own voice, and a relative improvement of 16.67% over the conventional Pitch-Synchronous Overlap-and-Add algorithm was confirmed.
    Chapter 1. Introduction
      1.1. Motivation
        1.1.1. An Overview of CAPT Systems
        1.1.2. Survey of existing Korean CAPT Systems
      1.2. Problem Statement
      1.3. Thesis Structure
    Chapter 2. Pronunciation Analysis of Korean Produced by Chinese
      2.1. Comparison between Korean and Chinese
        2.1.1. Phonetic and Syllable Structure Comparisons
        2.1.2. Phonological Comparisons
      2.2. Related Works
      2.3. Proposed Analysis Method
        2.3.1. Corpus
        2.3.2. Transcribers and Agreement Rates
      2.4. Salient Pronunciation Variations
        2.4.1. Segmental Variation Patterns
          2.4.1.1. Discussions
        2.4.2. Phonological Variation Patterns
          2.4.1.2. Discussions
      2.5. Summary
    Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation
      3.1. Related Works
        3.1.1. Criteria used in L2 Speech
        3.1.2. Criteria used in L2 Korean Speech
      3.2. Proposed Human Evaluation Method
        3.2.1. Reading Prompt Design
        3.2.2. Evaluation Criteria Design
        3.2.3. Raters and Agreement Rates
      3.3. Linguistic Factors Affecting L2 Korean Accentedness
        3.3.1. Pearson's Correlation Analysis
        3.3.2. Discussions
        3.3.3. Implications for Automatic Feedback Generation
      3.4. Summary
    Chapter 4. Corrective Feedback Generation for CAPT
      4.1. Related Works
        4.1.1. Prosody Transplantation
        4.1.2. Recent Speech Conversion Methods
        4.1.3. Evaluation of Corrective Feedback
      4.2. Proposed Method: Corrective Feedback as a Style Transfer
        4.2.1. Speech Analysis at Spectral Domain
        4.2.2. Self-imitative Learning
        4.2.3. An Analogy: CAPT System and GAN Architecture
      4.3. Generative Adversarial Networks
        4.3.1. Conditional GAN
        4.3.2. CycleGAN
      4.4. Experiment
        4.4.1. Corpus
        4.4.2. Baseline Implementation
        4.4.3. Adversarial Training Implementation
        4.4.4. Spectrogram-to-Spectrogram Training
      4.5. Results and Evaluation
        4.5.1. Spectrogram Generation Results
        4.5.2. Perceptual Evaluation
        4.5.3. Discussions
      4.6. Summary
    Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation
      5.1. Linguistic Class Selection
      5.2. Auxiliary Classifier CycleGAN Design
      5.3. Experiment and Results
        5.3.1. Corpus
        5.3.2. Feature Annotations
        5.3.3. Experiment Setup
        5.3.4. Results
      5.4. Summary
    Chapter 6. Conclusion
      6.1. Thesis Results
      6.2. Thesis Contributions
      6.3. Recommendations for Future Work
    Bibliography
    Appendix
    Abstract in Korean
    Acknowledgments
    Docto
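
    The cycle consistency idea named in this record's abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the two "generators" below are hypothetical hand-written functions standing in for trained networks, the array shapes are arbitrary placeholders for spectrogram batches, and the L1 form of the loss is the commonly used CycleGAN formulation, assumed here rather than taken from the thesis.

```python
import numpy as np

def cycle_consistency_loss(x, y, g, f):
    """L1 cycle loss: mean|f(g(x)) - x| + mean|g(f(y)) - y|.

    g maps non-native -> native, f maps native -> non-native.
    """
    forward = np.mean(np.abs(f(g(x)) - x))   # x -> native -> back to x
    backward = np.mean(np.abs(g(f(y)) - y))  # y -> non-native -> back to y
    return forward + backward

# Hypothetical stand-ins for trained generators (assumption, not real models).
def g(spec):
    return spec * 2.0 + 1.0

def f(spec):
    return (spec - 1.0) / 2.0  # exact inverse of g

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 80, 100))  # placeholder "non-native" batch
y = rng.normal(size=(4, 80, 100))  # placeholder "native" batch

# When f undoes g, the round trip reproduces the input and the loss vanishes.
loss_inverse = cycle_consistency_loss(x, y, g, f)
# When f ignores what g did, the round trip drifts and the loss is large.
loss_mismatch = cycle_consistency_loss(x, y, g, lambda s: s)
print(loss_inverse, loss_mismatch)
```

    A low cycle loss means the mapping can be undone, which is one way to read the abstract's remark that the loss keeps the overall structure of the utterance and discourages over-correction.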

    Clearing the transcription hurdle in dialect corpus building : the corpus of Southern Dutch dialects as case-study

    This paper discusses how the transcription hurdle in dialect corpus building can be cleared. While corpus analysis has strongly gained in popularity in linguistic research, dialect corpora are still relatively scarce. This scarcity can be attributed to several factors, one of which is the challenging nature of transcribing dialects, given a lack of both orthographic norms for many dialects and speech technological tools trained on dialect data. This paper addresses the questions (i) how dialects can be transcribed efficiently and (ii) whether speech technological tools can lighten the transcription work. These questions are tackled using the Southern Dutch dialects (SDDs) as a case study, for which the usefulness of automatic speech recognition (ASR), respeaking, and forced alignment is considered. Tests with these tools indicate that dialects still constitute a major speech technological challenge. In the case of the SDDs, the decision was made to use speech technology only for the word-level segmentation of the audio files, as the transcription itself could not be sped up by ASR tools. The discussion does, however, indicate that the usefulness of ASR and other related tools for a dialect corpus project is strongly determined by the sound quality of the dialect recordings, the availability of statistical dialect-specific models, the degree of linguistic differentiation between the dialects and the standard language, and the goals the transcripts have to serve.

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives for measuring the performance of multimedia search engines. From a socio-economic perspective, we take inventory of the impact and legal consequences of these technical advances and point out future directions of research.

    20 years of technology and language assessment in Language Learning & Technology


    20 years of technology and language assessment

    This review article provides an analysis of the research from the last two decades on the theme of technology and second language assessment. Based on an examination of the assessment scholarship published in Language Learning & Technology since its launch in 1997, we analyzed the review articles, research articles, book reviews, and commentaries as developing one of two primary thrusts of research on technology and language assessment: technology for efficiency and technology for innovation

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour in a way that is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour. 
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
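
    The moving-average-over-time-aligned-frames idea described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's procedure: the function name `tama`, the frame length and step, and the synthetic pitch tracks are all hypothetical, since the abstract gives no parameter values.

```python
import numpy as np

def tama(times, values, frame_len, frame_step, total_dur):
    """Average a feature inside overlapping frames on a shared time grid.

    Frame boundaries depend only on frame_len, frame_step and total_dur,
    so frames computed for different speakers line up in time.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    starts = np.arange(0.0, total_dur - frame_len + 1e-9, frame_step)
    means = np.full(len(starts), np.nan)  # NaN marks frames with no data
    for i, start in enumerate(starts):
        in_frame = (times >= start) & (times < start + frame_len)
        if in_frame.any():
            means[i] = values[in_frame].mean()
    return starts, means

# Synthetic pitch tracks (assumption): speaker B drifts together with A.
t = np.arange(0.0, 60.0, 0.5)
pitch_a = 200.0 + 20.0 * np.sin(t / 10.0)
pitch_b = 120.0 + 10.0 * np.sin(t / 10.0)

# Placeholder settings: 20 s frames advanced in 10 s steps.
_, a = tama(t, pitch_a, frame_len=20.0, frame_step=10.0, total_dur=60.0)
_, b = tama(t, pitch_b, frame_len=20.0, frame_step=10.0, total_dur=60.0)

# Correlate the two smoothed series over frames where both speakers have data.
both = ~np.isnan(a) & ~np.isnan(b)
r = np.corrcoef(a[both], b[both])[0, 1]
print(r)
```

    Because both speakers share one frame grid, the two smoothed series can be compared frame by frame, which is what makes the time series modeling mentioned in the abstract possible.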