138,894 research outputs found

    ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜๊ณผ ๋‹ค๋ณ€์ˆ˜ ์ •๋ณด ์ตœ์†Œํ™”

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์ฒœ์„ฑ์ค€.Speech is one the most useful interface that enables a person to communicate with distant others while using hands for other tasks. With the growing usage of speech interfaces in mobile devices, home appliances, and automobiles, the research on human-machine speech interface is expanding. This thesis deals with the speech synthesis which enable machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor-by-factor. First, the conventional style control techniques which have proposed for speech synthesis systems are introduced. In order to control speaker identity, emotion, accent, prosody, we introduce the control method both for statistical parametric-based and deep learning-based speech synthesis systems. We propose a gated recurrent attention (GRA), a novel attention mechanism with a controllable gated recurence. GRA is suitable for learning various styles because it can control the recurrent state for attention corresponds to the location with two gates. By experiments, GRA was found to be more effective in transferring unseen styles, which implies that the GRA outperform in generalization to conventional techniques. We propose a multivariate information minimization method which disentangle three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate the proposed method can improve the synthesizer in terms of quality as well as controllability.์Œ์„ฑ์€ ์‚ฌ๋žŒ์ด ์†์œผ๋กœ ๋‹ค๋ฅธ ์ผ์„ ํ•˜๋ฉด์„œ๋„, ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ์ƒ๋Œ€์™€ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์žฅ ์œ ์šฉํ•œ ์ธํ„ฐํŽ˜์ด์Šค ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ์‚ฌ๋žŒ์ด ์ƒํ™œ์—์„œ ๋ฐ€์ ‘ํ•˜๊ฒŒ ์ ‘ํ•˜๋Š” ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ, ๊ฐ€์ „, ์ž๋™์ฐจ ๋“ฑ์—์„œ ์Œ์„ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ํ™œ์šฉํ•˜๊ฒŒ ๋˜๋ฉด์„œ, ๊ธฐ๊ณ„์™€ ์‚ฌ๋žŒ ๊ฐ„์˜ ์Œ์„ฑ ์ธํ„ฐํŽ˜์ด์Šค์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ๋‚ ๋กœ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ๊ณ„๊ฐ€ ์Œ์„ฑ์„ ๋งŒ๋“œ๋Š” ๊ณผ์ •์ธ ์Œ์„ฑ ํ•ฉ์„ฑ์„ ๋‹ค๋ฃฌ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์ด ์ ์šฉ๋˜๋ฉด์„œ ํ•ฉ์„ฑ๋œ ์Œ์„ฑ์˜ ํ’ˆ์งˆ์€ ์‚ฌ๋žŒ์˜ ์Œ์„ฑ๊ณผ ์œ ์‚ฌํ•ด์กŒ์ง€๋งŒ, ์ž์—ฐ์Šค๋Ÿฌ์šด ์Šคํƒ€์ผ์˜ ์ œ์–ด๋Š” ์•„์ง๋„ ๋„์ „์ ์ธ ๊ณผ์ œ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์šด์œจ๊ณผ ๊ฐ์ •์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ฒ•๋“ค์„ ์ œ์•ˆํ•˜๋ฉฐ, ์Šคํƒ€์ผ์„ ์š”์†Œ๋ณ„๋กœ ์ œ์–ดํ•˜์—ฌ ์†์‰ฝ๊ฒŒ ์›ํ•˜๋Š” ์Šคํƒ€์ผ์˜ ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋จผ์ € ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ์กด ์Šคํƒ€์ผ ์ œ์–ด ๊ธฐ๋ฒ•๋“ค์„ ์†Œ๊ฐœํ•œ๋‹ค. ํ™”์ž, ๊ฐ์ •, ๋งํˆฌ๋‚˜, ์Œ์šด ๋“ฑ์„ ์ œ์–ดํ•˜๋ฉด์„œ๋„ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฐœํ™”๋ฅผ ํ•ฉ์„ฑํ•˜๊ณ ์ž ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•๋“ค๊ณผ, ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. 
๋‹ค์Œ์œผ๋กœ ๋‘ ์‹œํ€€์Šค(sequence) ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ, ์ž…๋ ฅ ์‹œํ€€์Šค์— ๋”ฐ๋ผ ์ถœ๋ ฅ ์‹œํ€€์Šค๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์–ดํ…์…˜(attention) ๊ธฐ๋ฒ•์— ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์žฌ๊ท€์„ฑ์„ ์ถ”๊ฐ€ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜(Gated Recurrent Attention) ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ์ผ์ •ํ•œ ์ž…๋ ฅ์— ๋Œ€ํ•ด ์ถœ๋ ฅ ์œ„์น˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋Š” ๋‹ค์–‘ํ•œ ์ถœ๋ ฅ์„ ๋‘ ๊ฐœ์˜ ๊ฒŒ์ดํŠธ๋ฅผ ํ†ตํ•ด ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์ ํ•ฉํ•˜๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์—†์—ˆ๋˜ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š”๋ฐ ์žˆ์–ด ๊ธฐ์กด ๊ธฐ๋ฒ•์— ๋น„ํ•ด ์ž์—ฐ์Šค๋Ÿฌ์›€์ด๋‚˜ ์Šคํƒ€์ผ ์œ ์‚ฌ๋„ ๋ฉด์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์„ ์‹คํ—˜์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‹ค์Œ์œผ๋กœ ์„ธ ๊ฐœ ์ด์ƒ์˜ ์Šคํƒ€์ผ ์š”์†Œ๋“ค์˜ ์ƒํ˜ธ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์—ฌ๋Ÿฌ๊ฐœ์˜ ์ œ์–ด ์š”์†Œ๋“ค(factors)์„ ๋ณ€์ˆ˜๊ฐ„ ์ƒํ˜ธ์˜์กด์„ฑ ์ƒํ•œ ํ•ญ๋“ค์˜ ํ•ฉ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๊ณ , ์ด๋ฅผ ์ตœ์†Œํ™”ํ•˜์—ฌ ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค. ์ด ์ƒํ•œ ์ถ”์ •์น˜๋Š” ํ•™์Šต ์ดˆ๊ธฐ์— ์ˆ˜๋ ดํ•˜์—ฌ 0์— ๊ฐ€๊น๊ฒŒ ์œ ์ง€๋˜๊ธฐ ๋•Œ๋ฌธ์—, ์†์‹คํ•จ์ˆ˜๋ฅผ ๋”ํ•จ์œผ๋กœ์จ ์ƒ๊ธฐ๋Š” ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๊ฑฐ์˜ ์—†๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ๋‹ค์–ธ์–ด, ๋‹คํ™”์ž, ์Šคํƒ€์ผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋กœ ์Œ์„ฑํ•ฉ์„ฑ๊ธฐ๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ํ™œ์šฉ๋œ๋‹ค. 15๋ช…์˜ ์Œ์„ฑ ์ „๋ฌธ๊ฐ€๋“ค์˜ ์ฃผ๊ด€์ ์ธ ๋“ฃ๊ธฐ ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์ด ํ•ฉ์„ฑ๊ธฐ์˜ ์Šคํƒ€์ผ ์ œ์–ด๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ผ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ•ฉ์„ฑ์Œ์˜ ํ’ˆ์งˆ๊นŒ์ง€ ๋†’์ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.1 Introduction 1 1.1 Evolution of Speech Synthesis Technology 1 1.2 Attention-based Speech Synthesis Systems 2 1.2.1 Tacotron 2 1.2.2 Deep Convolutional TTS 3 1.3 Non-autoregressive Speech Synthesis Systems 6 1.3.1 Glow-TTS 6 1.3.2 SpeedySpeech 8 1.4 Outline of the thesis 8 2 Style Modeling Techniques for Speech Synthesis 13 2.1 Introduction 13 2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis 14 2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis 15 2.4 Summary 17 3 Gated Recurrent Attention for Multi-Style Speech Synthesis 19 3.1 Introduction 19 3.2 Related Works 20 3.2.1 Gated recurrent unit 20 3.2.2 Location-sensitive attention 22 3.3 Gated Recurrent Attention 24 3.4 Experiments and results 28 3.4.1 Tacotron2 with global style tokens 28 3.4.2 Decaying guided attention 29 3.4.3 Datasets and feature processing 30 3.4.4 Evaluation methods 32 3.4.5 Evaluation results 33 3.5 Guided attention and decaying guided attention 34 3.6 Summary 35 4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization 41 4.1 Introduction 41 4.2 Related Works 44 4.2.1 Disentanglement Studies for Speech Synthesis 44 4.2.2 Total Correlation and Mutual Information 45 4.2.3 CLUB:A Contrastive Log-ratio Upper Bound of Mutual Information 46 4.3 Proposed method 46 4.4 Experiments and Results 47 4.4.1 Quality and Naturalness of Speech 51 4.4.2 Speaker and style similarity 52 4.5 Summary 53 5 Conclusions 55 Bibliography 57 ์ดˆ ๋ก 67 ๊ฐ์‚ฌ์˜ ๊ธ€ 69๋ฐ•

    Synthesis using speaker adaptation from speech recognition DB

    This paper deals with the creation of multiple voices from a Hidden Markov Model based speech synthesis system (HTS). More than 150 Catalan synthetic voices were built using Hidden Markov Models (HMM) and speaker adaptation techniques. Training data for building a Speaker-Independent (SI) model were selected from both a general purpose speech synthesis database (FestCat) and a database designed for training Automatic Speech Recognition (ASR) systems (Catalan SpeeCon database). The SpeeCon database was also used to adapt the SI model to different speakers. Using an ASR-oriented database for TTS purposes provided many different amateur voices, with only a few minutes of recordings not made in studio conditions. This paper shows how speaker adaptation techniques provide the right tools to generate multiple voices with very little adaptation data. A subjective evaluation was carried out to assess the intelligibility and naturalness of the generated voices, as well as the similarity of the adapted voices to both the original speaker and the average voice from the SI model.
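    The abstract does not spell out the adaptation math. For intuition, here is a simplified numpy sketch of MLLR-style mean adaptation, the family of affine-transform techniques HMM adaptation frameworks such as HTS build on; the paper's exact method is not stated. Hard state alignments, identity covariances, and a single global transform are simplifying assumptions, and all names are illustrative.

```python
# Sketch: estimate an affine transform W = [A b] that maps the
# speaker-independent Gaussian means toward a new speaker's data,
# then apply it to all means (MLLR-style mean adaptation).
import numpy as np

def estimate_mllr_mean_transform(si_means, frames, frame_to_state):
    """si_means: (S, D) SI model means; frames: (N, D) adaptation
    frames; frame_to_state: (N,) hard alignment of frames to states.
    Returns W of shape (D, D+1) by least squares (identity covariances
    assumed for brevity)."""
    n = frames.shape[0]
    # Extended mean vector xi = [mu; 1] for each frame's aligned state.
    xi = np.hstack([si_means[frame_to_state], np.ones((n, 1))])
    # Solve frames ~= xi @ W.T in the least-squares sense.
    wt, *_ = np.linalg.lstsq(xi, frames, rcond=None)
    return wt.T

def adapt_means(si_means, w):
    """Apply mu_adapted = A @ mu + b to every SI mean."""
    ext = np.hstack([si_means, np.ones((si_means.shape[0], 1))])
    return ext @ w.T
```

    Because one shared transform is estimated from all adaptation frames, a few minutes of speech suffice to shift every Gaussian in the model toward the target voice, which is what makes adaptation from small, non-studio ASR recordings workable.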

    Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

    Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance-speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2x on CPU and 3.3x on GPU.
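    The quasi-recurrent mechanism is the part most easily shown in code: the expensive affine transforms run as convolutions over time, and the only sequential operation left is cheap element-wise gating ("fo-pooling"). A minimal PyTorch sketch, with layer sizes and names as assumptions rather than the paper's architecture:

```python
# Sketch of one quasi-recurrent (QRNN-style) layer: gates come from a
# causal convolution; the per-step recurrence is element-wise only.
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    def __init__(self, in_dim, hidden, kernel=2):
        super().__init__()
        # One convolution produces candidate, forget, and output gates.
        self.conv = nn.Conv1d(in_dim, 3 * hidden, kernel, padding=kernel - 1)
        self.hidden = hidden

    def forward(self, x):
        # x: (B, T, in_dim)
        t_len = x.size(1)
        g = self.conv(x.transpose(1, 2))[..., :t_len]  # trim lookahead (causal)
        z, f, o = g.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.size(0), self.hidden, device=x.device)
        outs = []
        for t in range(t_len):  # no matrix multiply inside the time loop
            c = f[:, t] * c + (1 - f[:, t]) * z[:, t]  # fo-pooling
            outs.append(o[:, t] * c)
        return torch.stack(outs, dim=1)                # (B, T, hidden)
```

    Moving all affine work out of the time loop is what yields the CPU/GPU inference speedups the abstract reports, since the remaining recurrence parallelizes trivially across the feature dimension.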