6 research outputs found

    ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜๊ณผ ๋‹ค๋ณ€์ˆ˜ ์ •๋ณด ์ตœ์†Œํ™”

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. ์ฒœ์„ฑ์ค€.
    Speech is one of the most useful interfaces, enabling a person to communicate with distant others while using the hands for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control remains a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor. First, the conventional style control techniques that have been proposed for speech synthesis systems are introduced. To control speaker identity, emotion, accent, and prosody, we review control methods for both statistical parametric and deep learning-based speech synthesis systems. We propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because it can control, through two gates, the recurrent attention state corresponding to each output location. In experiments, GRA was found to be more effective than conventional techniques in transferring unseen styles, which implies that GRA generalizes better. We also propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges early in training and stays close to zero, the auxiliary loss causes little performance degradation. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests with fifteen speech experts validate that the proposed method improves the synthesizer in terms of quality as well as controllability.
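    The two-gate recurrence in GRA can be pictured as a GRU-style update of a per-utterance attention state. The following is a minimal PyTorch sketch under that reading; the class, projections, and the choice of update/reset gates are illustrative assumptions, not the thesis's exact formulation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GatedRecurrentAttention(nn.Module):
            # Sketch: content-based attention whose recurrent state is
            # rewritten at every decoder step through two gates (GRU-style).
            def __init__(self, enc_dim, query_dim, attn_dim):
                super().__init__()
                self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
                self.memory_proj = nn.Linear(enc_dim, attn_dim, bias=False)
                self.state_proj = nn.Linear(attn_dim, attn_dim, bias=False)
                self.context_proj = nn.Linear(enc_dim, attn_dim, bias=False)
                self.score = nn.Linear(attn_dim, 1, bias=False)
                self.update_gate = nn.Linear(2 * attn_dim, attn_dim)  # gate 1
                self.reset_gate = nn.Linear(2 * attn_dim, attn_dim)   # gate 2
                self.candidate = nn.Linear(2 * attn_dim, attn_dim)

            def forward(self, query, memory, state):
                # query: (B, Q), memory: (B, T, E), state: (B, A)
                energy = self.score(torch.tanh(
                    self.query_proj(query).unsqueeze(1)
                    + self.memory_proj(memory)
                    + self.state_proj(state).unsqueeze(1))).squeeze(-1)
                weights = F.softmax(energy, dim=-1)                    # (B, T)
                context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
                ctx = self.context_proj(context)
                u = torch.sigmoid(self.update_gate(torch.cat([ctx, state], -1)))
                r = torch.sigmoid(self.reset_gate(torch.cat([ctx, state], -1)))
                cand = torch.tanh(self.candidate(torch.cat([ctx, r * state], -1)))
                new_state = (1.0 - u) * state + u * cand
                return context, weights, new_state

    Because the state is fed back step by step, the attention response at a fixed input position can vary with output location, which is what the two gates modulate.

    The disentanglement objective builds on a standard identity: the total correlation among n latent factors decomposes into a chain of mutual information terms, each of which admits a variational upper bound such as CLUB (cited in the thesis contents below). A LaTeX sketch under assumed notation, not necessarily the thesis's exact objective:

        \[
        \mathrm{TC}(z_1,\dots,z_n)
        = \sum_{i=2}^{n} I\!\left(z_i;\, z_{1:i-1}\right)
        \le \sum_{i=2}^{n} \Big(
            \mathbb{E}_{p(z_i,\, z_{1:i-1})}\!\left[\log q_\theta(z_i \mid z_{1:i-1})\right]
          - \mathbb{E}_{p(z_i)\, p(z_{1:i-1})}\!\left[\log q_\theta(z_i \mid z_{1:i-1})\right]
        \Big)
        \]

    Each summand is a CLUB-style bound with a variational network q_theta; the bound is tight (and valid) when q_theta matches the true conditional. Minimizing the right-hand side as an auxiliary loss pushes the latent factors toward mutual independence.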
๋‹ค์Œ์œผ๋กœ ๋‘ ์‹œํ€€์Šค(sequence) ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ, ์ž…๋ ฅ ์‹œํ€€์Šค์— ๋”ฐ๋ผ ์ถœ๋ ฅ ์‹œํ€€์Šค๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์–ดํ…์…˜(attention) ๊ธฐ๋ฒ•์— ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์žฌ๊ท€์„ฑ์„ ์ถ”๊ฐ€ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜(Gated Recurrent Attention) ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ์ผ์ •ํ•œ ์ž…๋ ฅ์— ๋Œ€ํ•ด ์ถœ๋ ฅ ์œ„์น˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋Š” ๋‹ค์–‘ํ•œ ์ถœ๋ ฅ์„ ๋‘ ๊ฐœ์˜ ๊ฒŒ์ดํŠธ๋ฅผ ํ†ตํ•ด ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์ ํ•ฉํ•˜๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์—†์—ˆ๋˜ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š”๋ฐ ์žˆ์–ด ๊ธฐ์กด ๊ธฐ๋ฒ•์— ๋น„ํ•ด ์ž์—ฐ์Šค๋Ÿฌ์›€์ด๋‚˜ ์Šคํƒ€์ผ ์œ ์‚ฌ๋„ ๋ฉด์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์„ ์‹คํ—˜์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‹ค์Œ์œผ๋กœ ์„ธ ๊ฐœ ์ด์ƒ์˜ ์Šคํƒ€์ผ ์š”์†Œ๋“ค์˜ ์ƒํ˜ธ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์—ฌ๋Ÿฌ๊ฐœ์˜ ์ œ์–ด ์š”์†Œ๋“ค(factors)์„ ๋ณ€์ˆ˜๊ฐ„ ์ƒํ˜ธ์˜์กด์„ฑ ์ƒํ•œ ํ•ญ๋“ค์˜ ํ•ฉ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๊ณ , ์ด๋ฅผ ์ตœ์†Œํ™”ํ•˜์—ฌ ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค. ์ด ์ƒํ•œ ์ถ”์ •์น˜๋Š” ํ•™์Šต ์ดˆ๊ธฐ์— ์ˆ˜๋ ดํ•˜์—ฌ 0์— ๊ฐ€๊น๊ฒŒ ์œ ์ง€๋˜๊ธฐ ๋•Œ๋ฌธ์—, ์†์‹คํ•จ์ˆ˜๋ฅผ ๋”ํ•จ์œผ๋กœ์จ ์ƒ๊ธฐ๋Š” ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๊ฑฐ์˜ ์—†๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ๋‹ค์–ธ์–ด, ๋‹คํ™”์ž, ์Šคํƒ€์ผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋กœ ์Œ์„ฑํ•ฉ์„ฑ๊ธฐ๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ํ™œ์šฉ๋œ๋‹ค. 15๋ช…์˜ ์Œ์„ฑ ์ „๋ฌธ๊ฐ€๋“ค์˜ ์ฃผ๊ด€์ ์ธ ๋“ฃ๊ธฐ ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์ด ํ•ฉ์„ฑ๊ธฐ์˜ ์Šคํƒ€์ผ ์ œ์–ด๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ผ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ•ฉ์„ฑ์Œ์˜ ํ’ˆ์งˆ๊นŒ์ง€ ๋†’์ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.1 Introduction 1 1.1 Evolution of Speech Synthesis Technology 1 1.2 Attention-based Speech Synthesis Systems 2 1.2.1 Tacotron 2 1.2.2 Deep Convolutional TTS 3 1.3 Non-autoregressive Speech Synthesis Systems 6 1.3.1 Glow-TTS 6 1.3.2 SpeedySpeech 8 1.4 Outline of the thesis 8 2 Style Modeling Techniques for Speech Synthesis 13 2.1 Introduction 13 2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis 14 2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis 15 2.4 Summary 17 3 Gated Recurrent Attention for Multi-Style Speech Synthesis 19 3.1 Introduction 19 3.2 Related Works 20 3.2.1 Gated recurrent unit 20 3.2.2 Location-sensitive attention 22 3.3 Gated Recurrent Attention 24 3.4 Experiments and results 28 3.4.1 Tacotron2 with global style tokens 28 3.4.2 Decaying guided attention 29 3.4.3 Datasets and feature processing 30 3.4.4 Evaluation methods 32 3.4.5 Evaluation results 33 3.5 Guided attention and decaying guided attention 34 3.6 Summary 35 4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization 41 4.1 Introduction 41 4.2 Related Works 44 4.2.1 Disentanglement Studies for Speech Synthesis 44 4.2.2 Total Correlation and Mutual Information 45 4.2.3 CLUB:A Contrastive Log-ratio Upper Bound of Mutual Information 46 4.3 Proposed method 46 4.4 Experiments and Results 47 4.4.1 Quality and Naturalness of Speech 51 4.4.2 Speaker and style similarity 52 4.5 Summary 53 5 Conclusions 55 Bibliography 57 ์ดˆ ๋ก 67 ๊ฐ์‚ฌ์˜ ๊ธ€ 69๋ฐ•

    Modelling Speech Dynamics with Trajectory-HMMs

    Get PDF
    Institute for Communicating and Collaborative Systems
    The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first- and second-order regression coefficients to the observation feature vectors. Although this leads to improved performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs results in an inferior model, due to the incorrect handling of dynamic constraints. In this thesis I show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories, by performing a per-utterance normalisation. The resulting model can be trained by either maximising model log-likelihood or minimising mean generation error on the training data. To combat the exponential growth of paths during search, the idea of delayed path merging is proposed, and a new time-synchronous decoding algorithm built on the concept of token passing is designed for use in the recognition task. The Trajectory-HMM brings a new way of sharing knowledge between speech recognition and synthesis components, by tackling both problems in a coherent statistical framework. I evaluated the Trajectory-HMM on two different speech tasks using the speaker-dependent MOCHA-TIMIT database: first as a generative model to recover articulatory features from the speech signal, where the Trajectory-HMM was used in a way complementary to conventional HMM modelling techniques within a joint acoustic-articulatory framework. Experiments indicate that the jointly trained acoustic-articulatory models are more accurate (having a lower root mean square error) than the separately trained ones, and that Trajectory-HMM training yields greater accuracy than conventional Baum-Welch parameter updating. In addition, the root mean square (RMS) training objective proves consistently better than the maximum likelihood objective. However, experiments on the phone recognition task show that the MLE-trained Trajectory-HMM, while retaining the attractive property of being a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to argue that models giving a better fit on training data may suffer reduced discrimination by being too faithful to the training data. Finally, experiments using triphone models show that increasing modelling detail is an effective way to improve modelling performance with little added complexity in training.
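    In this family of models the augmented observation vector stacks the static features with their first- and second-order regression ("delta") coefficients, so o = Wc for a fixed banded matrix W built from the regression windows; the per-utterance normalisation is what turns the HMM likelihood over o into a proper density over the static trajectory c. The smoothed output mean trajectory then takes the standard form (a sketch under conventional notation, with state-sequence means \mu and covariances \Sigma):

        \[
        \bar{c} = \left(W^\top \Sigma^{-1} W\right)^{-1} W^\top \Sigma^{-1} \mu
        \]

    And a minimal numpy sketch of the delta computation itself; the window length and edge replication are illustrative assumptions:

        import numpy as np

        def add_dynamic_features(c, window=2):
            # c: (T, D) static feature matrix; returns (T, 3*D) with
            # first- and second-order regression coefficients appended.
            def deltas(x):
                x = np.asarray(x, dtype=float)
                taus = np.arange(1, window + 1)
                denom = 2.0 * np.sum(taus.astype(float) ** 2)
                padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
                d = np.zeros_like(x)
                T = len(x)
                for tau in taus:
                    # tau-weighted difference of frames t+tau and t-tau
                    d += tau * (padded[window + tau: window + tau + T]
                                - padded[window - tau: window - tau + T])
                return d / denom

            d1 = deltas(c)   # first-order dynamics
            d2 = deltas(d1)  # second-order dynamics
            return np.concatenate([np.asarray(c, dtype=float), d1, d2], axis=1)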

    IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech

    Get PDF
    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions, distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, it has been confirmed that extended versions of selected papers will be published as a special issue of the journal Applied Sciences, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", published by MDPI with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session. Red Española de Tecnologías del Habla. Universidad de Valladolid.

    Proceedings of the 7th Sound and Music Computing Conference

    Get PDF
    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21-24, 2010.