1,377 research outputs found

    Speech Synthesis Based on Hidden Markov Models


    Capture, Learning, and Synthesis of 3D Speaking Styles

    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de. (To appear in CVPR 2019.)
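    As a concrete picture of the identity conditioning described above, here is a minimal PyTorch sketch in which a one-hot subject label is concatenated with per-frame audio features and regressed to 3D vertex offsets over a template face mesh. The layer sizes, the 29-dim audio features, and the 5023-vertex output are illustrative assumptions (roughly matching the abstract's 12 training subjects), not the released VOCA architecture.

```python
# A hedged sketch of speaker-conditioned audio-to-mesh regression in the
# spirit of VOCA; all names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechToVertexOffsets(nn.Module):
    def __init__(self, audio_dim=29, num_subjects=12, num_vertices=5023):
        super().__init__()
        # Concatenating a subject one-hot with the audio features lets the
        # network factor identity (speaking style) from facial motion.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + num_subjects, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # The decoder regresses per-frame offsets added to a template mesh.
        self.decoder = nn.Linear(128, num_vertices * 3)

    def forward(self, audio_feats, subject_onehot):
        # audio_feats: (B, T, audio_dim); subject_onehot: (B, num_subjects)
        cond = subject_onehot.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        h = self.encoder(torch.cat([audio_feats, cond], dim=-1))
        return self.decoder(h).view(h.size(0), h.size(1), -1, 3)
```

    In a setup like this, feeding a different (or interpolated) one-hot vector at inference time is one simple way a speaking-style control could be exposed to an animator.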

    ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ๊น€๋‚จ์ˆ˜.The neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has shown remarkable generated speech quality, there are still remaining problems such as modeling power in a neural statistical parametric speech synthesis system, style expressiveness, and robust attention model in the end-to-end speech synthesis system. In this thesis, novel alternatives are proposed to resolve these drawbacks of the conventional neural speech synthesis system. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. Also, we apply an adversarial learning scheme in training AdVRNN to overcome the oversmoothing problem. From the experimental results, we have found that the proposed AdVRNN based method outperforms the conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in style embedding by applying MINE loss term in the loss function. The experimental results show that the MINE-based method has shown promising performance in both speech quality and style similarity for the global style token-Tacotron. In the third approach, we propose a novel attention method called memory attention for end-to-end speech synthesis, which is inspired by the gating mechanism of long-short term memory (LSTM). Leveraging the gating technique's sequence modeling power in LSTM, memory attention obtains the stable alignment from the content-based and location-based features. We evaluate the memory attention and compare its performance with various conventional attention techniques in single speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can generate speech with large variability robustly. In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. The conventional single attention model may limit the expressivity representing numerous alignment paths depending on style. To achieve a variation in attention alignment, we propose using a multi-attention model with a selection network. The multi-attention plays a role in generating candidates for the target style, and the selection network choose the most proper attention among the multi-attention. The experimental results show that selective multi-attention outperforms the conventional single attention techniques in multi-speaker speech synthesis and emotional speech synthesis.๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ์ˆ ์€ ์ง€๋‚œ ๋ช‡ ๋…„๊ฐ„ ํš”๋ฐœํ•˜๊ฒŒ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ๋‹ค. ๋”ฅ๋Ÿฌ๋‹์˜ ๋‹ค์–‘ํ•œ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์Œ์„ฑ ํ•ฉ์„ฑ ํ’ˆ์งˆ์€ ๋น„์•ฝ์ ์œผ๋กœ ๋ฐœ์ „ํ–ˆ์ง€๋งŒ, ์•„์ง ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ์—๋Š” ์—ฌ๋Ÿฌ ๋ฌธ์ œ๊ฐ€ ์กด์žฌํ•œ๋‹ค. 
๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ ์Œํ–ฅ ๋ชจ๋ธ์˜ deterministicํ•œ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง ๋Šฅ๋ ฅ์˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์Šคํƒ€์ผ์„ ํ‘œํ˜„ํ•˜๋Š” ๋Šฅ๋ ฅ๊ณผ ๊ฐ•์ธํ•œ ์–ดํ…์…˜(attention)์— ๋Œ€ํ•œ ์ด์Šˆ๊ฐ€ ๋Š์ž„์—†์ด ์žฌ๊ธฐ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•  ์ƒˆ๋กœ์šด ๋Œ€์•ˆ์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๋‰ด๋Ÿด ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐฉ์‹์˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง์„ ๊ณ ๋„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ adversarially trained variational recurrent neural network (AdVRNN) ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. AdVRNN ๊ธฐ๋ฒ•์€ VRNN์„ ์Œ์„ฑ ํ•ฉ์„ฑ์— ์ ์šฉํ•˜์—ฌ ์Œ์„ฑ์˜ ๋ณ€ํ™”๋ฅผ stochastic ํ•˜๊ณ  ์ž์„ธํ•˜๊ฒŒ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ, ์ ๋Œ€์  ํ•™์Šต์ (adversarial learning) ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ oversmoothing ๋ฌธ์ œ๋ฅผ ์ตœ์†Œํ™” ์‹œํ‚ค๋„๋ก ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด์˜ ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ์œ„ํ•œ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ global style token(GST) ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ, ๋น„์ง€๋„ ํ•™์Šต์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์›ํ•˜๋Š” ๋ชฉํ‘œ ์Šคํƒ€์ผ์ด ์žˆ์–ด๋„ ์ด๋ฅผ ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ค๊ธฐ ์–ด๋ ค์› ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด GST์˜ ์ถœ๋ ฅ๊ณผ ๋ชฉํ‘œ ์Šคํƒ€์ผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ตœ๋Œ€ํ™” ํ•˜๋„๋ก ํ•™์Šต ์‹œํ‚ค๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ์†์‹คํ•จ์ˆ˜์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ mutual information neural estimator(MINE) ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•˜์˜€๊ณ  ๋‹คํ™”์ž ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ธฐ์กด์˜ GST ๊ธฐ๋ฒ•์— ๋น„ํ•ด ๋ชฉํ‘œ ์Šคํƒ€์ผ์„ ๋ณด๋‹ค ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์„ธ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๊ฐ•์ธํ•œ ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์–ดํ…์…˜์ธ memory attention์„ ์ œ์•ˆํ•œ๋‹ค. Long-short term memory(LSTM)์˜ gating ๊ธฐ์ˆ ์€ sequence๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š”๋ฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์™”๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ์–ดํ…์…˜์— ์ ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ๊ฐ€์ง„ ์Œ์„ฑ์—์„œ๋„ ์–ดํ…์…˜์˜ ๋Š๊น€, ๋ฐ˜๋ณต ๋“ฑ์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋‹จ์ผ ํ™”์ž์™€ ๊ฐ์ • ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ํ† ๋Œ€๋กœ memory attention์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜์˜€์œผ๋ฉฐ ๊ธฐ์กด ๊ธฐ๋ฒ• ๋Œ€๋น„ ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ณก์„ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, selective multi-attention (SMA)์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ์–ดํ…์…˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‚ญ๋…์ฒด ๋‹จ์ผํ™”์ž์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์€ ๋‹จ์ผ ์–ดํ…์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์™”๋‹ค. ํ•˜์ง€๋งŒ ์Šคํƒ€์ผ ์Œ์„ฑ์˜ ๊ฒฝ์šฐ ๋ณด๋‹ค ๋‹ค์–‘ํ•œ ์–ดํ…์…˜ ํ‘œํ˜„์„ ์š”๊ตฌํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘ ์–ดํ…์…˜์„ ํ™œ์šฉํ•˜์—ฌ ํ›„๋ณด๋“ค์„ ์ƒ์„ฑํ•˜๊ณ  ์ด๋ฅผ ์„ ํƒ ๋„คํŠธ์›Œํฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ์˜ ์–ดํ…์…˜์„ ์„ ํƒํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. 
SMA ๊ธฐ๋ฒ•์€ ๊ธฐ์กด์˜ ์–ดํ…์…˜๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ๋ณด๋‹ค ๋งŽ์€ ์Šคํƒ€์ผ์„ ์•ˆ์ •์ ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค.1 Introduction 1 1.1 Background 1 1.2 Scope of thesis 3 2 Neural Speech Synthesis System 7 2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7 2.2 Overview of End-to-end Speech Synthesis System 9 2.3 Tacotron2 10 2.4 Attention Mechanism 12 2.4.1 Location Sensitive Attention 12 2.4.2 Forward Attention 13 2.4.3 Dynamic Convolution Attention 14 3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17 3.1 Introduction 17 3.2 Background 19 3.2.1 Variational Autoencoder 19 3.2.2 Variational Recurrent Neural Network 20 3.3 Speech Synthesis Using AdVRNN 22 3.3.1 AdVRNN based Acoustic Modeling 23 3.3.2 Training Procedure 24 3.4 Experiments 25 3.4.1 Objective performance evaluation 28 3.4.2 Subjective performance evaluation 29 3.5 Summary 29 4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31 4.1 Introduction 31 4.2 Background 33 4.2.1 Mutual Information 33 4.2.2 Mutual Information Neural Estimator 34 4.2.3 Global Style Token 34 4.3 Style Token end-to-end speech synthesis using MINE 35 4.4 Experiments 36 4.5 Summary 38 5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45 5.1 Introduction 45 5.2 BACKGROUND 48 5.3 Memory Attention 49 5.4 Experiments 52 5.4.1 Experiments on Single Speaker Speech Synthesis 53 5.4.2 Experiments on Emotional Speech Synthesis 56 5.5 Summary 59 6 Selective Multi-attention for style-adaptive end-to-End Speech Syn-thesis 63 6.1 Introduction 63 6.2 BACKGROUND 65 6.3 Selective multi-attention model 66 6.4 EXPERIMENTS 67 6.4.1 Multi-speaker speech synthesis experiments 68 6.4.2 Experiments on Emotional Speech Synthesis 73 6.5 Summary 77 7 Conclusions 79 Bibliography 83 ์š”์•ฝ 93 ๊ฐ์‚ฌ์˜ ๊ธ€ 95Docto

    Recent development of the HMM-based speech synthesis system (HTS)

    A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, the spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development platform for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.
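    Generating waveforms "from the HMMs themselves" refers to the speech parameter generation step: with static and dynamic (delta) features modeled jointly, the most likely static trajectory c solves (W^T U^-1 W) c = W^T U^-1 mu, where mu and U are the frame-wise means and diagonal covariances and W stacks the static/delta windows. Below is a minimal NumPy sketch for one feature dimension, with an illustrative delta window; it shows the linear-algebra step only, not the HTS implementation.

```python
# A minimal numpy sketch of maximum-likelihood parameter generation (MLPG):
# solve (W' U^-1 W) c = W' U^-1 mu for a smooth static trajectory c.
# The delta window and dimensions are illustrative.
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """mu, var: (T, 2) frame-wise [static, delta] means and variances
    for one feature dimension. Returns the (T,) static trajectory."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row selects c_t
        for k, w in enumerate(delta_win):      # delta row: weighted neighbors
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    U_inv = np.diag(1.0 / var.reshape(-1))     # diagonal precision matrix
    A = W.T @ U_inv @ W
    b = W.T @ U_inv @ mu.reshape(-1)
    return np.linalg.solve(A, b)

# Example: a rising static mean with zero delta means yields a trajectory
# smoothed by the delta constraints.
mu = np.stack([np.linspace(0.0, 1.0, 10), np.zeros(10)], axis=1)
c = mlpg(mu, np.full((10, 2), 0.1))
```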

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering

    • โ€ฆ
    corecore