15,140 research outputs found

    ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•

    Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Kim Nam Soo.

    Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has achieved remarkable generated speech quality, problems remain: limited modeling power of the acoustic model in neural statistical parametric speech synthesis, and limited style expressiveness and non-robust attention in end-to-end speech synthesis. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing a mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in the style embedding by adding a MINE term to the loss function. The experimental results show that the MINE-based method yields promising performance in both speech quality and style similarity for the global style token (GST) Tacotron. In the third approach, we propose a novel attention method for end-to-end speech synthesis, called memory attention, inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of LSTM's gating technique, memory attention obtains stable alignments from content-based and location-based features. We evaluate memory attention against various conventional attention techniques in single-speaker and emotional speech synthesis scenarios and conclude that it can robustly generate speech with large variability. In the last approach, we propose selective multi-attention (SMA) for style-adaptive end-to-end speech synthesis systems. A conventional single attention model may limit the expressivity needed to represent the numerous alignment paths that vary with style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network: the multiple attentions generate alignment candidates for the target style, and the selection network chooses the most appropriate one. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker and emotional speech synthesis.

    Deep learning-based speech synthesis has been actively developed over the past few years. Although synthesis quality has improved dramatically through a variety of deep learning techniques, several problems still remain. Neural statistical parametric approaches rely on deterministic acoustic models, which limits their modeling power, while end-to-end models continue to raise issues of style expressiveness and robust attention. This thesis proposes new alternatives that address these drawbacks of existing deep learning-based speech synthesis systems. As the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN) to enhance acoustic modeling in the neural statistical parametric framework. AdVRNN applies a VRNN to speech synthesis so that the variability of speech can be modeled stochastically and in detail, and adversarial learning is used to minimize the oversmoothing problem. The proposed algorithm was confirmed to improve performance compared with conventional recurrent-neural-network-based acoustic models. As the second approach, we propose a new mutual-information-based training method for style-adaptive end-to-end speech synthesis. Because conventional global style token (GST)-based style synthesis is trained in an unsupervised manner, it is difficult to focus training on a desired target style even when one is available. To resolve this, we propose maximizing the mutual information between the GST output and the target style embedding vector; the mutual information neural estimator (MINE) is adopted to incorporate this term into the loss function of the end-to-end model, and multi-speaker experiments confirmed that the target style can be learned more intensively than with the conventional GST method. As the third approach, we propose memory attention, a robust attention mechanism for end-to-end speech synthesis. The gating mechanism of long short-term memory (LSTM) has shown strong performance in sequence modeling; applying this mechanism to attention, we propose a method that minimizes attention failures such as skipping and repetition even for speech with diverse styles. The performance of memory attention was verified on single-speaker and emotional speech synthesis, confirming more stable attention alignments than conventional techniques. As the last approach, we propose a style-adaptive end-to-end attention method based on selective multi-attention (SMA). Previous work on style-adaptive end-to-end synthesis has used a single attention model, as in the single-speaker read-speech case, but stylistic speech demands more diverse attention behavior. We therefore generate candidates with multiple attentions and use a selection network to choose the optimal one. Comparative experiments against conventional attention confirmed that SMA can stably express a wider range of styles.

    Contents:
    1 Introduction 1
      1.1 Background 1
      1.2 Scope of thesis 3
    2 Neural Speech Synthesis System 7
      2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7
      2.2 Overview of End-to-end Speech Synthesis System 9
      2.3 Tacotron2 10
      2.4 Attention Mechanism 12
        2.4.1 Location Sensitive Attention 12
        2.4.2 Forward Attention 13
        2.4.3 Dynamic Convolution Attention 14
    3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17
      3.1 Introduction 17
      3.2 Background 19
        3.2.1 Variational Autoencoder 19
        3.2.2 Variational Recurrent Neural Network 20
      3.3 Speech Synthesis Using AdVRNN 22
        3.3.1 AdVRNN based Acoustic Modeling 23
        3.3.2 Training Procedure 24
      3.4 Experiments 25
        3.4.1 Objective performance evaluation 28
        3.4.2 Subjective performance evaluation 29
      3.5 Summary 29
    4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31
      4.1 Introduction 31
      4.2 Background 33
        4.2.1 Mutual Information 33
        4.2.2 Mutual Information Neural Estimator 34
        4.2.3 Global Style Token 34
      4.3 Style Token End-to-end Speech Synthesis using MINE 35
      4.4 Experiments 36
      4.5 Summary 38
    5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45
      5.1 Introduction 45
      5.2 Background 48
      5.3 Memory Attention 49
      5.4 Experiments 52
        5.4.1 Experiments on Single Speaker Speech Synthesis 53
        5.4.2 Experiments on Emotional Speech Synthesis 56
      5.5 Summary 59
    6 Selective Multi-attention for Style-adaptive End-to-End Speech Synthesis 63
      6.1 Introduction 63
      6.2 Background 65
      6.3 Selective Multi-attention Model 66
      6.4 Experiments 67
        6.4.1 Multi-speaker Speech Synthesis Experiments 68
        6.4.2 Experiments on Emotional Speech Synthesis 73
      6.5 Summary 77
    7 Conclusions 79
    Bibliography 83
    Abstract (in Korean) 93
    Acknowledgments 95

    Nonparallel Emotional Speech Conversion

    We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which are not available in most real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpus with four emotions. Both subjective and objective evaluations show the effectiveness of our approach. Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation available at http://www.jian-gao.org/emoga
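    A minimal PyTorch sketch of the decompose-and-recombine idea described above (emotion-invariant content code plus emotion-related style code). The simple GRU/linear modules, their dimensions, and the `convert` helper are placeholder assumptions and do not reproduce the paper's encoder-decoder architecture.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps a mel-spectrogram to a frame-level, emotion-invariant content code."""
    def __init__(self, n_mels=80, content_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                      # mel: [B, T, n_mels]
        content, _ = self.rnn(mel)
        return content                           # [B, T, content_dim]

class StyleEncoder(nn.Module):
    """Maps a mel-spectrogram to a single emotion-related style vector."""
    def __init__(self, n_mels=80, style_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, style_dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)
        return h[-1]                             # [B, style_dim]

class Decoder(nn.Module):
    """Reconstructs a mel-spectrogram from a content code and a style vector."""
    def __init__(self, content_dim=128, style_dim=16, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(content_dim + style_dim, n_mels)

    def forward(self, content, style):
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.proj(torch.cat([content, style], dim=-1))

def convert(src_mel, tgt_mel, content_enc, style_enc, decoder):
    """Transfer the target utterance's emotion style onto the source utterance's content."""
    return decoder(content_enc(src_mel), style_enc(tgt_mel))

# Toy usage with random mel-spectrograms.
src = torch.randn(1, 120, 80)    # source utterance (content to keep)
tgt = torch.randn(1, 90, 80)     # utterance carrying the target emotion
converted = convert(src, tgt, ContentEncoder(), StyleEncoder(), Decoder())  # [1, 120, 80]
```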

    Speech Synthesis Based on Hidden Markov Models


    Speech-Gesture Mapping and Engagement Evaluation in Human Robot Interaction

    A robot needs contextual awareness, effective speech production, and complementary non-verbal gestures for successful communication in society. In this paper, we present our end-to-end system that tries to enhance the effectiveness of non-verbal gestures. To achieve this, we identified gestures prominently used in performances by TED speakers, mapped them to their corresponding speech context, and modulated speech based on the listener's attention. The proposed method uses a Convolutional Pose Machine [4] to detect human gestures. Dominant gestures of TED speakers were used to learn the gesture-to-speech mapping, and their speeches were used to train the model. We also evaluated the robot's engagement with people by conducting a social survey. The robot monitored the effectiveness of its performance and self-improvised its speech pattern based on the audience's attention level, which was calculated from visual feedback from the camera. The effectiveness of the interaction, as well as the decisions made during improvisation, was further evaluated based on head-pose detection and an interaction survey. Comment: 8 pages, 9 figures, under review in IRC 201
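    A small, heavily simplified sketch of the visual-feedback loop described above: head-pose estimates for the audience are reduced to a scalar attention score, and the speech rate is adjusted when attention drops. The thresholds, the yaw/pitch representation, and the modulation policy are assumptions for illustration only, not the paper's method.

```python
import numpy as np

def attention_score(yaw_deg, pitch_deg, yaw_tol=30.0, pitch_tol=20.0):
    """Fraction of detected faces whose head pose is roughly oriented toward the robot."""
    facing = (np.abs(yaw_deg) < yaw_tol) & (np.abs(pitch_deg) < pitch_tol)
    return float(facing.mean()) if facing.size else 0.0

def adjust_speech_rate(base_rate, score, low_threshold=0.4):
    # Hypothetical improvisation policy: slow down when audience attention drops.
    return base_rate * (0.85 if score < low_threshold else 1.0)

# Toy frame: head poses (degrees) for three detected audience members.
yaw = np.array([5.0, -40.0, 12.0])
pitch = np.array([3.0, 10.0, -25.0])
rate = adjust_speech_rate(1.0, attention_score(yaw, pitch))
print(rate)
```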

    Detecting User Engagement in Everyday Conversations

    This paper presents a novel application of speech emotion recognition: estimating the level of conversational engagement between users of a voice communication system. We begin by using machine learning techniques, such as the support vector machine (SVM), to classify users' emotions as expressed in individual utterances. However, this alone fails to model the temporal and interactive aspects of conversational engagement. We therefore propose a multilevel structure based on coupled hidden Markov models (HMMs) to estimate engagement levels in continuous natural speech. The first level comprises SVM-based classifiers that recognize emotional states, which could be, for example, discrete emotion types or arousal/valence levels. A high-level HMM then uses these emotional states as input, estimating users' engagement in conversation by decoding the internal states of the HMM. We report experimental results obtained by applying our algorithms to the LDC Emotional Prosody and CallFriend speech corpora. Comment: 4 pages (A4), 1 figure (EPS)
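    The two-level structure described above can be sketched as follows: a trained SVM labels each utterance with an emotional state, and a small HMM over engagement levels decodes a smoothed engagement sequence from those labels. The placeholder features, HMM parameters, and two-state engagement model are assumptions; the paper's coupled-HMM formulation is richer than this single-chain Viterbi decode.

```python
import numpy as np
from sklearn.svm import SVC

def viterbi(obs, start_p, trans_p, emit_p):
    """Log-domain Viterbi decode of the most likely hidden engagement-state sequence."""
    n_states, T = trans_p.shape[0], len(obs)
    logd = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logd[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logd[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            logd[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Level 1: per-utterance emotion classification (placeholder prosodic features and labels).
rng = np.random.default_rng(0)
svm = SVC().fit(rng.normal(size=(60, 12)), rng.integers(0, 3, size=60))
utterance_features = rng.normal(size=(10, 12))       # one conversation's utterances
emotions = svm.predict(utterance_features)           # e.g. 0=neutral, 1=aroused, 2=negative

# Level 2: HMM over engagement levels {0: low, 1: high} emitting the SVM's emotion labels.
start = np.array([0.6, 0.4])
trans = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
emit = np.array([[0.7, 0.2, 0.1],                    # low engagement mostly emits neutral
                 [0.2, 0.6, 0.2]])                   # high engagement mostly emits aroused
engagement = viterbi(emotions, start, trans, emit)
print(engagement)
```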

    ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

    Emotional text-to-speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, aim only to produce emotional TTS for speakers seen during training, without considering generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/. Comment: Accepted by INTERSPEECH 202
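    The abstract does not specify the guidance method; as one plausible reading, the following PyTorch sketch shows classifier-free-guidance-style mixing of emotion-conditional and unconditional noise estimates during reverse diffusion. The DummyScore network, embedding sizes, and guidance weight are assumptions, not the ZET-Speech implementation.

```python
import torch
import torch.nn as nn

class DummyScore(nn.Module):
    """Stand-in for a diffusion acoustic model epsilon(x_t, t, speaker, emotion)."""
    def __init__(self, n_mels=80, cond_dim=64):
        super().__init__()
        self.proj = nn.Linear(n_mels + 1 + 2 * cond_dim, n_mels)

    def forward(self, x_t, t, spk, emo):
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        cond = torch.cat([spk, emo], dim=-1).unsqueeze(1).expand(-1, x_t.size(1), -1)
        return self.proj(torch.cat([x_t, t_feat, cond], dim=-1))

@torch.no_grad()
def guided_noise(model, x_t, t, spk, emo, null_emo, w=2.0):
    eps_cond = model(x_t, t, spk, emo)         # conditioned on the target emotion
    eps_uncond = model(x_t, t, spk, null_emo)  # emotion condition dropped ("null" token)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy usage: one guided noise estimate for a batch of two 100-frame mel-spectrograms.
model = DummyScore()
x_t = torch.randn(2, 100, 80)
t = torch.tensor([50, 50])
spk = torch.randn(2, 64)        # speaker embedding from a short neutral reference clip
emo = torch.randn(2, 64)        # embedding of the target emotion label
eps = guided_noise(model, x_t, t, spk, emo, torch.zeros_like(emo))
print(eps.shape)                # torch.Size([2, 100, 80])
```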