2,473 research outputs found

    ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ๊น€๋‚จ์ˆ˜.The neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has shown remarkable generated speech quality, there are still remaining problems such as modeling power in a neural statistical parametric speech synthesis system, style expressiveness, and robust attention model in the end-to-end speech synthesis system. In this thesis, novel alternatives are proposed to resolve these drawbacks of the conventional neural speech synthesis system. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. Also, we apply an adversarial learning scheme in training AdVRNN to overcome the oversmoothing problem. From the experimental results, we have found that the proposed AdVRNN based method outperforms the conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in style embedding by applying MINE loss term in the loss function. The experimental results show that the MINE-based method has shown promising performance in both speech quality and style similarity for the global style token-Tacotron. In the third approach, we propose a novel attention method called memory attention for end-to-end speech synthesis, which is inspired by the gating mechanism of long-short term memory (LSTM). Leveraging the gating technique's sequence modeling power in LSTM, memory attention obtains the stable alignment from the content-based and location-based features. We evaluate the memory attention and compare its performance with various conventional attention techniques in single speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can generate speech with large variability robustly. In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. The conventional single attention model may limit the expressivity representing numerous alignment paths depending on style. To achieve a variation in attention alignment, we propose using a multi-attention model with a selection network. The multi-attention plays a role in generating candidates for the target style, and the selection network choose the most proper attention among the multi-attention. The experimental results show that selective multi-attention outperforms the conventional single attention techniques in multi-speaker speech synthesis and emotional speech synthesis.๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ์ˆ ์€ ์ง€๋‚œ ๋ช‡ ๋…„๊ฐ„ ํš”๋ฐœํ•˜๊ฒŒ ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ๋‹ค. ๋”ฅ๋Ÿฌ๋‹์˜ ๋‹ค์–‘ํ•œ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์Œ์„ฑ ํ•ฉ์„ฑ ํ’ˆ์งˆ์€ ๋น„์•ฝ์ ์œผ๋กœ ๋ฐœ์ „ํ–ˆ์ง€๋งŒ, ์•„์ง ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ํ•ฉ์„ฑ์—๋Š” ์—ฌ๋Ÿฌ ๋ฌธ์ œ๊ฐ€ ์กด์žฌํ•œ๋‹ค. 
๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ ์Œํ–ฅ ๋ชจ๋ธ์˜ deterministicํ•œ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง ๋Šฅ๋ ฅ์˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์Šคํƒ€์ผ์„ ํ‘œํ˜„ํ•˜๋Š” ๋Šฅ๋ ฅ๊ณผ ๊ฐ•์ธํ•œ ์–ดํ…์…˜(attention)์— ๋Œ€ํ•œ ์ด์Šˆ๊ฐ€ ๋Š์ž„์—†์ด ์žฌ๊ธฐ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•  ์ƒˆ๋กœ์šด ๋Œ€์•ˆ์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๋‰ด๋Ÿด ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐฉ์‹์˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง์„ ๊ณ ๋„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ adversarially trained variational recurrent neural network (AdVRNN) ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. AdVRNN ๊ธฐ๋ฒ•์€ VRNN์„ ์Œ์„ฑ ํ•ฉ์„ฑ์— ์ ์šฉํ•˜์—ฌ ์Œ์„ฑ์˜ ๋ณ€ํ™”๋ฅผ stochastic ํ•˜๊ณ  ์ž์„ธํ•˜๊ฒŒ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ, ์ ๋Œ€์  ํ•™์Šต์ (adversarial learning) ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ oversmoothing ๋ฌธ์ œ๋ฅผ ์ตœ์†Œํ™” ์‹œํ‚ค๋„๋ก ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด์˜ ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ์œ„ํ•œ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ global style token(GST) ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ, ๋น„์ง€๋„ ํ•™์Šต์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์›ํ•˜๋Š” ๋ชฉํ‘œ ์Šคํƒ€์ผ์ด ์žˆ์–ด๋„ ์ด๋ฅผ ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ค๊ธฐ ์–ด๋ ค์› ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด GST์˜ ์ถœ๋ ฅ๊ณผ ๋ชฉํ‘œ ์Šคํƒ€์ผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ตœ๋Œ€ํ™” ํ•˜๋„๋ก ํ•™์Šต ์‹œํ‚ค๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ์†์‹คํ•จ์ˆ˜์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ mutual information neural estimator(MINE) ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•˜์˜€๊ณ  ๋‹คํ™”์ž ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ธฐ์กด์˜ GST ๊ธฐ๋ฒ•์— ๋น„ํ•ด ๋ชฉํ‘œ ์Šคํƒ€์ผ์„ ๋ณด๋‹ค ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์„ธ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๊ฐ•์ธํ•œ ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์–ดํ…์…˜์ธ memory attention์„ ์ œ์•ˆํ•œ๋‹ค. Long-short term memory(LSTM)์˜ gating ๊ธฐ์ˆ ์€ sequence๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š”๋ฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์™”๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ์–ดํ…์…˜์— ์ ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ๊ฐ€์ง„ ์Œ์„ฑ์—์„œ๋„ ์–ดํ…์…˜์˜ ๋Š๊น€, ๋ฐ˜๋ณต ๋“ฑ์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋‹จ์ผ ํ™”์ž์™€ ๊ฐ์ • ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ํ† ๋Œ€๋กœ memory attention์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜์˜€์œผ๋ฉฐ ๊ธฐ์กด ๊ธฐ๋ฒ• ๋Œ€๋น„ ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ณก์„ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, selective multi-attention (SMA)์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ์–ดํ…์…˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‚ญ๋…์ฒด ๋‹จ์ผํ™”์ž์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์€ ๋‹จ์ผ ์–ดํ…์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์™”๋‹ค. ํ•˜์ง€๋งŒ ์Šคํƒ€์ผ ์Œ์„ฑ์˜ ๊ฒฝ์šฐ ๋ณด๋‹ค ๋‹ค์–‘ํ•œ ์–ดํ…์…˜ ํ‘œํ˜„์„ ์š”๊ตฌํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘ ์–ดํ…์…˜์„ ํ™œ์šฉํ•˜์—ฌ ํ›„๋ณด๋“ค์„ ์ƒ์„ฑํ•˜๊ณ  ์ด๋ฅผ ์„ ํƒ ๋„คํŠธ์›Œํฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ์˜ ์–ดํ…์…˜์„ ์„ ํƒํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. 
SMA ๊ธฐ๋ฒ•์€ ๊ธฐ์กด์˜ ์–ดํ…์…˜๊ณผ์˜ ๋น„๊ต ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ๋ณด๋‹ค ๋งŽ์€ ์Šคํƒ€์ผ์„ ์•ˆ์ •์ ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค.1 Introduction 1 1.1 Background 1 1.2 Scope of thesis 3 2 Neural Speech Synthesis System 7 2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7 2.2 Overview of End-to-end Speech Synthesis System 9 2.3 Tacotron2 10 2.4 Attention Mechanism 12 2.4.1 Location Sensitive Attention 12 2.4.2 Forward Attention 13 2.4.3 Dynamic Convolution Attention 14 3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17 3.1 Introduction 17 3.2 Background 19 3.2.1 Variational Autoencoder 19 3.2.2 Variational Recurrent Neural Network 20 3.3 Speech Synthesis Using AdVRNN 22 3.3.1 AdVRNN based Acoustic Modeling 23 3.3.2 Training Procedure 24 3.4 Experiments 25 3.4.1 Objective performance evaluation 28 3.4.2 Subjective performance evaluation 29 3.5 Summary 29 4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31 4.1 Introduction 31 4.2 Background 33 4.2.1 Mutual Information 33 4.2.2 Mutual Information Neural Estimator 34 4.2.3 Global Style Token 34 4.3 Style Token end-to-end speech synthesis using MINE 35 4.4 Experiments 36 4.5 Summary 38 5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45 5.1 Introduction 45 5.2 BACKGROUND 48 5.3 Memory Attention 49 5.4 Experiments 52 5.4.1 Experiments on Single Speaker Speech Synthesis 53 5.4.2 Experiments on Emotional Speech Synthesis 56 5.5 Summary 59 6 Selective Multi-attention for style-adaptive end-to-End Speech Syn-thesis 63 6.1 Introduction 63 6.2 BACKGROUND 65 6.3 Selective multi-attention model 66 6.4 EXPERIMENTS 67 6.4.1 Multi-speaker speech synthesis experiments 68 6.4.2 Experiments on Emotional Speech Synthesis 73 6.5 Summary 77 7 Conclusions 79 Bibliography 83 ์š”์•ฝ 93 ๊ฐ์‚ฌ์˜ ๊ธ€ 95Docto

    Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources

    [ES] En los รบltimos aรฑos, el aprendizaje profundo ha cambiado significativamente el panorama en diversas รกreas del campo de la inteligencia artificial, entre las que se incluyen la visiรณn por computador, el procesamiento del lenguaje natural, robรณtica o teorรญa de juegos. En particular, el sorprendente รฉxito del aprendizaje profundo en mรบltiples aplicaciones del campo del procesamiento del lenguaje natural tales como el reconocimiento automรกtico del habla (ASR), la traducciรณn automรกtica (MT) o la sรญntesis de voz (TTS), ha supuesto una mejora drรกstica en la precisiรณn de estos sistemas, extendiendo asรญ su implantaciรณn a un mayor rango de aplicaciones en la vida real. En este momento, es evidente que las tecnologรญas de reconocimiento automรกtico del habla y traducciรณn automรกtica pueden ser empleadas para producir, de forma efectiva, subtรญtulos multilingรผes de alta calidad de contenidos audiovisuales. Esto es particularmente cierto en el contexto de los vรญdeos educativos, donde las condiciones acรบsticas son normalmente favorables para los sistemas de ASR y el discurso estรก gramaticalmente bien formado. Sin embargo, en el caso de TTS, aunque los sistemas basados en redes neuronales han demostrado ser capaces de sintetizar voz de un realismo y calidad sin precedentes, todavรญa debe comprobarse si esta tecnologรญa estรก lo suficientemente madura como para mejorar la accesibilidad y la participaciรณn en el aprendizaje en lรญnea. Ademรกs, existen diversas tareas en el campo de la sรญntesis de voz que todavรญa suponen un reto, como la clonaciรณn de voz inter-lingรผe, la sรญntesis incremental o la adaptaciรณn zero-shot a nuevos locutores. Esta tesis aborda la mejora de las prestaciones de los sistemas actuales de sรญntesis de voz basados en redes neuronales, asรญ como la extensiรณn de su aplicaciรณn en diversos escenarios, en el contexto de mejorar la accesibilidad en el aprendizaje en lรญnea. En este sentido, este trabajo presta especial atenciรณn a la adaptaciรณn a nuevos locutores y a la clonaciรณn de voz inter-lingรผe, ya que los textos a sintetizar se corresponden, en este caso, a traducciones de intervenciones originalmente en otro idioma.[CA] Durant aquests darrers anys, l'aprenentatge profund ha canviat significativament el panorama en diverses ร rees del camp de la intelยทligรจncia artificial, entre les quals s'inclouen la visiรณ per computador, el processament del llenguatge natural, robรฒtica o la teoria de jocs. En particular, el sorprenent รจxit de l'aprenentatge profund en mรบltiples aplicacions del camp del processament del llenguatge natural, com ara el reconeixement automร tic de la parla (ASR), la traducciรณ automร tica (MT) o la sรญntesi de veu (TTS), ha suposat una millora drร stica en la precisiรณ i qualitat d'aquests sistemes, estenent aixรญ la seva implantaciรณ a un ventall mรฉs ampli a la vida real. En aquest moment, รฉs evident que les tecnologies de reconeixement automร tic de la parla i traducciรณ automร tica poden ser emprades per a produir, de forma efectiva, subtรญtols multilingรผes d'alta qualitat de continguts audiovisuals. Aixรฒ รฉs particularment cert en el context dels vรญdeos educatius, on les condicions acรบstiques sรณn normalment favorables per als sistemes d'ASR i el discurs estร  gramaticalment ben format. 
No obstant aixรฒ, al cas de TTS, encara que els sistemes basats en xarxes neuronals han demostrat ser capaรงos de sintetitzar veu d'un realisme i qualitat sense precedents, encara s'ha de comprovar si aquesta tecnologia รฉs ja prou madura com per millorar l'accessibilitat i la participaciรณ en l'aprenentatge en lรญnia. A mรฉs, hi ha diverses tasques al camp de la sรญntesi de veu que encara suposen un repte, com ara la clonaciรณ de veu inter-lingรผe, la sรญntesi incremental o l'adaptaciรณ zero-shot a nous locutors. Aquesta tesi aborda la millora de les prestacions dels sistemes actuals de sรญntesi de veu basats en xarxes neuronals, aixรญ com l'extensiรณ de la seva aplicaciรณ en diversos escenaris, en el context de millorar l'accessibilitat en l'aprenentatge en lรญnia. En aquest sentit, aquest treball presta especial atenciรณ a l'adaptaciรณ a nous locutors i a la clonaciรณ de veu interlingรผe, ja que els textos a sintetitzar es corresponen, en aquest cas, a traduccions d'intervencions originalment en un altre idioma.[EN] In recent years, deep learning has fundamentally changed the landscapes of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a large variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be utilized to produce cost-effective, high-quality multilingual subtitles of video contents of different kinds. This is particularly true in the case of transcription and translation of video lectures and other kinds of educational materials, in which the audio recording conditions are usually favorable for the ASR task, and there is a grammatically well-formed speech. However, although state-of-the-art neural approaches to TTS have shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is already mature enough to improve accessibility and engagement in online learning, and particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS or zero-shot speaker adaptation remain an open challenge in the field. This thesis is about enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, both in offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Thus, particular emphasis is placed on speaker adaptation and cross-lingual voice cloning, as the input text corresponds to a translated utterance in this context.Pรฉrez Gonzรกlez De Martos, AM. (2022). Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources [Tesis doctoral]. Universitat Politรจcnica de Valรจncia. https://doi.org/10.4995/Thesis/10251/184019TESISPremios Extraordinarios de tesis doctorale

    ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜๊ณผ ๋‹ค๋ณ€์ˆ˜ ์ •๋ณด ์ตœ์†Œํ™”

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021.8. ์ฒœ์„ฑ์ค€.Speech is one the most useful interface that enables a person to communicate with distant others while using hands for other tasks. With the growing usage of speech interfaces in mobile devices, home appliances, and automobiles, the research on human-machine speech interface is expanding. This thesis deals with the speech synthesis which enable machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor-by-factor. First, the conventional style control techniques which have proposed for speech synthesis systems are introduced. In order to control speaker identity, emotion, accent, prosody, we introduce the control method both for statistical parametric-based and deep learning-based speech synthesis systems. We propose a gated recurrent attention (GRA), a novel attention mechanism with a controllable gated recurence. GRA is suitable for learning various styles because it can control the recurrent state for attention corresponds to the location with two gates. By experiments, GRA was found to be more effective in transferring unseen styles, which implies that the GRA outperform in generalization to conventional techniques. We propose a multivariate information minimization method which disentangle three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate the proposed method can improve the synthesizer in terms of quality as well as controllability.์Œ์„ฑ์€ ์‚ฌ๋žŒ์ด ์†์œผ๋กœ ๋‹ค๋ฅธ ์ผ์„ ํ•˜๋ฉด์„œ๋„, ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ์ƒ๋Œ€์™€ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์žฅ ์œ ์šฉํ•œ ์ธํ„ฐํŽ˜์ด์Šค ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ์‚ฌ๋žŒ์ด ์ƒํ™œ์—์„œ ๋ฐ€์ ‘ํ•˜๊ฒŒ ์ ‘ํ•˜๋Š” ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ, ๊ฐ€์ „, ์ž๋™์ฐจ ๋“ฑ์—์„œ ์Œ์„ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ํ™œ์šฉํ•˜๊ฒŒ ๋˜๋ฉด์„œ, ๊ธฐ๊ณ„์™€ ์‚ฌ๋žŒ ๊ฐ„์˜ ์Œ์„ฑ ์ธํ„ฐํŽ˜์ด์Šค์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ๋‚ ๋กœ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ๊ณ„๊ฐ€ ์Œ์„ฑ์„ ๋งŒ๋“œ๋Š” ๊ณผ์ •์ธ ์Œ์„ฑ ํ•ฉ์„ฑ์„ ๋‹ค๋ฃฌ๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์ด ์ ์šฉ๋˜๋ฉด์„œ ํ•ฉ์„ฑ๋œ ์Œ์„ฑ์˜ ํ’ˆ์งˆ์€ ์‚ฌ๋žŒ์˜ ์Œ์„ฑ๊ณผ ์œ ์‚ฌํ•ด์กŒ์ง€๋งŒ, ์ž์—ฐ์Šค๋Ÿฌ์šด ์Šคํƒ€์ผ์˜ ์ œ์–ด๋Š” ์•„์ง๋„ ๋„์ „์ ์ธ ๊ณผ์ œ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์šด์œจ๊ณผ ๊ฐ์ •์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ฒ•๋“ค์„ ์ œ์•ˆํ•˜๋ฉฐ, ์Šคํƒ€์ผ์„ ์š”์†Œ๋ณ„๋กœ ์ œ์–ดํ•˜์—ฌ ์†์‰ฝ๊ฒŒ ์›ํ•˜๋Š” ์Šคํƒ€์ผ์˜ ์Œ์„ฑ์„ ํ•ฉ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋จผ์ € ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ์กด ์Šคํƒ€์ผ ์ œ์–ด ๊ธฐ๋ฒ•๋“ค์„ ์†Œ๊ฐœํ•œ๋‹ค. ํ™”์ž, ๊ฐ์ •, ๋งํˆฌ๋‚˜, ์Œ์šด ๋“ฑ์„ ์ œ์–ดํ•˜๋ฉด์„œ๋„ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฐœํ™”๋ฅผ ํ•ฉ์„ฑํ•˜๊ณ ์ž ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•๋“ค๊ณผ, ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์„ ์œ„ํ•ด ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. 
๋‹ค์Œ์œผ๋กœ ๋‘ ์‹œํ€€์Šค(sequence) ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜์—ฌ, ์ž…๋ ฅ ์‹œํ€€์Šค์— ๋”ฐ๋ผ ์ถœ๋ ฅ ์‹œํ€€์Šค๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์–ดํ…์…˜(attention) ๊ธฐ๋ฒ•์— ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์žฌ๊ท€์„ฑ์„ ์ถ”๊ฐ€ํ•œ ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜(Gated Recurrent Attention) ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ์ผ์ •ํ•œ ์ž…๋ ฅ์— ๋Œ€ํ•ด ์ถœ๋ ฅ ์œ„์น˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๋Š” ๋‹ค์–‘ํ•œ ์ถœ๋ ฅ์„ ๋‘ ๊ฐœ์˜ ๊ฒŒ์ดํŠธ๋ฅผ ํ†ตํ•ด ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์ ํ•ฉํ•˜๋‹ค. ๊ฒŒ์ดํŠธ ์žฌ๊ท€ ์–ดํ…์…˜์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์—†์—ˆ๋˜ ์Šคํƒ€์ผ์„ ํ•™์Šตํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š”๋ฐ ์žˆ์–ด ๊ธฐ์กด ๊ธฐ๋ฒ•์— ๋น„ํ•ด ์ž์—ฐ์Šค๋Ÿฌ์›€์ด๋‚˜ ์Šคํƒ€์ผ ์œ ์‚ฌ๋„ ๋ฉด์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์„ ์‹คํ—˜์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‹ค์Œ์œผ๋กœ ์„ธ ๊ฐœ ์ด์ƒ์˜ ์Šคํƒ€์ผ ์š”์†Œ๋“ค์˜ ์ƒํ˜ธ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์—ฌ๋Ÿฌ๊ฐœ์˜ ์ œ์–ด ์š”์†Œ๋“ค(factors)์„ ๋ณ€์ˆ˜๊ฐ„ ์ƒํ˜ธ์˜์กด์„ฑ ์ƒํ•œ ํ•ญ๋“ค์˜ ํ•ฉ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๊ณ , ์ด๋ฅผ ์ตœ์†Œํ™”ํ•˜์—ฌ ์˜์กด์„ฑ์„ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค. ์ด ์ƒํ•œ ์ถ”์ •์น˜๋Š” ํ•™์Šต ์ดˆ๊ธฐ์— ์ˆ˜๋ ดํ•˜์—ฌ 0์— ๊ฐ€๊น๊ฒŒ ์œ ์ง€๋˜๊ธฐ ๋•Œ๋ฌธ์—, ์†์‹คํ•จ์ˆ˜๋ฅผ ๋”ํ•จ์œผ๋กœ์จ ์ƒ๊ธฐ๋Š” ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๊ฑฐ์˜ ์—†๋‹ค. ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์€ ๋‹ค์–ธ์–ด, ๋‹คํ™”์ž, ์Šคํƒ€์ผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋กœ ์Œ์„ฑํ•ฉ์„ฑ๊ธฐ๋ฅผ ํ•™์Šตํ•˜๋Š”๋ฐ ํ™œ์šฉ๋œ๋‹ค. 15๋ช…์˜ ์Œ์„ฑ ์ „๋ฌธ๊ฐ€๋“ค์˜ ์ฃผ๊ด€์ ์ธ ๋“ฃ๊ธฐ ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•˜๋Š” ๊ธฐ๋ฒ•์ด ํ•ฉ์„ฑ๊ธฐ์˜ ์Šคํƒ€์ผ ์ œ์–ด๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ผ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ•ฉ์„ฑ์Œ์˜ ํ’ˆ์งˆ๊นŒ์ง€ ๋†’์ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.1 Introduction 1 1.1 Evolution of Speech Synthesis Technology 1 1.2 Attention-based Speech Synthesis Systems 2 1.2.1 Tacotron 2 1.2.2 Deep Convolutional TTS 3 1.3 Non-autoregressive Speech Synthesis Systems 6 1.3.1 Glow-TTS 6 1.3.2 SpeedySpeech 8 1.4 Outline of the thesis 8 2 Style Modeling Techniques for Speech Synthesis 13 2.1 Introduction 13 2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis 14 2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis 15 2.4 Summary 17 3 Gated Recurrent Attention for Multi-Style Speech Synthesis 19 3.1 Introduction 19 3.2 Related Works 20 3.2.1 Gated recurrent unit 20 3.2.2 Location-sensitive attention 22 3.3 Gated Recurrent Attention 24 3.4 Experiments and results 28 3.4.1 Tacotron2 with global style tokens 28 3.4.2 Decaying guided attention 29 3.4.3 Datasets and feature processing 30 3.4.4 Evaluation methods 32 3.4.5 Evaluation results 33 3.5 Guided attention and decaying guided attention 34 3.6 Summary 35 4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization 41 4.1 Introduction 41 4.2 Related Works 44 4.2.1 Disentanglement Studies for Speech Synthesis 44 4.2.2 Total Correlation and Mutual Information 45 4.2.3 CLUB:A Contrastive Log-ratio Upper Bound of Mutual Information 46 4.3 Proposed method 46 4.4 Experiments and Results 47 4.4.1 Quality and Naturalness of Speech 51 4.4.2 Speaker and style similarity 52 4.5 Summary 53 5 Conclusions 55 Bibliography 57 ์ดˆ ๋ก 67 ๊ฐ์‚ฌ์˜ ๊ธ€ 69๋ฐ•

    He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues

    In this work, we define a new style transfer task: perspective shift, which reframes a dialogue from an informal first-person exchange into a formal third-person rephrasing of the text. This task requires challenging coreference resolution, emotion attribution, and interpretation of informal text. We explore several baseline approaches and discuss further directions for this task as applied to short dialogues. As a sample application, we demonstrate that applying perspective shifting to a dialogue summarization dataset (SAMSum) substantially improves the zero-shot performance of extractive news summarization models on this data. Additionally, supervised extractive models perform better when trained on perspective-shifted data than on the original dialogues. We release our code publicly.
    Comment: Findings of EMNLP 2022, 18 pages
    • โ€ฆ
    corecore