Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. ์ฒœ์„ฑ์ค€.
Speech is one of the most useful interfaces, enabling a person to communicate with distant others while using the hands for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on the human-machine speech interface is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor.
First, conventional style control techniques that have been proposed for speech synthesis systems are introduced. To control speaker identity, emotion, accent, and prosody, we introduce control methods for both statistical parametric and deep learning-based speech synthesis systems.
We propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is well suited to learning various styles because two gates control the recurrent attention state associated with each output location. Experiments showed that GRA is more effective at transferring unseen styles, which implies that it generalizes better than conventional techniques.
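As a rough illustration of the idea above (not the thesis's exact formulation), the sketch below conditions an additive attention module on a recurrent state that is updated through two GRU-style gates at every output step. All module names, layer sizes, and the exact placement of the gates are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRecurrentAttention(nn.Module):
    # Additive attention whose per-step state is carried through two gates.
    def __init__(self, query_dim, memory_dim, attn_dim):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(memory_dim, attn_dim, bias=False)
        self.state_proj = nn.Linear(attn_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)
        # Two gates (reset r, update z) controlling the recurrent attention state.
        self.gate_r = nn.Linear(query_dim + attn_dim, attn_dim)
        self.gate_z = nn.Linear(query_dim + attn_dim, attn_dim)
        self.candidate = nn.Linear(query_dim + attn_dim, attn_dim)

    def forward(self, query, memory, state):
        # query:  (B, query_dim)     decoder state at the current output position
        # memory: (B, T, memory_dim) encoder outputs
        # state:  (B, attn_dim)      recurrent attention state from the previous step
        qs = torch.cat([query, state], dim=-1)
        r = torch.sigmoid(self.gate_r(qs))
        z = torch.sigmoid(self.gate_z(qs))
        cand = torch.tanh(self.candidate(torch.cat([query, r * state], dim=-1)))
        new_state = (1.0 - z) * state + z * cand  # gated recurrence

        # Bahdanau-style additive scoring, conditioned on the gated state.
        energies = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1)
            + self.memory_proj(memory)
            + self.state_proj(new_state).unsqueeze(1))).squeeze(-1)   # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, weights, new_state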
We propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to the auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method improves the synthesizer in terms of quality as well as controllability (a brief sketch of the upper-bound loss follows the abstract).
Speech is one of the most useful interfaces, allowing a person to communicate with someone far away while doing other things with the hands. As speech interfaces are adopted in the mobile devices, home appliances, and automobiles that most people use closely in daily life, research on the speech interface between machines and humans is growing day by day. This thesis deals with speech synthesis, the process by which a machine produces speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. This thesis proposes techniques for synthesizing speech that can express various prosody and emotions, and techniques for controlling style factor by factor so that speech in a desired style can be synthesized easily.
First, existing style control techniques proposed for speech synthesis are introduced. To synthesize natural utterances while controlling speaker identity, emotion, accent, prosody, and so on, we review techniques proposed for statistical parametric speech synthesis systems and for deep learning-based speech synthesis systems.
Next, we propose gated recurrent attention (GRA), which adds controllable recurrence to the attention mechanism that learns the relationship between two sequences and generates an output sequence according to an input sequence. For a given input, two gates let GRA control the diverse outputs that change with the output position, which makes it well suited to learning various styles. Experiments confirmed that GRA outperforms conventional techniques in naturalness and style similarity when learning and generating styles not present in the training data.
Next, we propose a technique that removes the mutual dependency among three or more style factors. We show that the inter-variable dependency of multiple control factors can be expressed as a sum of upper-bound terms and removed by minimizing this sum. Since the upper-bound estimate converges early in training and stays close to zero, adding it as a loss causes almost no performance degradation. The proposed technique is used to train a speech synthesizer on multi-lingual, multi-speaker, multi-style databases. Subjective listening tests with 15 speech experts show that the proposed technique improves not only the style controllability of the synthesizer but also the quality of the synthesized speech.
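The multivariate information minimization above relies on a sum of mutual information upper bounds; the contents list below names CLUB (a contrastive log-ratio upper bound) as the estimator. A minimal sketch of a CLUB-style bound for a single pair of latent codes follows; the Gaussian variational network, layer sizes, and the training-loop note are assumptions for illustration, and the thesis sums such terms over three or more factors.

import torch
import torch.nn as nn

class CLUBUpperBound(nn.Module):
    # Variational upper bound on I(X; Y) using a learned Gaussian q(y | x).
    def __init__(self, x_dim, y_dim, hidden_dim=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, y_dim), nn.Tanh())

    def log_likelihood(self, x, y):
        # log q(y | x) up to an additive constant; maximized to fit q.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / (2.0 * logvar.exp()) - 0.5 * logvar).sum(dim=-1)

    def forward(self, x, y):
        # Bound: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)]; the marginal term
        # is estimated by shuffling y within the mini-batch.
        positive = self.log_likelihood(x, y)
        negative = self.log_likelihood(x, y[torch.randperm(y.size(0))])
        return (positive - negative).mean()

Training would alternate two steps: fit q by maximizing log_likelihood on real (x, y) pairs, then add the (scaled) forward() estimate to the synthesizer loss so that the latent codes are pushed toward independence.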
1 Introduction 1
1.1 Evolution of Speech Synthesis Technology 1
1.2 Attention-based Speech Synthesis Systems 2
1.2.1 Tacotron 2
1.2.2 Deep Convolutional TTS 3
1.3 Non-autoregressive Speech Synthesis Systems 6
1.3.1 Glow-TTS 6
1.3.2 SpeedySpeech 8
1.4 Outline of the thesis 8
2 Style Modeling Techniques for Speech Synthesis 13
2.1 Introduction 13
2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis 14
2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis 15
2.4 Summary 17
3 Gated Recurrent Attention for Multi-Style Speech Synthesis 19
3.1 Introduction 19
3.2 Related Works 20
3.2.1 Gated recurrent unit 20
3.2.2 Location-sensitive attention 22
3.3 Gated Recurrent Attention 24
3.4 Experiments and results 28
3.4.1 Tacotron2 with global style tokens 28
3.4.2 Decaying guided attention 29
3.4.3 Datasets and feature processing 30
3.4.4 Evaluation methods 32
3.4.5 Evaluation results 33
3.5 Guided attention and decaying guided attention 34
3.6 Summary 35
4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization 41
4.1 Introduction 41
4.2 Related Works 44
4.2.1 Disentanglement Studies for Speech Synthesis 44
4.2.2 Total Correlation and Mutual Information 45
4.2.3 CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information 46
4.3 Proposed method 46
4.4 Experiments and Results 47
4.4.1 Quality and Naturalness of Speech 51
4.4.2 Speaker and style similarity 52
4.5 Summary 53
5 Conclusions 55
Bibliography 57
Abstract (in Korean) 67
Acknowledgements 69
Recommended from our members
Speaker and Expression Factorization for Audiobook Data: Expressiveness and Transplantation
Expressive synthesis from text is a challenging
problem. There are two issues. First, read text is often highly
expressive to convey the emotion and scenario in the text. Second,
since the expressive training speech is not always available for
different speakers, it is necessary to develop methods to share the
expressive information over speakers. This paper investigates the
approach of using very expressive, highly diverse audiobook data
from multiple speakers to build an expressive speech synthesis
system. Both problems are addressed by considering a
factorized framework where speaker and emotion are modelled
in separate sub-spaces of a cluster adaptive training (CAT)
parametric speech synthesis system. The sub-spaces for the
expressive state of a speaker and the characteristics of the speaker
are jointly trained using a set of audiobooks. In this work, the
expressive speech synthesis system works in two distinct modes.
In the first mode, the expressive information is given by audio
data and the adaptation method is used to extract the expressive
information in the audio data. In the second mode, the input of
the synthesis system is plain text and a full expressive synthesis
system is examined where the expressive state is predicted from
the text. In both modes, the expressive information is shared
and transplanted over different speakers. Experimental results
show that in both modes, the expressive speech synthesis method
proposed in this work significantly improves the expressiveness
of the synthetic speech for different speakers. Finally, this paper
also examines whether it is possible to predict the expressive
states from text for multiple speakers using a single model, or
whether the prediction process needs to be speaker specific.
This is the accepted manuscript. The final version is available from IEEE at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6995936&filter%3DAND%28p_IS_Number%3A7055953%29
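The factorized CAT framework described above represents each synthesis distribution mean as an interpolation of cluster means, with the interpolation weights split between a speaker sub-space and an expression sub-space. The sketch below shows only that mean-combination step, with illustrative shapes and names; it is not the paper's full parameterization or training procedure.

import numpy as np

def cat_state_mean(bias_mean, speaker_clusters, expr_clusters,
                   speaker_weights, expr_weights):
    # bias_mean:        (D,)   shared bias cluster mean
    # speaker_clusters: (P, D) cluster means spanning the speaker sub-space
    # expr_clusters:    (Q, D) cluster means spanning the expression sub-space
    # speaker_weights:  (P,)   estimated per speaker by adaptation
    # expr_weights:     (Q,)   estimated per expressive state and shared across
    #                          speakers, which is what allows transplantation
    return (bias_mean
            + speaker_weights @ speaker_clusters
            + expr_weights @ expr_clusters)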
Modelling Speech Dynamics with Trajectory-HMMs
Institute for Communicating and Collaborative Systems
The conditional independence assumption imposed by hidden Markov models
(HMMs) makes it difficult to model temporal correlation patterns in human speech.
Traditionally, this limitation is circumvented by appending the first and second-order
regression coefficients to the observation feature vectors. Although this leads to improved
performance in recognition tasks, we argue that a straightforward use of dynamic
features in HMMs will result in an inferior model, due to the incorrect handling
of dynamic constraints. In this thesis I will show that an HMM can be transformed
into a Trajectory-HMM capable of generating smoothed output mean trajectories, by
performing a per-utterance normalisation. The resulting model can be trained by either
maximising model log-likelihood or minimising mean generation errors on the training
data. To combat the exponential growth of paths in searching, the idea of delayed path
merging is proposed and a new time-synchronous decoding algorithm built on the concept
of token-passing is designed for use in the recognition task. The Trajectory-HMM
brings a new way of sharing knowledge between speech recognition and synthesis
components, by tackling both problems in a coherent statistical framework. I evaluated
the Trajectory-HMM on two different speech tasks using the speaker-dependent
MOCHA-TIMIT database. First as a generative model to recover articulatory features
from speech signal, where the Trajectory-HMM was used in a complementary way
to the conventional HMM modelling techniques, within a joint Acoustic-Articulatory
framework. Experiments indicate that the jointly trained acoustic-articulatory models
are more accurate (having a lower Root Mean Square error) than the separately trained
ones, and that Trajectory-HMM training results in greater accuracy compared with
conventional Baum-Welch parameter updating. In addition, the Root Mean Square
(RMS) training objective proves to be consistently better than the Maximum Likelihood
objective. However, experiments on the phone recognition task show that the
MLE-trained Trajectory-HMM, while retaining the attractive property of being a proper
generative model, tends to favour over-smoothed trajectories among competing hypotheses,
and does not perform better than a conventional HMM. We use this to
build an argument that models giving a better fit on training data may suffer a reduction
of discrimination by being too faithful to the training data. Finally, experiments
on using triphone models show that increasing modelling detail is an effective way to
leverage modelling performance with little added complexity in training.
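The conventional workaround mentioned at the start of this abstract appends first- and second-order regression ("delta") coefficients to each observation vector. The sketch below computes them with a common two-frame regression window; the window length and edge padding are assumptions, not necessarily the settings used in the thesis.

import numpy as np

def add_deltas(features, window=2):
    # features: (T, D) static observation vectors (e.g. mel-cepstral frames)
    features = np.asarray(features, dtype=float)
    T, _ = features.shape
    denom = 2.0 * sum(t * t for t in range(1, window + 1))

    def regress(x):
        padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
        d = np.zeros_like(x)
        for t in range(1, window + 1):
            d += t * (padded[window + t: window + t + T]
                      - padded[window - t: window - t + T])
        return d / denom

    delta = regress(features)   # first-order regression coefficients
    delta2 = regress(delta)     # second-order coefficients
    return np.concatenate([features, delta, delta2], axis=1)  # (T, 3D)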
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions that will be presented, distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, it has been confirmed that extended versions of selected papers will be published as a special issue of the journal Applied Sciences, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session.
Red Española de Tecnologías del Habla. Universidad de Valladolid
Proceedings of the 7th Sound and Music Computing Conference
Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010