11,122 research outputs found
Style-Adaptive Speech Synthesis Techniques Using Deep Learning
Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Nam Soo Kim.
Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has shown remarkable generated-speech quality, problems remain, such as the limited modeling power of the acoustic model in neural statistical parametric speech synthesis, limited style expressiveness, and the lack of a robust attention model in end-to-end speech synthesis. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems.
In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques.
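To make the combination of a VRNN objective with adversarial training concrete, here is a minimal PyTorch-style sketch; the module interfaces (a `vrnn` returning reconstructions and a KL term, a `discriminator` scoring acoustic frames) and the weight `lambda_adv` are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def advrnn_generator_loss(vrnn, discriminator, text_feats, acoustic_feats, lambda_adv=1.0):
    """Illustrative generator-side loss: VRNN ELBO terms plus an adversarial term."""
    # Assumed interface: the VRNN returns reconstructed acoustic features and
    # the KL divergence between its approximate posterior and its prior.
    recon, kl = vrnn(text_feats, acoustic_feats)

    recon_loss = F.l1_loss(recon, acoustic_feats)   # reconstruction term
    elbo_loss = recon_loss + kl.mean()              # negative ELBO (up to constants)

    # Adversarial term: push the discriminator to label generated features as
    # real, which discourages over-smoothed acoustic trajectories.
    fake_logits = discriminator(recon)
    adv_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

    return elbo_loss + lambda_adv * adv_loss
```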
In the second approach, we propose a novel style modeling method employing the mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase target-style information and suppress text information in the style embedding by adding a MINE loss term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token Tacotron.
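A minimal sketch of the Donsker-Varadhan bound that MINE estimates, assuming a PyTorch statistics network `stat_net(x, y)` that returns one score per pair; how the resulting estimate is weighted against the target-style and text terms in the Tacotron loss is not shown here.

```python
import torch

def mine_lower_bound(stat_net, style_emb, target_feat):
    """Donsker-Varadhan lower bound on I(style_emb; target_feat), as in MINE.

    Shuffling one side of the batch yields samples from the product of the
    marginals; maximizing this bound w.r.t. stat_net tightens the MI estimate.
    """
    joint = stat_net(style_emb, target_feat)                     # scores under p(x, y)
    shuffled = target_feat[torch.randperm(target_feat.size(0))]  # break the pairing
    marginal = stat_net(style_emb, shuffled)                     # scores under p(x)p(y)
    log_mean_exp = torch.logsumexp(marginal, dim=0) - torch.log(
        torch.tensor(float(marginal.size(0))))
    return joint.mean() - log_mean_exp
```

In the setting described above, maximizing such a bound for (style embedding, target-style) pairs while penalizing it for (style embedding, text) pairs would increase style information and suppress text information in the embedding.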
In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of the LSTM gating technique, memory attention obtains stable alignments from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability.
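A rough sketch of the underlying idea of gating content-based and location-based attention features, in the spirit of LSTM gates; the layer shapes and the way the gate is computed are assumptions for illustration, not the memory attention architecture itself.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Illustrative gated fusion of content- and location-based attention features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, content_feat, location_feat):
        # content_feat, location_feat: (batch, enc_len, dim)
        g = torch.sigmoid(self.gate(torch.cat([content_feat, location_feat], dim=-1)))
        fused = g * content_feat + (1.0 - g) * location_feat   # gate decides the mix
        energies = self.score(torch.tanh(fused)).squeeze(-1)   # (batch, enc_len)
        return torch.softmax(energies, dim=-1)                 # alignment weights
```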
In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single-attention model may limit the expressiveness needed to represent the numerous alignment paths that vary with style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network. The multi-attention generates candidates for the target style, and the selection network chooses the most appropriate attention among them. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker and emotional speech synthesis.
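The selection idea can be sketched roughly as follows, assuming each candidate attention module returns alignment weights over the encoder timesteps and the selector is conditioned on a style embedding; whether the selection network makes a soft or hard choice is not specified here, so this sketch blends candidates softly.

```python
import torch
import torch.nn as nn

class SelectiveMultiAttention(nn.Module):
    """Sketch of multi-attention with a selection network (illustrative only)."""

    def __init__(self, attentions, style_dim):
        super().__init__()
        # Each attention module maps (query, memory) -> (batch, enc_len) weights.
        self.attentions = nn.ModuleList(attentions)
        self.selector = nn.Linear(style_dim, len(attentions))

    def forward(self, query, memory, style_emb):
        candidates = torch.stack(
            [attn(query, memory) for attn in self.attentions], dim=1)   # (B, K, T)
        choice = torch.softmax(self.selector(style_emb), dim=-1)        # (B, K)
        return torch.einsum('bk,bkt->bt', choice, candidates)           # blended alignment
```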
Deep learning-based speech synthesis technology has been actively developed over the past several years. Although the use of various deep learning techniques has dramatically improved synthesized speech quality, several problems still remain. Deep learning-based statistical parametric methods employ a deterministic acoustic model, which limits modeling power, while end-to-end models continue to face issues regarding style expressiveness and robust attention. This thesis proposes new alternatives that address these drawbacks of existing deep learning-based speech synthesis systems.
As the first approach, we propose the adversarially trained variational recurrent neural network (AdVRNN) to improve acoustic modeling in neural statistical parametric speech synthesis. AdVRNN applies a VRNN to speech synthesis so that the variability of speech can be modeled stochastically and in detail. In addition, an adversarial learning scheme is employed to minimize the oversmoothing problem. The proposed algorithm is confirmed to improve performance over conventional recurrent neural network-based acoustic models.
As the second approach, we propose a new mutual-information-based training scheme for style-adaptive end-to-end speech synthesis. Because conventional global style token (GST)-based style speech synthesis relies on unsupervised learning, it is difficult to train the model to focus on a desired target style. To resolve this, we propose training the model to maximize the mutual information between the GST output and the target style embedding vector. The mutual information neural estimator (MINE) is introduced to apply mutual information within the loss function of the end-to-end model, and experiments on a multi-speaker model confirm that the target style can be learned more effectively than with the conventional GST technique.
As the third approach, we present memory attention, an attention mechanism for robust end-to-end speech synthesis. The gating mechanism of long short-term memory (LSTM) has shown strong performance in sequence modeling. By applying this mechanism to attention, we propose a technique that minimizes attention breaks and repetitions even for speech with diverse styles. We verify the performance of memory attention on single-speaker and emotional speech synthesis and confirm that it yields more stable attention trajectories than conventional techniques.
As the last approach, we present an attention technique for style-adaptive end-to-end speech synthesis that uses selective multi-attention (SMA). Previous studies on style-adaptive end-to-end speech synthesis have generally used a single attention, as in the single-speaker case, but style speech requires more diverse attention patterns. We therefore propose generating candidates with multi-attention and selecting the optimal attention among them with a selection network. Comparative experiments against conventional attention confirm that SMA can express a wider range of styles more stably.
1 Introduction 1
1.1 Background 1
1.2 Scope of thesis 3
2 Neural Speech Synthesis System 7
2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7
2.2 Overview of End-to-end Speech Synthesis System 9
2.3 Tacotron2 10
2.4 Attention Mechanism 12
2.4.1 Location Sensitive Attention 12
2.4.2 Forward Attention 13
2.4.3 Dynamic Convolution Attention 14
3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17
3.1 Introduction 17
3.2 Background 19
3.2.1 Variational Autoencoder 19
3.2.2 Variational Recurrent Neural Network 20
3.3 Speech Synthesis Using AdVRNN 22
3.3.1 AdVRNN based Acoustic Modeling 23
3.3.2 Training Procedure 24
3.4 Experiments 25
3.4.1 Objective performance evaluation 28
3.4.2 Subjective performance evaluation 29
3.5 Summary 29
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31
4.1 Introduction 31
4.2 Background 33
4.2.1 Mutual Information 33
4.2.2 Mutual Information Neural Estimator 34
4.2.3 Global Style Token 34
4.3 Style Token end-to-end speech synthesis using MINE 35
4.4 Experiments 36
4.5 Summary 38
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45
5.1 Introduction 45
5.2 Background 48
5.3 Memory Attention 49
5.4 Experiments 52
5.4.1 Experiments on Single Speaker Speech Synthesis 53
5.4.2 Experiments on Emotional Speech Synthesis 56
5.5 Summary 59
6 Selective Multi-attention for Style-adaptive End-to-End Speech Synthesis 63
6.1 Introduction 63
6.2 Background 65
6.3 Selective multi-attention model 66
6.4 Experiments 67
6.4.1 Multi-speaker speech synthesis experiments 68
6.4.2 Experiments on Emotional Speech Synthesis 73
6.5 Summary 77
7 Conclusions 79
Bibliography 83
Summary (in Korean) 93
Acknowledgements 95
Affective social anthropomorphic intelligent system
Human conversational styles are measured by the sense of humor, personality,
and tone of voice. These characteristics have become essential for
conversational intelligent virtual assistants. However, most state-of-the-art
intelligent virtual assistants (IVAs) fail to interpret the affective
semantics of human voices. This research proposes an
anthropomorphic intelligent system that can hold a proper human-like
conversation with emotion and personality. A voice style transfer method is
also proposed to map the attributes of a specific emotion. Initially, the
frequency domain data (Mel-Spectrogram) is created by converting the temporal
audio wave data, which comprises discrete patterns for audio features such as
notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder is used
to predict seven different affective states from voice. The voice is also fed
in parallel to DeepSpeech, an RNN model that generates the text transcription
from the spectrogram. The transcribed text is then passed
to the multi-domain conversation agent using blended skill talk,
a transformer-based retrieve-and-generate strategy, and beam-search
decoding, and an appropriate textual response is generated. The system learns
an invertible mapping of data to a latent space that can be manipulated and
generates each Mel-spectrogram frame based on the previous frames for voice
synthesis and style transfer. Finally, the waveform is generated using
WaveGlow from the spectrogram. The outcomes of the studies we conducted on
individual models were promising. Furthermore, users who interacted with the
system provided positive feedback, demonstrating the system's effectiveness.
Comment: Multimedia Tools and Applications (2023)
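The waveform-to-Mel-spectrogram conversion described above can be sketched with librosa; the sample rate, FFT size, hop length, and number of Mel bands below are common defaults, not the values used in the paper.

```python
import librosa
import numpy as np

def wav_to_mel(path, sr=22050, n_mels=80, n_fft=1024, hop_length=256):
    """Convert a waveform file to a log-Mel spectrogram (assumed parameter values)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # log compression
```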
Controllable Singing Voice Synthesis Using Conditional Autoregressive Artificial Neural Networks
Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligence and Information, August 2022. Advisor: Kyogu Lee.
Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps reflect the creator's intentions more easily and conveniently. However, there are three challenging problems in designing such a system: 1) it should be possible to independently control the various elements that make up the singing voice; 2) it must be possible to generate high-quality sound sources; 3) it is difficult to secure sufficient training data. To deal with these problems, we first paid attention to the source-filter theory, a representative speech production modeling technique. We tried to secure training-data efficiency and controllability at the same time by modeling a singing voice as a convolution of the source, which carries pitch information, and the filter, which carries pronunciation information, and by designing a structure that can model each independently. In addition, we used a conditional autoregressive model-based deep neural network to effectively model sequential data when conditional inputs such as pronunciation, pitch, and speaker are given. For the entire framework to generate high-quality sound with a distribution closer to that of a real singing voice, an adversarial training technique was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework that reflects the actual music production process and confirmed that it can be applied to expand the limits of a creator's imagination, for example through new voice design and cross-generation.
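The source-filter decomposition referred to above can be illustrated with a classical, non-neural toy example, in which an impulse-train source at the target pitch is shaped by a filter standing in for pronunciation; this is a textbook sketch under assumed parameters, not the thesis's neural parameterization.

```python
import numpy as np
from scipy.signal import lfilter

def source_filter_frame(f0, filter_coeffs, sr=44100, frame_len=1024):
    """Toy source-filter frame: pitch drives the source, the filter shapes it."""
    period = int(sr / f0)                          # samples per glottal pulse
    source = np.zeros(frame_len)
    source[::period] = 1.0                         # impulse train at the pitch period
    return lfilter(filter_coeffs, [1.0], source)   # filter encodes pronunciation
```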
Singing voice synthesis aims to synthesize a natural singing voice from given input information. A singing synthesis system not only greatly reduces the cost of music production but also helps reflect a creator's intentions more easily and conveniently. However, designing such a system involves three challenging requirements: 1) the various elements that make up singing must be independently controllable; 2) a high level of quality and usability must be achieved; 3) sufficient training data is difficult to secure. To address these problems, we first paid attention to the source-filter theory, a representative model of speech production. By defining the singing signal as the convolution of a source corresponding to pitch information and a filter corresponding to pronunciation information, and by designing a structure that can model each independently, we sought to secure training-data efficiency and controllability at the same time. We also employed a deep neural network based on a conditional autoregressive model to effectively model sequential data when conditional inputs such as pronunciation, pitch, and speaker are given. Finally, we proposed a self-supervised style modeling technique so that unlabeled musical expressions can be modeled. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and expression while synthesizing high-quality singing that is difficult to distinguish from real singing. Furthermore, we proposed a generation and modification framework that takes the actual music production process into account and confirmed that applications that expand a creator's imagination and its limits, such as new voice design and cross-generation, are possible.
1 Introduction 1
1.1 Motivation 1
1.2 Problems in singing voice synthesis 4
1.3 Task of interest 8
1.3.1 Single-singer SVS 9
1.3.2 Multi-singer SVS 10
1.3.3 Expressive SVS 11
1.4 Contribution 11
2 Background 13
2.1 Singing voice 14
2.2 Source-filter theory 18
2.3 Autoregressive model 21
2.4 Related works 22
2.4.1 Speech synthesis 25
2.4.2 Singing voice synthesis 29
3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System 31
3.1 Introduction 31
3.2 Related work 33
3.3 Proposed method 35
3.3.1 Input representation 35
3.3.2 Mel-synthesis network 36
3.3.3 Super-resolution network 38
3.4 Experiments 42
3.4.1 Dataset 42
3.4.2 Training 42
3.4.3 Evaluation 43
3.4.4 Analysis on generated spectrogram 46
3.5 Discussion 49
3.5.1 Limitations of input representation 49
3.5.2 Advantages of using super-resolution network 53
3.6 Conclusion 55
4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System 57
4.1 Introduction 57
4.2 Related works 59
4.2.1 Multi-singer SVS system 60
4.3 Proposed Method 60
4.3.1 Singer identity encoder 62
4.3.2 Disentangling timbre & singing style 64
4.4 Experiment 64
4.4.1 Dataset and preprocessing 64
4.4.2 Training & inference 65
4.4.3 Analysis on generated spectrogram 65
4.4.4 Listening test 66
4.4.5 Timbre & style classification test 68
4.5 Discussion 70
4.5.1 Query audio selection strategy for singer identity encoder 70
4.5.2 Few-shot adaptation 72
4.6 Conclusion 74
5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder 77
5.1 Introduction 77
5.2 Related work 79
5.3 Proposed method 80
5.3.1 Local style token module 80
5.3.2 Dual-path pitch encoder 85
5.3.3 Bandwidth extension vocoder 85
5.4 Experiment 86
5.4.1 Dataset 86
5.4.2 Training 86
5.4.3 Qualitative evaluation 87
5.4.4 Dual-path reconstruction analysis 89
5.4.5 Qualitative analysis 90
5.5 Discussion 93
5.5.1 Difference between midi pitch and f0 93
5.5.2 Considerations for use in the actual music production process 94
5.6 Conclusion 95
6 Conclusion 97
6.1 Thesis summary 97
6.2 Limitations and future work 99
6.2.1 Improvements to a faster and robust system 99
6.2.2 Explainable and intuitive controllability 101
6.2.3 Extensions to common speech synthesis tools 103
6.2.4 Towards a collaborative and creative tool 104
Generative models for music using transformer architectures
This thesis focuses on the growth and impact of Transformer architectures, which are mainly used for Natural Language Processing tasks, applied here to audio generation. We think that music, with its notes, chords, and volumes, is a language. You could think of the symbolic representation of music as a human language.
A brief history of sound synthesis, which provides a basic foundation for modern AI-generated music models, is given. The most recent work in AI-generated audio is carefully studied, and instances of AI-generated music are described in many contexts. Deep learning models and their applications to real-world problems are among the key subjects covered.
The main area of interest is transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions.
Numerous studies on the various representations of music are discussed, including how neural network and deep learning techniques can be applied to symbolic melodies, musical arrangements, style transfer, and sound production.
This thesis largely focuses on transformer models, but it also recognises the importance of other AI-based generative models, including GANs.
Overall, this thesis advances generative models for music composition and provides a complete understanding of transformer design. It shows the possibilities of AI-generated sound synthesis by emphasising the most current developments.
Controllable Accented Text-to-Speech Synthesis
Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). Accented TTS synthesis
is challenging as L2 differs from L1 in terms of both phonetic
rendering and prosody pattern. Furthermore, there is no easy solution to the
control of the accent intensity in an utterance. In this work, we propose a
neural TTS architecture that allows us to control the accent and its intensity
during inference. This is achieved through three novel mechanisms: 1) an accent
variance adaptor to model the complex accent variance with three prosody
controlling factors, namely pitch, energy and duration; 2) an accent intensity
modeling strategy to quantify the accent intensity; 3) a consistency constraint
module to encourage the TTS system to render the expected accent intensity at a
fine level. Experiments show that the proposed system attains superior
performance to the baseline models in terms of accent rendering and intensity
control. To the best of our knowledge, this is the first study of accented
TTS synthesis with explicit intensity control.
Comment: To be submitted for possible journal publication
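As a rough sketch of how an accent variance adaptor with an explicit intensity control might look, following the FastSpeech 2-style variance adaptor pattern; the layer sizes and the way intensity scales the predicted pitch and energy deviations are assumptions, not the proposed architecture.

```python
import torch
import torch.nn as nn

class AccentVarianceAdaptor(nn.Module):
    """Sketch of a variance adaptor predicting pitch, energy and duration,
    scaled by an accent-intensity value in [0, 1] (illustrative only)."""

    def __init__(self, dim):
        super().__init__()
        self.pitch = nn.Linear(dim, 1)
        self.energy = nn.Linear(dim, 1)
        self.duration = nn.Linear(dim, 1)
        self.pitch_emb = nn.Linear(1, dim)
        self.energy_emb = nn.Linear(1, dim)

    def forward(self, enc, intensity):
        # enc: (batch, phones, dim); intensity: (batch, 1) accent strength
        pitch = self.pitch(enc) * intensity.unsqueeze(1)    # scale prosody deviation
        energy = self.energy(enc) * intensity.unsqueeze(1)
        duration = torch.relu(self.duration(enc))           # durations stay non-negative
        enc = enc + self.pitch_emb(pitch) + self.energy_emb(energy)
        return enc, pitch, energy, duration
```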
- …