Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech
Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance-speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU.
Peer Reviewed. Postprint (published version)
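The quasi-recurrent idea described in this abstract, keeping affine transformations on the feed-forward path and leaving only element-wise operations on the temporal path, can be illustrated with a minimal NumPy sketch. This is not the paper's model: the function name `qrnn_layer`, the weight names, the filter width of 1, and the "fo-pooling" gate layout are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def qrnn_layer(x, w_z, w_f, w_o):
    """Illustrative quasi-recurrent layer (assumed filter width 1, fo-pooling).

    x: array of shape (T, d_in); w_z, w_f, w_o: arrays of shape (d_in, d_hid).
    Returns hidden states of shape (T, d_hid).
    """
    # Feed-forward direction: affine maps for all time steps at once.
    z = np.tanh(x @ w_z)   # candidate values
    f = sigmoid(x @ w_f)   # forget gates
    o = sigmoid(x @ w_o)   # output gates

    # Temporal direction: element-wise pooling only, no matrix products.
    c = np.zeros(w_z.shape[1])
    h = np.empty_like(z)
    for t in range(x.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        h[t] = o[t] * c
    return h
```

Because the three matrix products do not depend on the previous hidden state, they can be computed for the whole sequence in parallel; only the cheap element-wise loop is sequential, which is the kind of saving the abstract attributes to the quasi-recurrent mechanism.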
LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices
Recent developments in speech synthesis have produced systems capable of
producing intelligible speech, but now researchers strive to create models that
more accurately mimic human voices. One such development is the incorporation
of multiple linguistic styles in various languages and accents.
HMM-based Speech Synthesis is of great interest to many researchers, due to
its ability to produce sophisticated features with small footprint. Despite
such progress, its quality has not yet reached the level of the predominant
unit-selection approaches that choose and concatenate recordings of real
speech. Recent efforts have been made in the direction of improving these
systems.
In this paper we present the application of Long Short-Term Memory Deep
Neural Networks as a postfiltering step of HMM-based speech synthesis, in order
to obtain closer spectral characteristics to those of natural speech. The
results show how HMM-voices could be improved using this approach.
Comment: 5 pages, 5 figures
Style-Adaptive Speech Synthesis Techniques Using Deep Learning
Doctoral thesis -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Kim Nam Soo.
Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has achieved remarkable speech quality, problems remain: limited modeling power in neural statistical parametric speech synthesis systems, limited style expressiveness, and the lack of a robust attention model in end-to-end speech synthesis systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems.
In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme in training AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques.
In the second approach, we propose a novel style modeling method employing the mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. A MINE loss term is added to the loss function to increase target-style information and suppress text information in the style embedding. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token-Tacotron.
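For readers unfamiliar with MINE, the estimator rests on the Donsker-Varadhan lower bound on mutual information: the mean of a statistic T over joint samples minus the log of the mean of exp(T) over samples from the product of marginals. The sketch below only evaluates that bound for given statistic values. In MINE the statistic is a neural network trained to maximize the bound; how the thesis weights or signs the resulting loss term is not stated in the abstract, so none of that is modeled here. The function name `dv_lower_bound` is hypothetical.

```python
import numpy as np


def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on mutual information.

    t_joint:    values of the statistic T on samples from the joint distribution.
    t_marginal: values of T on samples from the product of marginals
                (for example, pairs formed after shuffling one variable).
    """
    t_joint = np.asarray(t_joint, dtype=float)
    t_marginal = np.asarray(t_marginal, dtype=float)
    # Numerically stable log(mean(exp(t_marginal))).
    m = t_marginal.max()
    log_mean_exp = m + np.log(np.mean(np.exp(t_marginal - m)))
    return t_joint.mean() - log_mean_exp
```

A constant statistic gives a bound of zero, so any positive value indicates that T has found some dependence between the two variables.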
In the third approach, we propose a novel attention method called memory attention for end-to-end speech synthesis, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of the LSTM gating technique, memory attention obtains a stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability.
In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single attention model may limit the expressivity needed to represent the many alignment paths that depend on style. To achieve variation in attention alignment, we propose using a multi-attention model with a selection network. The multi-attention generates candidates for the target style, and the selection network chooses the most suitable attention among them. The experimental results show that selective multi-attention outperforms conventional single attention techniques in multi-speaker speech synthesis and emotional speech synthesis.
Korean abstract (translated):
Deep learning-based speech synthesis has been actively developed over the past few years. Although various deep learning techniques have dramatically improved synthesis quality, several problems remain. In deep learning-based statistical parametric methods, the acoustic model is deterministic, which limits its modeling capacity; in end-to-end models, style expressiveness and robust attention remain open issues. This thesis proposes new alternatives that address these shortcomings of existing deep learning-based speech synthesis systems.
As the first approach, we propose the adversarially trained variational recurrent neural network (AdVRNN) to improve acoustic modeling in neural statistical parametric synthesis. AdVRNN applies the VRNN to speech synthesis so that variation in speech can be modeled stochastically and in detail. Adversarial learning is also used to minimize the oversmoothing problem. The proposed algorithm was confirmed to improve performance over conventional recurrent neural network-based acoustic models.
As the second approach, we propose a new mutual information-based training method for style-adaptive end-to-end speech synthesis. Conventional global style token (GST)-based style synthesis relies on unsupervised learning, so it is difficult to focus training on a desired target style even when one is available. To address this, we propose training the model to maximize the mutual information between the GST output and the target style embedding vector. To apply mutual information in the loss function of the end-to-end model, we adopt the mutual information neural estimator (MINE), and we confirmed with a multi-speaker model that the target style can be learned in a more focused way than with the conventional GST method.
As the third approach, we propose memory attention, an attention method for robust end-to-end speech synthesis. The gating technique of long short-term memory (LSTM) has shown high performance in sequence modeling. Applying it to attention, we propose a method that minimizes attention breaks, repetitions, and similar failures even for speech with diverse styles. We verified memory attention on single-speaker and emotional speech synthesis and confirmed that it yields more stable attention alignments than conventional methods.
As the last approach, we propose an attention method for style-adaptive end-to-end speech synthesis based on selective multi-attention (SMA). Previous work on style-adaptive end-to-end synthesis has used a single attention, as in the case of a single speaker with a read-speech style. Stylized speech, however, requires more diverse attention behavior. We therefore propose generating candidates with multi-attention and choosing the best attention with a selection network. Comparative experiments against conventional attention confirmed that SMA can express more styles stably.
1 Introduction
1.1 Background
1.2 Scope of thesis
2 Neural Speech Synthesis System
2.1 Overview of a Neural Statistical Parametric Speech Synthesis System
2.2 Overview of End-to-end Speech Synthesis System
2.3 Tacotron2
2.4 Attention Mechanism
2.4.1 Location Sensitive Attention
2.4.2 Forward Attention
2.4.3 Dynamic Convolution Attention
3 Neural Statistical Parametric Speech Synthesis using AdVRNN
3.1 Introduction
3.2 Background
3.2.1 Variational Autoencoder
3.2.2 Variational Recurrent Neural Network
3.3 Speech Synthesis Using AdVRNN
3.3.1 AdVRNN based Acoustic Modeling
3.3.2 Training Procedure
3.4 Experiments
3.4.1 Objective performance evaluation
3.4.2 Subjective performance evaluation
3.5 Summary
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis
4.1 Introduction
4.2 Background
4.2.1 Mutual Information
4.2.2 Mutual Information Neural Estimator
4.2.3 Global Style Token
4.3 Style Token end-to-end speech synthesis using MINE
4.4 Experiments
4.5 Summary
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis
5.1 Introduction
5.2 Background
5.3 Memory Attention
5.4 Experiments
5.4.1 Experiments on Single Speaker Speech Synthesis
5.4.2 Experiments on Emotional Speech Synthesis
5.5 Summary
6 Selective Multi-attention for Style-adaptive End-to-End Speech Synthesis
6.1 Introduction
6.2 Background
6.3 Selective multi-attention model
6.4 Experiments
6.4.1 Multi-speaker speech synthesis experiments
6.4.2 Experiments on Emotional Speech Synthesis
6.5 Summary
7 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such
as a text analysis frontend, an acoustic model and an audio synthesis module.
Building these components often requires extensive domain expertise and may
contain brittle design choices. In this paper, we present Tacotron, an
end-to-end generative text-to-speech model that synthesizes speech directly
from characters. Given <text, audio> pairs, the model can be trained completely
from scratch with random initialization. We present several key techniques to
make the sequence-to-sequence framework perform well for this challenging task.
Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English,
outperforming a production parametric system in terms of naturalness. In
addition, since Tacotron generates speech at the frame level, it's
substantially faster than sample-level autoregressive methods.
Comment: Submitted to Interspeech 2017. v2 changed paper title to be
consistent with our conference submission (no content change other than typo
fixes)
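As background on the reported figure, a mean opinion score on a 5-point scale is the arithmetic mean of listener ratings, usually quoted with a confidence interval. The sketch below is a generic illustration, not the evaluation protocol used for Tacotron: the function name `mean_opinion_score` and the normal-approximation 95% interval (z = 1.96) are assumptions.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings: listener scores on a 1-5 scale; at least two are required
    so that the sample standard deviation is defined.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width
```

Calling `mean_opinion_score([3, 4, 5])` returns a MOS of 4.0 together with the interval half-width for those three ratings; real listening tests use far more ratings per system.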