154 research outputs found
A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
This paper introduces a deep neural network model for a subband-based speech
synthesizer. The model exploits the short bandwidth of the subband signals
to reduce the complexity of the time-domain speech generator. We employ
multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into
subbands in the time domain. Inspired by WaveNet, a convolutional neural
network (CNN) model predicts the subband speech signals fully in the time
domain. Because of the short bandwidth of the subbands, a simple network
architecture is enough to learn the simple patterns of the subbands accurately.
In ground-truth experiments with teacher forcing, the subband synthesizer
significantly outperforms the fullband model in terms of both subjective and
objective measures. In addition, by conditioning the model on the phoneme
sequence using a pronunciation dictionary, we achieve a fully time-domain
neural model for a subband-based text-to-speech (TTS) synthesizer, which is
nearly end-to-end. The generated speech of the subband TTS shows quality
comparable to the fullband one with a lighter network architecture for each
subband. Comment: 5 pages, 3 figures
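The wavelet analysis/synthesis step above can be illustrated with a single-level Haar filter bank; this is a minimal NumPy sketch, not the authors' exact filter bank, and a multi-level decomposition would recurse on the low band.

```python
import numpy as np

def haar_analysis(x):
    """Split a signal into low-pass (approximation) and high-pass
    (detail) subbands, each at half the original rate."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)  # detail coefficients
    return low, high

def haar_synthesis(low, high):
    """Perfectly reconstruct the signal from the two subbands."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

x = np.sin(2 * np.pi * 5 * np.arange(64) / 64)  # toy "speech" frame
low, high = haar_analysis(x)
x_rec = haar_synthesis(low, high)
print(np.allclose(x, x_rec))  # → True (perfect reconstruction)
```

Each subband carries half the bandwidth of the input, which is the property the paper's per-subband generators rely on.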
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is becoming increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
Style-Adaptive Speech Synthesis Techniques Using Deep Learning
Thesis (Ph.D.) -- Graduate School of Seoul National University: Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Nam Soo Kim.
Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has achieved remarkable generated speech quality, problems remain, such as the limited modeling power of neural statistical parametric speech synthesis systems, and the limited style expressiveness and lack of a robust attention model in end-to-end speech synthesis systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems.
In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training the AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques.
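The abstract does not spell out the adversarial objective used against oversmoothing; as one common instantiation, a least-squares GAN loss added to the acoustic model's usual training criterion might look like the following sketch (the scores, and the idea of a lambda-weighted combination with the VRNN's ELBO-style loss, are assumptions for illustration).

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives: the discriminator pushes scores on
    real acoustic frames toward 1 and on generated frames toward 0,
    while the generator pushes its own scores toward 1."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

# Scores a discriminator might emit early in training (made up).
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.2, 0.1, 0.3])
d_loss, g_loss = lsgan_losses(d_real, d_fake)
# The generator's total objective would add g_loss, weighted by some
# lambda, to the VRNN's reconstruction + KL terms.
```

Penalizing the generator for frames the discriminator recognizes as synthetic discourages the over-averaged spectra that plain likelihood training tends to produce.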
In the second approach, we propose a novel style modeling method employing a mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase the target-style information and suppress the text information in the style embedding by adding a MINE loss term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token Tacotron.
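MINE estimates mutual information by maximizing the Donsker-Varadhan lower bound with a learned statistics network. The sketch below evaluates that bound on toy Gaussian data with a fixed, hand-picked statistic T; the neural network for T, the style embeddings, and the loss weighting from the thesis are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_bound(x, y, T):
    """Donsker-Varadhan lower bound on I(X;Y) that MINE maximizes:
    E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)]."""
    joint = T(x, y)                        # samples from the joint
    marginal = T(x, rng.permutation(y))    # shuffling y breaks dependence
    return joint.mean() - np.log(np.exp(marginal).mean())

n = 50000
x = rng.normal(size=n)
y_dep = x + 0.5 * rng.normal(size=n)       # strongly dependent pair
y_ind = rng.normal(size=n)                 # independent pair
T = lambda a, b: 0.3 * a * b               # hand-picked statistic
# The bound is clearly larger for the dependent pair.
print(dv_bound(x, y_dep, T) > dv_bound(x, y_ind, T))  # → True
```

In the thesis's setting, maximizing this bound between the style embedding and the target style (and penalizing it against the text) shapes what the embedding encodes.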
In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of the LSTM gating technique, memory attention obtains a stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability.
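The gating idea can be sketched as a sigmoid gate blending content-based and location-based scores before the softmax; this is an illustration of the mechanism, not the thesis's exact memory-attention equations, and the scores and gate logit here are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_attention(content_score, location_score, gate_logit):
    """Blend content-based and location-based attention scores with an
    LSTM-style sigmoid gate, then normalize to an alignment."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))      # gate in (0, 1)
    return softmax(g * content_score + (1.0 - g) * location_score)

content = np.array([0.1, 2.0, 0.3, 0.1])       # favors position 1
location = np.array([0.0, 0.5, 3.0, 0.2])      # favors position 2
w = gated_attention(content, location, gate_logit=4.0)  # gate ~ 0.98
print(w.argmax())  # → 1: with the gate nearly open, content wins
```

In a full model the gate logit would itself be predicted at each decoder step, letting the network decide how much to trust each feature type.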
In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single-attention model may limit the expressivity needed to represent the numerous alignment paths that depend on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network. The multi-attention generates alignment candidates for the target style, and the selection network chooses the most appropriate attention among them. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker and emotional speech synthesis.
1 Introduction 1
1.1 Background 1
1.2 Scope of thesis 3
2 Neural Speech Synthesis System 7
2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7
2.2 Overview of End-to-end Speech Synthesis System 9
2.3 Tacotron2 10
2.4 Attention Mechanism 12
2.4.1 Location Sensitive Attention 12
2.4.2 Forward Attention 13
2.4.3 Dynamic Convolution Attention 14
3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17
3.1 Introduction 17
3.2 Background 19
3.2.1 Variational Autoencoder 19
3.2.2 Variational Recurrent Neural Network 20
3.3 Speech Synthesis Using AdVRNN 22
3.3.1 AdVRNN based Acoustic Modeling 23
3.3.2 Training Procedure 24
3.4 Experiments 25
3.4.1 Objective performance evaluation 28
3.4.2 Subjective performance evaluation 29
3.5 Summary 29
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31
4.1 Introduction 31
4.2 Background 33
4.2.1 Mutual Information 33
4.2.2 Mutual Information Neural Estimator 34
4.2.3 Global Style Token 34
4.3 Style Token end-to-end speech synthesis using MINE 35
4.4 Experiments 36
4.5 Summary 38
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45
5.1 Introduction 45
5.2 Background 48
5.3 Memory Attention 49
5.4 Experiments 52
5.4.1 Experiments on Single Speaker Speech Synthesis 53
5.4.2 Experiments on Emotional Speech Synthesis 56
5.5 Summary 59
6 Selective Multi-Attention for Style-Adaptive End-to-End Speech Synthesis 63
6.1 Introduction 63
6.2 Background 65
6.3 Selective multi-attention model 66
6.4 Experiments 67
6.4.1 Multi-speaker speech synthesis experiments 68
6.4.2 Experiments on Emotional Speech Synthesis 73
6.5 Summary 77
7 Conclusions 79
Bibliography 83
Abstract (in Korean) 93
Acknowledgements 95
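The selective multi-attention described in the thesis abstract above can be sketched as a softmax-weighted choice among candidate alignments; the candidates and selection logits here are made up, whereas in the thesis both come from learned networks.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selective_multi_attention(candidates, selection_logits):
    """Combine K candidate alignments (K x T) using weights from a
    selection network; here the selection logits are given directly."""
    sel = softmax(selection_logits)            # one weight per candidate
    return sel @ candidates                    # (T,) blended alignment

# Three hypothetical candidate alignments over 4 encoder positions.
candidates = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
sel_logits = np.array([0.2, 5.0, 0.1])         # selector strongly prefers #1
alignment = selective_multi_attention(candidates, sel_logits)
print(alignment.argmax())  # → 1
```

A sharp selection softmax approximates picking a single candidate, while a softer one interpolates between alignment paths, which is how the style-dependent variation can be realized.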
Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis
Neural Text-to-speech (TTS) synthesis is a powerful technology that can
generate speech using neural networks. One of the most remarkable features of
TTS synthesis is its capability to produce speech in the voice of different
speakers. This paper introduces voice-cloning
(https://pypi.org/project/voice-cloning/), an open-source Python package that
helps people with speech disorders communicate more effectively and supports
professionals seeking to integrate voice cloning or speech synthesis
capabilities into their projects. This package aims to generate synthetic
speech that sounds like the natural voice of an individual, but it does not
replace the natural human voice. The architecture of the system comprises a
speaker verification system, a synthesizer, a vocoder, and noise reduction.
The speaker verification system is trained on a varied set of speakers to
achieve optimal generalization performance without relying on transcriptions.
The synthesizer is trained on both audio and transcriptions to generate a Mel
spectrogram from text, and the vocoder converts the generated Mel spectrogram
into the corresponding audio signal. The audio signal is then processed
by a noise reduction algorithm to eliminate unwanted noise and enhance speech
clarity. The performance of synthesized speech from seen and unseen speakers
is then evaluated using subjective and objective measures such as Mean
Opinion Score (MOS), Gross Pitch Error (GPE), and Spectral Distortion (SD). The
model can create speech in distinct voices by including speaker characteristics
that are chosen randomly.
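Of the metrics listed above, Gross Pitch Error is simple to state concretely. This sketch uses one common definition, the fraction of frames voiced in both signals whose F0 deviates from the reference by more than 20%; exact conventions vary across papers, and the frame values below are made up.

```python
import numpy as np

def gross_pitch_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames voiced in both contours where the synthesized
    F0 deviates from the reference by more than `tol` (relative)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    voiced = (f0_ref > 0) & (f0_syn > 0)       # 0 marks unvoiced frames
    err = np.abs(f0_syn[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return float(np.mean(err > tol))

ref = [120, 125, 130, 0, 140, 150]             # Hz per frame, 0 = unvoiced
syn = [118, 190, 131, 0, 141, 100]             # two gross errors
print(gross_pitch_error(ref, syn))  # → 0.4 (2 of 5 voiced frames)
```

A lower GPE indicates that the synthesized pitch contour tracks the reference speaker's intonation more faithfully.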
- …