7 research outputs found
Error Reduction Network for DBLSTM-based Voice Conversion
To date, many deep learning approaches to voice conversion produce good-quality speech only when a large amount of training data is available. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We first propose a DBLSTM-based average model that is trained with data from many speakers. We then propose to adapt this average model with a limited amount of target data. Last but not least, we propose an error reduction network that improves the voice conversion quality even further. The proposed framework is motivated by three observations. Firstly, DBLSTM achieves remarkable voice conversion quality by modeling the long-term dependencies of a speech utterance. Secondly, a DBLSTM-based average model can easily be adapted with a small amount of data to produce speech that sounds closer to the target. Thirdly, an error reduction network can be trained with a small amount of data and effectively improves conversion quality. Experiments show that the proposed framework works well with limited training data and outperforms traditional frameworks in both objective and subjective evaluations.
Comment: Accepted by APSIPA 201
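To make the architecture concrete, below is a minimal sketch of a stacked bidirectional LSTM regression network in PyTorch that maps source spectral features to target spectral features. The layer sizes, feature dimension, and training step are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    """Sketch of a deep bidirectional LSTM voice conversion model:
    maps source spectral frames to target spectral frames.
    All dimensions are illustrative, not the paper's settings."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers,
                             bidirectional=True, batch_first=True)
        # Project concatenated forward/backward states to the feature dim.
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):        # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)     # h: (batch, frames, 2 * hidden_dim)
        return self.proj(h)

# Average-model training fits the network on many speakers' aligned
# frame pairs; adaptation would fine-tune the same weights (e.g., at a
# lower learning rate) on the limited target-speaker data.
model = DBLSTMConverter()
src = torch.randn(8, 200, 40)             # dummy batch of spectral frames
loss = nn.functional.mse_loss(model(src), torch.randn(8, 200, 40))
loss.backward()
```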
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. Prior studies on emotional voice conversion have mostly been carried out under the assumption that emotion is speaker-dependent. We consider that speakers share a common code for emotional expression in a spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN based encoder-decoder structure to learn the spectrum and prosody mapping. We perform prosody conversion by using the continuous wavelet transform (CWT) to model temporal dependencies. We also investigate the use of F0 as an additional input to the decoder to improve emotion conversion performance. Experiments show that the proposed speaker-independent framework achieves competitive results for both seen and unseen speakers.
Comment: Accepted by Interspeech 2020
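As a hedged illustration of the conditioning idea (not the paper's exact architecture), the sketch below shows a decoder that receives the phonetic latent code concatenated with an emotion embedding and frame-level F0, so that emotion and pitch can be supplied at generation time. All dimensions and the simple feed-forward decoder are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    """Sketch: decode spectral frames from a latent code, conditioned on
    an emotion embedding and frame-level F0 (dimensions are assumed)."""
    def __init__(self, z_dim=64, emo_dim=16, feat_dim=40, num_emotions=4):
        super().__init__()
        self.emo_embed = nn.Embedding(num_emotions, emo_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + emo_dim + 1, 256),   # +1 for scalar F0
            nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, z, emo_id, f0):
        # z: (batch, frames, z_dim); emo_id: (batch,); f0: (batch, frames, 1)
        emo = self.emo_embed(emo_id).unsqueeze(1).expand(-1, z.size(1), -1)
        return self.net(torch.cat([z, emo, f0], dim=-1))

dec = ConditionedDecoder()
out = dec(torch.randn(2, 100, 64), torch.tensor([0, 1]), torch.randn(2, 100, 1))
```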
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody so as to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more appropriate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that our proposed framework outperforms the baselines in both objective and subjective evaluations.
Comment: Accepted by Speaker Odyssey 2020 in Tokyo, Japan
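The CWT decomposition itself is easy to sketch. Below is a minimal NumPy version of the common ten-scale recipe: convolve the (interpolated, log-transformed, z-normalized) F0 contour with a Mexican-hat wavelet at dyadically spaced scales. The exact scales and normalization are assumptions; the paper defines its own decomposition.

```python
import numpy as np

def mexican_hat(t, scale):
    """Mexican-hat (Ricker) mother wavelet at a given scale."""
    x = t / scale
    return (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)

def cwt_f0(f0, num_scales=10, base_scale=1.0):
    """Decompose an F0 contour into dyadically spaced temporal scales.
    Low scales capture fast (syllable-level) movements, high scales
    capture slow (phrase-level) trends. Illustrative, not the paper's
    exact parameterization."""
    n = len(f0)
    out = np.empty((num_scales, n))
    for k in range(num_scales):
        scale = base_scale * 2.0 ** k
        half = min(int(5 * scale), (n - 1) // 2)   # truncated support
        t = np.arange(-half, half + 1, dtype=float)
        kernel = mexican_hat(t, scale) / np.sqrt(scale)
        out[k] = np.convolve(f0, kernel, mode="same")
    return out

f0 = np.random.randn(300)   # stand-in for a normalized log-F0 contour
scales = cwt_f0(f0)         # shape (10, 300), one row per temporal scale
```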
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is, however, not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content. By conditioning on singer identity and F0, the decoder generates output spectral features with an unseen target singer identity and improved F0 rendering. Experimental results show that the proposed framework achieves better performance than the baseline frameworks.
Comment: Accepted to APSIPA ASC 2020
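A hedged run-time sketch of the disentangle-then-recompose idea follows; the encoder, decoder, and vocoder here are placeholders for whatever components the trained system provides, not the paper's actual code.

```python
import torch

@torch.no_grad()
def convert_singer(encoder, decoder, vocoder, src_spec, src_f0, tgt_singer_id):
    """Convert source singing to a target singer's voice.
    The encoder strips singer identity and prosody from the spectra;
    the decoder recomposes them with the target singer embedding and
    the source F0 contour, which preserves the original melody."""
    z = encoder(src_spec)                         # singer-independent content
    tgt_spec = decoder(z, tgt_singer_id, src_f0)  # recompose with target identity
    return vocoder(tgt_spec, src_f0)              # waveform synthesis
```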
Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion have mainly focused on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so a linear method is insufficient for its conversion. We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody at different time resolutions. We also propose to train two CycleGAN pipelines for spectrum and prosody mapping, respectively. In this way, we eliminate the need for parallel data between any two languages and for any alignment techniques. Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To the best of our knowledge, this is the first study of prosody in cross-lingual voice conversion.
Comment: Accepted to APSIPA ASC 2020
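To make the training objective concrete, here is a generic sketch of the adversarial plus cycle-consistency losses that drive a CycleGAN; the generator and discriminator internals and the loss weighting are assumptions, not the paper's implementation. Two such pipelines would be trained, one on spectral features and one on CWT-F0 coefficients.

```python
import torch
import torch.nn as nn

def cyclegan_generator_loss(G_xy, G_yx, D_x, D_y, x, y, lam=10.0):
    """Generic CycleGAN generator objective for mappings X->Y and Y->X.
    Discriminators return realness logits; their own loss is computed
    separately in the usual alternating scheme."""
    bce = nn.functional.binary_cross_entropy_with_logits
    fake_y, fake_x = G_xy(x), G_yx(y)
    # Adversarial terms: each generator tries to fool its discriminator.
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)
    adv = bce(pred_y, torch.ones_like(pred_y)) + \
          bce(pred_x, torch.ones_like(pred_x))
    # Cycle-consistency terms: X -> Y -> X and Y -> X -> Y must
    # reconstruct the input, which removes the need for parallel data.
    cyc = nn.functional.l1_loss(G_yx(fake_y), x) + \
          nn.functional.l1_loss(G_xy(fake_x), y)
    return adv + lam * cyc
```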
VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines: one for spectrum conversion and another for prosody conversion. We train a spectral encoder that disentangles emotion and prosody (F0) information from spectral features; we also train a prosodic encoder that disentangles the emotion modulation of prosody (affective prosody) from linguistic prosody. At run-time, the decoder of the spectral VAW-GAN is conditioned on the output of the prosodic VAW-GAN. The vocoder takes the converted spectral and prosodic features to generate the target emotional speech. Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
Comment: Accepted by IEEE SLT 2021. arXiv admin note: text overlap with arXiv:2005.0702
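The run-time flow of the two pipelines can be sketched as below; every function name and signature is a placeholder standing in for the trained components, not the authors' code.

```python
import torch

@torch.no_grad()
def convert_emotion(spec_enc, spec_dec, pros_enc, pros_dec,
                    src_spec, src_cwt_f0, tgt_emotion_id):
    """Two-pipeline EVC sketch: convert affective prosody first, then
    condition the spectral decoder on the converted prosody."""
    z_pros = pros_enc(src_cwt_f0)                # emotion-independent prosody
    conv_f0 = pros_dec(z_pros, tgt_emotion_id)   # converted affective prosody
    z_spec = spec_enc(src_spec)                  # emotion-independent content
    conv_spec = spec_dec(z_spec, tgt_emotion_id, conv_f0)
    return conv_spec, conv_f0                    # handed to the vocoder
```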
Expressive TTS Training with Frame and Style Reconstruction Loss
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of a reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations in the training data. It does not attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) a frame-level reconstruction loss, calculated between the synthesized and target spectral features; and 2) an utterance-level style reconstruction loss, calculated between the deep style features of the synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
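The two-term objective is straightforward to sketch. Below, a frame-level L1 reconstruction loss is combined with an utterance-level style loss computed on deep features of a style encoder; the encoder, the choice of L1 for both terms, and the weighting factor are assumptions, since the paper defines its own style features and weighting.

```python
import torch
import torch.nn as nn

def expressive_tts_loss(pred_mel, tgt_mel, style_encoder, alpha=1.0):
    """Sketch of frame + style reconstruction training.
    `style_encoder` stands for a pre-trained, frozen network whose deep
    features summarize utterance-level speaking style."""
    # 1) Frame-level reconstruction loss on spectral features.
    frame_loss = nn.functional.l1_loss(pred_mel, tgt_mel)
    # 2) Utterance-level style (perceptual) loss on deep style features;
    #    gradients flow only through the synthesized branch.
    with torch.no_grad():
        tgt_style = style_encoder(tgt_mel)
    style_loss = nn.functional.l1_loss(style_encoder(pred_mel), tgt_style)
    return frame_loss + alpha * style_loss
```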