92 research outputs found
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is becoming increasingly popular. Although many works in the field of
voice conversion share a common global pipeline, there is considerable
diversity in the underlying structures, methods, and neural sub-blocks used
across research efforts. As a result, it can be challenging to obtain a
comprehensive understanding of why particular methods are chosen at each stage
of the voice conversion pipeline, and the actual hurdles faced by the proposed
solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion
This paper presents a refinement framework of WaveNet vocoders for
variational autoencoder (VAE) based voice conversion (VC), which reduces the
quality distortion caused by the mismatch between the training data and testing
data. Conventional WaveNet vocoders are trained with natural acoustic features
but conditioned on the converted features in the conversion stage for VC, and
such a mismatch often causes significant quality and similarity degradation. In
this work, we take advantage of the particular structure of VAEs to refine
WaveNet vocoders with the self-reconstructed features generated by the VAE,
which have characteristics similar to the converted features while sharing the
same temporal structure as the target natural features. We analyze these
features and show that the self-reconstructed features closely resemble the
converted features. Objective and subjective experimental results demonstrate
the effectiveness of our proposed framework.
Comment: 5 pages, 7 figures, 1 table. Accepted to EUSIPCO 201
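As a rough illustration of the refinement idea described above, the sketch below (PyTorch) fine-tunes a toy vocoder on features that a fixed VAE has self-reconstructed from natural acoustic features. The `SpeakerVAE` and `ToyVocoder` modules, their dimensions, and the L1 loss are illustrative assumptions only; the paper's actual system uses a WaveNet vocoder with its own training objective.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; names, dimensions, and the L1 loss are assumptions,
# not the paper's VAE-VC or WaveNet implementation.
class SpeakerVAE(nn.Module):
    """Toy frame-level VAE over acoustic features, conditioned on a speaker code."""
    def __init__(self, feat_dim=80, latent_dim=16, spk_dim=8):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * latent_dim)        # -> (mu, logvar)
        self.dec = nn.Linear(latent_dim + spk_dim, feat_dim)  # latent + speaker -> features

    def forward(self, feats, spk_code):
        mu, logvar = self.enc(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        z_spk = torch.cat([z, spk_code.expand(*z.shape[:-1], -1)], dim=-1)
        return self.dec(z_spk), mu, logvar

class ToyVocoder(nn.Module):
    """Placeholder for a WaveNet-style vocoder conditioned on acoustic features."""
    def __init__(self, feat_dim=80, hop=64):
        super().__init__()
        self.upsample = nn.Linear(feat_dim, hop)  # crude frame -> sample upsampling

    def forward(self, feats):
        return torch.tanh(self.upsample(feats)).flatten(start_dim=-2)  # (batch, frames*hop)

vae, vocoder = SpeakerVAE(), ToyVocoder()
opt = torch.optim.Adam(vocoder.parameters(), lr=1e-4)

# One refinement step: the vocoder is fine-tuned on self-reconstructed features,
# which resemble converted features but stay time-aligned with the natural waveform.
feats = torch.randn(4, 50, 80)    # natural acoustic features (batch, frames, dim)
wave = torch.randn(4, 50 * 64)    # corresponding natural waveform samples
src_spk = torch.randn(8)          # source speaker code (same speaker -> self-reconstruction)

with torch.no_grad():             # the trained VAE is kept fixed during refinement
    recon_feats, _, _ = vae(feats, src_spk)

opt.zero_grad()
pred_wave = vocoder(recon_feats)
loss = nn.functional.l1_loss(pred_wave, wave)  # stand-in for the actual vocoder loss
loss.backward()
opt.step()
```

The point of the self-reconstruction pass is that the features used for refinement share the VAE's characteristic smoothing with the converted features, yet remain time-aligned with the natural target waveform, so ordinary teacher-forced vocoder training still applies.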
An improved StarGAN for emotional voice conversion: enhancing voice quality and data augmentation
Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from the emotion-independent information that should be
preserved; instead, they transform everything in a monolithic manner,
generating low-quality audio with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both the objective and subjective evaluations of distortion, showing that it
can effectively reduce distortion. Furthermore, in data augmentation
experiments for end-to-end speech emotion recognition, the proposed StarGAN
model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 over the
baseline StarGAN model, indicating that it is more valuable for data augmentation.
Comment: Accepted by Interspeech 202
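A minimal sketch of the two-encoder generator idea, assuming hypothetical module names (`TwoEncoderGenerator`), GRU encoders and decoder, and feature dimensions that are not taken from the paper: one encoder carries the emotion-independent content, the other produces an utterance-level emotion embedding, and the decoder recombines them, so supplying a reference utterance in the target emotion converts the style while the content path is preserved.

```python
import torch
import torch.nn as nn

class TwoEncoderGenerator(nn.Module):
    """Autoencoder generator with separate content and emotion encoders (illustrative only)."""
    def __init__(self, feat_dim=80, content_dim=32, emo_dim=8):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)  # emotion-independent stream
        self.emotion_enc = nn.Linear(feat_dim, emo_dim)                     # frame-wise emotion features
        self.decoder = nn.GRU(content_dim + emo_dim, feat_dim, batch_first=True)

    def forward(self, feats, emo_feats=None):
        # Content comes from the source utterance; emotion comes from a reference
        # utterance (or from the source itself during reconstruction training).
        content, _ = self.content_enc(feats)                       # (batch, frames, content_dim)
        ref = feats if emo_feats is None else emo_feats
        emotion = self.emotion_enc(ref).mean(dim=1, keepdim=True)  # time-pooled emotion embedding
        emotion = emotion.expand(-1, content.size(1), -1)          # broadcast over source frames
        out, _ = self.decoder(torch.cat([content, emotion], dim=-1))
        return out

gen = TwoEncoderGenerator()
src = torch.randn(2, 100, 80)        # source utterance features (batch, frames, dim)
target_ref = torch.randn(2, 80, 80)  # reference utterance carrying the target emotion

recon = gen(src)                     # stage-one style self-reconstruction
converted = gen(src, target_ref)     # stage-two style emotion conversion
print(recon.shape, converted.shape)  # torch.Size([2, 100, 80]) for both
```

In the full system this generator would presumably be trained first as a plain autoencoder and then adversarially within the StarGAN framework; the discriminator, classifier, and their losses are omitted from this sketch.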