71 research outputs found
Parallel and Limited Data Voice Conversion Using Stochastic Variational Deep Kernel Learning
Typically, voice conversion is regarded as an engineering problem with
limited training data. The reliance on massive amounts of data hinders the
practical applicability of deep learning approaches, which have been
extensively researched in recent years. On the other hand, statistical methods
are effective with limited data but have difficulties in modelling complex
mapping functions. This paper proposes a voice conversion method that works
with limited data and is based on stochastic variational deep kernel learning
(SVDKL). SVDKL combines the expressive capability of deep neural networks
with the high flexibility of the Gaussian process as a Bayesian,
non-parametric method. When the conventional kernel is
combined with the deep neural network, it is possible to estimate non-smooth
and more complex functions. Furthermore, the model's sparse variational
Gaussian process solves the scalability problem and, unlike the exact Gaussian
process, allows for the learning of a global mapping function for the entire
acoustic space. One of the most important aspects of the proposed scheme is
that the model parameters are trained using marginal likelihood optimization,
which considers both data fitting and model complexity. Accounting for model
complexity increases resistance to overfitting, which in turn reduces the
amount of training data required. To evaluate the proposed scheme, we examined the
model's performance with approximately 80 seconds of training data. The results
indicated that our method obtained a higher mean opinion score, smaller
spectral distortion, and better preference test results than the compared methods.
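The trade-off between data fitting and model complexity mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses an exact Gaussian process rather than the sparse variational one described above, and the network sizes, weights, function names (`features`, `deep_rbf_kernel`, `log_marginal_likelihood`), and toy data are all assumptions made for illustration. The kernel is a standard RBF applied to the output of a small neural network, and the log marginal likelihood is split into its data-fit and complexity terms.

```python
import numpy as np

def features(x, w1, w2):
    """Tiny two-layer network g(x); the weights are illustrative, not from the paper."""
    return np.tanh(np.tanh(x @ w1) @ w2)

def deep_rbf_kernel(xa, xb, w1, w2, lengthscale=1.0):
    """RBF kernel evaluated on network features: k(g(xa), g(xb))."""
    fa, fb = features(xa, w1, w2), features(xb, w1, w2)
    sq = np.sum(fa**2, 1)[:, None] + np.sum(fb**2, 1)[None, :] - 2.0 * fa @ fb.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def log_marginal_likelihood(x, y, w1, w2, noise=0.1):
    """Return (total, data_fit, complexity) for an exact GP with a deep kernel."""
    n = x.shape[0]
    k = deep_rbf_kernel(x, x, w1, w2) + noise**2 * np.eye(n)
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    data_fit = -0.5 * float(y @ alpha)                   # -1/2 y^T K^{-1} y
    complexity = -float(np.sum(np.log(np.diag(chol))))   # -1/2 log|K|
    const = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + const, data_fit, complexity

# Toy data standing in for acoustic features (assumption, not real speech data).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = np.sin(x[:, 0])
w1 = rng.normal(size=(3, 8))
w2 = rng.normal(size=(8, 2))
total, fit, comp = log_marginal_likelihood(x, y, w1, w2)
```

Maximizing `total` with respect to the network weights and kernel hyperparameters rewards a good fit through the first term while the log-determinant term penalizes overly flexible kernels, which is the general mechanism behind the overfitting resistance the abstract refers to.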
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is becoming increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.