Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios has grown increasingly popular. Although many works in the
field of voice conversion share a common global pipeline, there is
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to
synthesize speech with a specific reference timbre or style that is never
trained in the target language. It encounters the following challenges: 1)
timbre and pronunciation are correlated since multilingual speech of a specific
speaker is usually hard to obtain; 2) style and pronunciation are mixed because
the speech style contains language-agnostic and language-specific parts. To
address these challenges, we propose GenerTTS, which mainly comprises the
following: 1) we carefully design a HuBERT-based information bottleneck
to disentangle timbre and pronunciation/style; 2) we minimize the mutual
information between style and language to discard the language-specific
information in the style embedding. The experiments indicate that GenerTTS
outperforms baseline systems in terms of style similarity and pronunciation
accuracy, and enables cross-lingual timbre and style generalization.
Comment: Accepted by INTERSPEECH 202
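The mutual-information minimisation described in the GenerTTS abstract can be sketched with a toy sample-based estimator in the spirit of the CLUB upper bound. Everything below is an illustrative assumption, not the paper's implementation: the synthetic "style" and "language" embeddings, the linear-Gaussian variational approximation q(y|x) fitted by least squares, and all variable names are invented for this sketch. The point is only to show that an MI upper bound drops once the language-predictable component is removed from the style code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed, not from the paper): a "style" embedding that
# partially leaks a "language" embedding through a random linear mixing.
n, d_style, d_lang = 512, 8, 2
lang = rng.normal(size=(n, d_lang))            # language embedding
mix = rng.normal(size=(d_lang, d_style))
style = rng.normal(size=(n, d_style)) + lang @ mix  # style leaks language info

def club_upper_bound(x, y):
    """CLUB-style sample estimate of an upper bound on I(x; y), using a
    linear-Gaussian q(y|x) fitted by ordinary least squares."""
    X = np.hstack([x, np.ones((len(x), 1))])   # add bias column
    W, *_ = np.linalg.lstsq(X, y, rcond=None)  # fit the mean of q(y|x)
    mu = X @ W
    var = np.mean((y - mu) ** 2) + 1e-8        # shared isotropic variance
    # positive pairs: log q(y_i | x_i); negatives: log q(y_j | x_i), j != i
    pos = -np.mean(np.sum((y - mu) ** 2, axis=1)) / (2 * var)
    neg = -np.mean(np.sum((y[None, :, :] - mu[:, None, :]) ** 2, axis=2)) / (2 * var)
    return pos - neg                           # upper-bounds the MI

mi_leaky = club_upper_bound(style, lang)

# Mimic the *effect* of MI minimisation during training by removing the
# linearly language-predictable component from the style code.
X = np.hstack([lang, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X, style, rcond=None)
style_clean = style - X @ W

mi_clean = club_upper_bound(style_clean, lang)
print(mi_leaky, mi_clean)  # the disentangled code scores a much lower bound
```

In GenerTTS itself the bound is minimised as a training loss on learned embeddings rather than applied post hoc as here; this sketch only illustrates why a smaller style-language MI corresponds to a more language-agnostic style embedding.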
Bootstrapping Non-Parallel Voice Conversion From Speaker-Adaptive Text-to-Speech
Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a
similar objective, generating speech with a target voice. However, they are
usually developed independently under vastly different frameworks. In this
paper, we propose a methodology to bootstrap a VC system from a pretrained
speaker-adaptive TTS model and unify the techniques as well as the
interpretations of these two tasks. Moreover, by offloading the heavy data
demand to the training stage of the TTS model, our VC system can be built using
a small amount of target speaker speech data. It also opens up the possibility
of using speech in a foreign unseen language to build the system. Our
subjective evaluations show that the proposed framework is able to not only
achieve competitive performance in the standard intra-language scenario but
also adapt and convert using speech utterances in an unseen language.
Comment: Accepted for IEEE ASRU 201